# Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
**Authors**: Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu, Chengkun Wei, Wenzhi Chen
## Abstract
Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur a substantial memory footprint and computational overhead, bottlenecking LRMs’ efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncovers an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency. Across six benchmarks, DynTS surpasses state-of-the-art KV cache compression methods, improving Pass@1 by $2.6\%$ under the same budget. Compared to vanilla Transformers, it reduces inference latency by $1.84–2.62\times$ and peak KV-cache memory footprint by $3.32–5.73\times$ without compromising LRMs’ reasoning performance. The code is available at https://github.com/Robin930/DynTS.
**Keywords:** KV Cache Compression, Efficient LRM, LLM
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of Token Handling in Different Methods
### Overview
The image is a diagram comparing how different methods (Transformers, SnapKV, StreamingLLM, H2O, and DynTS) handle tokens during processing. Each method is represented by a horizontal row of boxes representing tokens, with different colors indicating different types of tokens or their importance. Arrows in the DynTS row indicate predicted importance of tokens to the final answer.
### Components/Axes
The diagram is organized into three columns:
* **Methods:** Lists the different methods being compared: Transformers, SnapKV, StreamingLLM, H2O, and DynTS.
* **Tokens:** A visual representation of tokens for each method, using colored boxes. Each row represents a method.
* **Keeps:** A legend explaining the color coding of the tokens in each method.
### Detailed Analysis or Content Details
**1. Transformers:**
* All tokens are represented by gray boxes.
* The "Keeps" label indicates "All Tokens" are kept.
* The tokens extend to the right with an ellipsis (...), indicating a potentially large number of tokens.
**2. SnapKV:**
* The first few tokens are orange, representing "High Importance Prefill Tokens".
* The subsequent tokens are gray.
* A label "Prompt" spans the orange tokens.
* A label "Observation Window" spans a section of the gray tokens.
* The tokens extend to the right with an ellipsis (...).
**3. StreamingLLM:**
* The first token is yellow, representing "Attention Sink Tokens".
* The subsequent tokens are light blue, representing "Local Tokens".
* The tokens extend to the right with an ellipsis (...).
**4. H2O:**
* The first few tokens are green, representing "Heavy-Hitter Tokens".
* The subsequent tokens are light blue, representing "Local Tokens".
* The tokens extend to the right with an ellipsis (...).
**5. DynTS:**
* The first few tokens are red, representing "Predicted Importance Tokens".
* The subsequent tokens are light blue, representing "Local Tokens".
* Curved arrows originate from the red tokens and point towards a box labeled "Answer", indicating the predicted importance of these tokens to the final answer.
* The tokens extend to the right with an ellipsis (...).
### Key Observations
* Transformers process all tokens equally.
* SnapKV prioritizes "High Importance Prefill Tokens" during the prompt phase.
* StreamingLLM uses "Attention Sink Tokens" followed by "Local Tokens".
* H2O identifies and prioritizes "Heavy-Hitter Tokens" followed by "Local Tokens".
* DynTS dynamically predicts the importance of tokens and focuses on those deemed most relevant to the answer.
* The use of ellipsis (...) suggests that the token sequences can be much longer than depicted.
### Interpretation
The diagram illustrates different strategies for managing tokens in language models. Traditional Transformers process all tokens uniformly, which can be computationally expensive. The other methods (SnapKV, StreamingLLM, H2O, and DynTS) attempt to improve efficiency by selectively focusing on the most important tokens.
SnapKV focuses on the initial prompt tokens. StreamingLLM uses attention sink tokens to manage context. H2O identifies key tokens within the sequence. DynTS dynamically assesses token importance and prioritizes those contributing most to the final answer.
The arrows in DynTS visually emphasize the concept of attention and how the model weighs different tokens when generating a response. The diagram suggests a trend towards more sophisticated token management techniques to improve the performance and scalability of language models. The diagram does not provide quantitative data, but rather a qualitative comparison of different approaches.
</details>
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. KV Cache Length for Different Models
### Overview
This image presents a bar chart comparing the accuracy and KV cache length of several models: Transformers, DynTS, Window StreamingLLM, SepLLM, H2O, SnapKV, and R-KV. Accuracy is represented by the height of the bars, while KV cache length is indicated by a dashed line with markers.
### Components/Axes
* **X-axis:** Model names (Transformers, DynTS, Window StreamingLLM, SepLLM, H2O, SnapKV, R-KV).
* **Y-axis (left):** Accuracy (%) - Scale ranges from 0 to 70.
* **Y-axis (right):** KV Cache Length (k) - Scale ranges from 2k to 20k.
* **Legend:**
* Gray bars: Accuracy
* Blue dashed line with markers: KV Cache Length
* **Data Series:**
* Accuracy for each model.
* KV Cache Length for each model.
### Detailed Analysis
The chart displays the following data:
* **Transformers:** Accuracy ≈ 63.6%, KV Cache Length ≈ 63.6k.
* **DynTS:** Accuracy ≈ 63.5%, KV Cache Length ≈ 5k.
* **Window StreamingLLM:** Accuracy ≈ 49.4%, KV Cache Length ≈ 5k.
* **SepLLM:** Accuracy ≈ 51.6%, KV Cache Length ≈ 5k.
* **H2O:** Accuracy ≈ 54.5%, KV Cache Length ≈ 5k.
* **SnapKV:** Accuracy ≈ 58.8%, KV Cache Length ≈ 5k.
* **R-KV:** Accuracy ≈ 59.8%, KV Cache Length ≈ 5k.
**Trends:**
* Accuracy generally increases from Window StreamingLLM to R-KV, with a significant jump from Window StreamingLLM to SepLLM.
* KV Cache Length remains relatively constant at approximately 5k for all models except Transformers and DynTS.
* Transformers and DynTS have significantly higher accuracy than the other models, but also have much larger KV Cache Lengths.
### Key Observations
* Transformers and DynTS exhibit the highest accuracy, but at the cost of a substantially larger KV cache length.
* The remaining models (Window StreamingLLM, SepLLM, H2O, SnapKV, R-KV) have similar KV cache lengths, but varying levels of accuracy.
* There is a clear trade-off between accuracy and KV cache length.
### Interpretation
The data suggests that achieving high accuracy in these models requires a larger KV cache. However, the other models demonstrate that reasonable accuracy can be achieved with a significantly smaller cache size. This could be important in resource-constrained environments where memory is limited.
The large difference in KV cache length between Transformers/DynTS and the other models indicates a different architectural approach or optimization strategy. Transformers and DynTS may store more information in the KV cache to achieve higher accuracy, while the other models prioritize efficiency.
The relatively flat KV cache length across most models suggests that this parameter may be limited by a system constraint or design choice, rather than being directly optimized for each model. The accuracy differences within this constraint show the effectiveness of different model architectures.
The data points to a potential area for further research: exploring methods to reduce the KV cache length of high-accuracy models like Transformers without significantly sacrificing performance.
</details>
Figure 1: (Left) Comparison of token selection strategies across different KV cache eviction methods. In each row, colored blocks denote the retained high-importance tokens, while grey blocks represent the evicted tokens during LRM inference. (Right) The average reasoning performance and KV cache memory footprint of DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across six reasoning benchmarks.
Recent advancements in Large Reasoning Models (LRMs) (Chen et al., 2025) have significantly strengthened the reasoning capabilities of Large Language Models (LLMs). Representative models such as DeepSeek-R1 (Guo et al., 2025), Gemini-3-Pro (DeepMind, 2025), and ChatGPT-5.2 (OpenAI, 2025) support a deep-thinking mode to strengthen reasoning capability on challenging mathematics, programming, and science tasks (Zhang et al., 2025b). These models spend a substantial number of intermediate thinking tokens on reflection, reasoning, and verification to derive the correct response during inference (Feng et al., 2025). However, the thinking process entails an immense KV cache memory footprint and attention-related computational cost, posing a critical deployment challenge in resource-constrained environments.
KV cache compression techniques aim to optimize the cache state by periodically evicting non-essential tokens (Shi et al., 2024; WEI et al., 2025; Liu et al., 2025b; Qin et al., 2025), typically guided by predefined token retention rules (Chen et al., 2024; Xiao et al., 2024; Devoto et al., 2024) or attention-based importance metrics (Zhang et al., 2023; Li et al., 2024; Choi et al., 2025). Nevertheless, incorporating them into the inference process of LRMs faces two key limitations: (1) methods designed for long-context prefilling are ill-suited to the short-prefill, long-decoding scenarios of LRMs; (2) methods tailored for long decoding struggle to match the reasoning performance of the Full KV baseline (SOTA $60.9\%$ vs. Full KV $63.6\%$, Fig. 1 Right). Specifically, in LRM inference, the model conducts an extensive reasoning process and then summarizes the reasoning content to derive the final answer (Minegishi et al., 2025). This implies that the correctness of the final answer relies on the thinking tokens within the preceding reasoning (Bogdan et al., 2025). However, existing compression methods cannot identify the tokens that are essential to the future answer. This leads to a significant misalignment between the retained tokens and the critical thinking tokens, degrading the model’s reasoning performance.
To address this issue, we analyze the LRM’s generated content and study which tokens are most important for steering the model toward the final answer. Prior work points out that attention weights capture inter-token dependencies (Vaswani et al., 2017; Wiegreffe and Pinter, 2019; Bogdan et al., 2025), so they can serve as a metric to assess the importance of tokens. Consequently, we decompose the generated content into a reasoning trace and a final answer, and then calculate the importance score of each thinking token in the trajectory by aggregating the attention weights from the answer to the thinking tokens. We find that only a small subset of thinking tokens ($\sim 20\%$ of the tokens in the reasoning trace, see Section § 3.1) have significant scores, suggesting they are critical for the final answer. To validate this hypothesis, we retain only these tokens and prompt the model to directly generate the final answer. Experimental results show that the model maintains accuracy close to that obtained with the whole KV cache. This reveals a Pareto principle (also known as the 80/20 rule: $20\%$ of critical factors drive $80\%$ of the outcomes) in LRMs: only a small subset of decision-critical thinking tokens with high importance scores drives the model toward the final answer, while the remaining tokens contribute negligibly.
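The scoring scheme described above can be sketched in a few lines. The snippet below is our own simplification for illustration (names such as `attn`, `think_span`, and `answer_span` are assumptions, not from the paper's code): each thinking token's score is the attention it receives, averaged over the answer tokens.

```python
# Illustrative sketch (not the authors' implementation): score each thinking
# token by the mean attention weight it receives from the answer tokens,
# assuming `attn` has already been averaged over layers and heads.

def importance_scores(attn, think_span, answer_span):
    """attn[q][k]: attention weight from query position q to key position k.
    Returns one score per thinking token."""
    t0, t1 = think_span      # [t0, t1) thinking-token positions
    a0, a1 = answer_span     # [a0, a1) answer-token positions
    n_answer = a1 - a0
    return [
        sum(attn[q][k] for q in range(a0, a1)) / n_answer
        for k in range(t0, t1)
    ]

# Toy example: 6 positions, thinking tokens at 0-3, answer tokens at 4-5.
attn = [
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.4, 0.1, 0.3, 0.1, 0.1, 0.0],    # attention row of answer token 4
    [0.6, 0.1, 0.1, 0.1, 0.05, 0.05],  # attention row of answer token 5
]
scores = importance_scores(attn, (0, 4), (4, 6))
# Thinking token 0 receives the most answer attention -> highest score.
```

In this toy example only the first thinking token scores highly, mirroring the sparsity observed in the paper: most thinking tokens receive negligible attention from the answer.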
Based on the above insight, we introduce DynTS (Dynamic Thinking-Token Selection), a novel method for dynamically predicting and selecting decision-critical thinking tokens on-the-fly during decoding, as shown in Fig. 1 (Left). The key innovation of DynTS is the integration of a trainable, lightweight Importance Predictor at the final layer of LRMs, enabling the model to dynamically predict the importance of each thinking token to the final answer. By utilizing importance scores derived from sampled reasoning traces as supervision signals, the predictor learns to distinguish between critical tokens and redundant tokens. During inference, DynTS manages memory through a dual-window mechanism: generated tokens flow from a Local Window (which captures recent context) into a Selection Window (which stores long-term history). Once the KV cache reaches the budget, the system retains the KV cache of tokens with higher predicted importance scores in the Selection Window and all tokens in the Local Window (Zhang et al., 2023; Chen et al., 2024). By evicting redundant KV cache entries, DynTS effectively reduces both system memory pressure and computational overhead. We also theoretically analyze the computational overhead introduced by the importance predictor and the savings from cache eviction, and derive a Break-Even Condition for net computational gain.
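A minimal sketch of how such a dual-window eviction policy could work is given below. This is our own simplification under stated assumptions, not the authors' implementation: `DualWindowCache` is a hypothetical name, and the scalar `score` stands in for the Importance Predictor's output.

```python
# Illustrative sketch of a dual-window eviction policy: new tokens enter the
# Local Window; the oldest local token overflows into the Selection Window;
# once the total cache exceeds the budget, the lowest-scoring Selection-Window
# entry is evicted. Local-Window tokens are always retained.
from collections import deque

class DualWindowCache:
    def __init__(self, budget, local_size):
        self.budget = budget
        self.local = deque(maxlen=local_size)  # recent (token_id, score)
        self.selection = []                    # long-term (token_id, score)

    def add(self, token_id, score):
        if len(self.local) == self.local.maxlen:
            # Oldest local token graduates into the Selection Window.
            self.selection.append(self.local.popleft())
        self.local.append((token_id, score))
        # Evict lowest-importance Selection entries once over budget.
        while len(self.local) + len(self.selection) > self.budget:
            victim = min(self.selection, key=lambda e: e[1])
            self.selection.remove(victim)

    def retained(self):
        return [t for t, _ in self.selection] + [t for t, _ in self.local]

cache = DualWindowCache(budget=6, local_size=3)
for tok, s in enumerate([0.9, 0.1, 0.5, 0.2, 0.8, 0.3, 0.7, 0.05]):
    cache.add(tok, s)
# Low-scoring history (e.g. token 1, score 0.1) has been evicted, while the
# three most recent tokens survive regardless of their predicted scores.
```

The design choice to exempt the Local Window from eviction mirrors the observation that recent context is needed for fluent continuation, while the Selection Window is filtered purely by predicted importance.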
We then train the Importance Predictor on the MATH (Hendrycks et al., 2021) training set and evaluate DynTS on six other reasoning benchmarks. Fig. 1 (Right) compares the reasoning performance and KV cache length against the SOTA KV cache compression methods. Our method reduces the KV cache memory footprint by $3.32–5.73\times$ without compromising reasoning performance compared to the full-cache Transformer baseline. Within the same budget, our method achieves a $2.6\%$ improvement in accuracy over the SOTA KV cache compression approach.
## 2 Preliminaries
#### Large Reasoning Model (LRM).
Unlike standard LLMs that directly generate answers, LRMs incorporate an intermediate reasoning process prior to producing the final answer (Chen et al., 2025; Zhang et al., 2025a; Sui et al., 2025). Given a user prompt $\mathbf{x}=(x_{1},\dots,x_{M})$, the model’s generated content, denoted $\mathbf{y}$, can be decomposed into a reasoning trace $\mathbf{t}$ and a final answer $\mathbf{a}$. The trajectory is delimited by a start tag <think> and an end tag </think>. Formally, the model output is defined as:
$$
\mathbf{y}=[\texttt{<think>},\mathbf{t},\texttt{</think>},\mathbf{a}], \tag{1}
$$
where the trajectory $\mathbf{t}=(t_{1},\dots,t_{L})$ is composed of $L$ thinking tokens, and $\mathbf{a}=(a_{1},\dots,a_{K})$ represents the answer composed of $K$ tokens. During autoregressive generation, the model conducts a reasoning phase that produces thinking tokens $t_{i}$, followed by an answer phase that generates answer tokens $a_{j}$. This process is formally defined as:
$$
P(\mathbf{y}|\mathbf{x})=\underbrace{\prod_{i=1}^{L}P(t_{i}|\mathbf{x},\mathbf{t}_{<i})}_{\text{Reasoning Phase}}\cdot\underbrace{\prod_{j=1}^{K}P(a_{j}|\mathbf{x},\mathbf{t},\mathbf{a}_{<j})}_{\text{Answer Phase}} \tag{2}
$$
Since the length of the reasoning trace significantly exceeds that of the final answer ( $L\gg K$ ) (Xu et al., 2025), we focus on selecting critical thinking tokens in the reasoning trace to reduce memory and computational overhead.
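The decomposition in Equation 1 amounts to splitting the generated text at the delimiters. The helper below is a small illustrative sketch (not taken from the DynTS codebase) of recovering $\mathbf{t}$ and $\mathbf{a}$ from a delimited generation:

```python
# Illustrative sketch: split a generation y = [<think>, t, </think>, a]
# into the reasoning trace t and the final answer a, as in Eq. (1).

def split_output(text):
    """Return (reasoning_trace, final_answer) from a delimited generation."""
    start, end = "<think>", "</think>"
    i = text.index(start) + len(start)
    j = text.index(end)
    return text[i:j].strip(), text[j + len(end):].strip()

y = "<think>2+2 is 4 because 2 doubled is 4.</think>The answer is 4."
trace, answer = split_output(y)
```

Since $L\gg K$, nearly all of the KV cache built during generation belongs to `trace`, which is exactly the part DynTS targets for compression.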
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Importance Score over Steps
### Overview
The image presents a line chart illustrating the "Importance Score" over "Step" number. The chart appears to represent the evolution of importance during a process divided into "Question" and "Thinking" phases. The Importance Score is plotted on the y-axis, ranging from "Low" to "High", while the Step number is plotted on the x-axis, ranging from 0 to approximately 12,500. A horizontal dashed line indicates the mean importance score.
### Components/Axes
* **X-axis:** "Step" - ranging from 0 to approximately 12,500.
* **Y-axis:** "Importance Score" - labeled with "Low" at the bottom and "High" at the top. The scale is not explicitly numerical, but represents a relative importance level.
* **Data Series:** A single blue line representing the Importance Score over time.
* **Annotations:**
* "Question" - Label above the initial portion of the chart (Steps 0-200 approximately).
* "Thinking" - Label above the remaining portion of the chart (Steps 200-12,500 approximately).
* "Mean Score: 0.126; Ratio: 0.211" - Text annotation positioned near the center of the chart.
* **Horizontal Line:** A dashed red line representing the mean Importance Score.
### Detailed Analysis
The blue line representing the Importance Score exhibits high variability.
* **Question Phase (Steps 0-200):** The line starts at a low Importance Score and rapidly increases to a high level within the first 200 steps. The line fluctuates significantly within this range.
* **Thinking Phase (Steps 200-12,500):** After the initial peak, the Importance Score generally decreases to a lower level, but continues to fluctuate considerably. There are several peaks and troughs throughout this phase. The line generally remains closer to the mean score than in the Question phase.
* **Mean Score:** The horizontal dashed red line indicates a mean Importance Score of approximately 0.126.
* **Ratio:** The ratio is given as 0.211, but its meaning is not explicitly defined in the image.
The data points are too dense to extract precise values, but the following observations can be made:
* Around Step 5000, there is a prominent peak in the Importance Score.
* Around Step 6500, there is another significant peak.
* The Importance Score generally remains below the mean score for the majority of the "Thinking" phase, but with frequent excursions above it.
### Key Observations
* The Importance Score is significantly higher during the "Question" phase compared to the "Thinking" phase.
* The "Thinking" phase is characterized by a more stable, but still fluctuating, Importance Score.
* The ratio of 0.211 may represent the proportion of time the Importance Score is above the mean.
### Interpretation
The chart suggests a process that begins with a focused "Question" phase where the Importance Score is high, indicating a strong signal or relevance. This is followed by a "Thinking" phase where the Importance Score is generally lower and more variable, suggesting a period of exploration and refinement. The fluctuations in the "Thinking" phase likely represent different ideas or considerations being evaluated. The mean Importance Score provides a baseline for assessing the overall relevance of the process. The ratio of 0.211 could indicate that 21.1% of the time during the process, the importance score is above the mean.
The sharp transition between the "Question" and "Thinking" phases suggests a distinct shift in the nature of the process. The high variability in the "Thinking" phase indicates a complex and dynamic process where the importance of different factors changes over time. The chart could be used to evaluate the effectiveness of the process by analyzing the distribution of Importance Scores and identifying areas where the signal is consistently strong or weak.
</details>
Figure 2: Importance scores of question tokens and thinking tokens in a reasoning trace, computed based on attention contributions to the answer. Darker colors indicate higher importance. The red dashed line shows the mean importance score, and the annotated ratio indicates the fraction of tokens with importance above the mean.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio for Different Data Distributions
### Overview
This image presents a line chart illustrating the relationship between 'Ratio (%)' and 'Accuracy (%)' for four different data distributions: 'Full', 'Random', 'Bottom', and 'Top'. The chart aims to compare the performance (accuracy) of a model or system as the ratio of a certain data characteristic changes.
### Components/Axes
* **X-axis:** 'Ratio (%)' - Ranges from 2% to 50%, with markers at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
* **Y-axis:** 'Accuracy (%)' - Ranges from approximately 60% to 96%, with markers at 60, 65, 70, 75, 80, 85, 90, 95.
* **Legend:** Located in the top-right corner, identifying the four data series:
* 'Full' - Represented by a grey dashed line with 'x' markers.
* 'Random' - Represented by a green solid line with triangle markers.
* 'Bottom' - Represented by a blue solid line with square markers.
* 'Top' - Represented by a red solid line with circle markers.
### Detailed Analysis
Here's a breakdown of each data series, noting trends and approximate values:
* **Full (Grey, 'x' markers):** The line is relatively flat, hovering around 95% accuracy across all ratios.
* Ratio 2%: Accuracy ≈ 95.5%
* Ratio 4%: Accuracy ≈ 95.5%
* Ratio 6%: Accuracy ≈ 95%
* Ratio 8%: Accuracy ≈ 95%
* Ratio 10%: Accuracy ≈ 95%
* Ratio 20%: Accuracy ≈ 95.5%
* Ratio 30%: Accuracy ≈ 95.5%
* Ratio 40%: Accuracy ≈ 95.5%
* Ratio 50%: Accuracy ≈ 95.5%
* **Random (Green, triangle markers):** The line shows an increasing trend, starting low and rising significantly towards the higher ratios.
* Ratio 2%: Accuracy ≈ 62%
* Ratio 4%: Accuracy ≈ 58%
* Ratio 6%: Accuracy ≈ 63%
* Ratio 8%: Accuracy ≈ 65%
* Ratio 10%: Accuracy ≈ 67%
* Ratio 20%: Accuracy ≈ 70%
* Ratio 30%: Accuracy ≈ 72%
* Ratio 40%: Accuracy ≈ 80%
* Ratio 50%: Accuracy ≈ 85%
* **Bottom (Blue, square markers):** The line exhibits a generally increasing trend, but with more fluctuations than the 'Top' line.
* Ratio 2%: Accuracy ≈ 66%
* Ratio 4%: Accuracy ≈ 65%
* Ratio 6%: Accuracy ≈ 62%
* Ratio 8%: Accuracy ≈ 63%
* Ratio 10%: Accuracy ≈ 65%
* Ratio 20%: Accuracy ≈ 69%
* Ratio 30%: Accuracy ≈ 71%
* Ratio 40%: Accuracy ≈ 74%
* Ratio 50%: Accuracy ≈ 80%
* **Top (Red, circle markers):** The line shows a strong increasing trend initially, then plateaus at a high accuracy level.
* Ratio 2%: Accuracy ≈ 88%
* Ratio 4%: Accuracy ≈ 90%
* Ratio 6%: Accuracy ≈ 92%
* Ratio 8%: Accuracy ≈ 92%
* Ratio 10%: Accuracy ≈ 93%
* Ratio 20%: Accuracy ≈ 94%
* Ratio 30%: Accuracy ≈ 95%
* Ratio 40%: Accuracy ≈ 95%
* Ratio 50%: Accuracy ≈ 95%
### Key Observations
* The 'Full' distribution consistently achieves the highest accuracy, remaining stable across all ratios.
* The 'Top' distribution shows the fastest initial increase in accuracy with increasing ratio, quickly reaching a plateau.
* The 'Random' distribution starts with the lowest accuracy but demonstrates the most significant improvement as the ratio increases.
* The 'Bottom' distribution shows a moderate increase in accuracy, but remains lower than 'Top' and 'Full' across all ratios.
### Interpretation
The data suggests that the distribution of the data significantly impacts the accuracy of the system being evaluated. A 'Full' distribution, presumably representing a balanced dataset, yields the best and most consistent performance. Focusing on the 'Top' portion of the data provides a quick initial boost in accuracy, but gains diminish as the ratio increases. The 'Random' distribution indicates that the system benefits from a more representative sample, as accuracy improves with a higher ratio. The 'Bottom' distribution consistently underperforms, suggesting that this portion of the data is less informative or more challenging for the system.
The plateau observed in the 'Full' and 'Top' distributions suggests a point of diminishing returns – increasing the ratio beyond a certain point does not lead to further significant improvements in accuracy. The steep climb of the 'Random' distribution highlights the importance of data representation for effective model training or system performance. The differences between the lines suggest that the system is sensitive to the characteristics of the data it is processing.
</details>
<details>
<summary>x5.png Details</summary>

### Visual Description
## Radar Chart: Performance Comparison Across Datasets
### Overview
The image presents a radar chart comparing the performance of four different data sampling strategies – Full, Bottom, Random, and Top – across six datasets: AMC23, AIME25, AIME24, GPQA-D, GAOKAO2023EN, and MATH500. The chart uses a radial layout with values ranging from approximately 0 to 100, indicated by concentric circles.
### Components/Axes
* **Datasets (Axes):** AMC23, AIME25, AIME24, GPQA-D, GAOKAO2023EN, MATH500. These are evenly spaced around the circular chart.
* **Performance Scale (Radial Axis):** Concentric circles representing values from 0 to 100, in increments of 20.
* **Legend:** Located in the top-right corner, identifying the data series:
* Full (Grey, 'x' marker)
* Bottom (Blue, square marker)
* Random (Green, square marker)
* Top (Red, circle marker)
### Detailed Analysis
The chart displays the performance of each sampling strategy on each dataset as a polygon connecting the performance values.
* **Top (Red):** This line generally exhibits the highest performance, peaking at approximately 90 on the AIME25 axis. It dips to around 20 on the GAOKAO2023EN axis. The trend is highly variable, with significant peaks and troughs.
* AMC23: ~70
* AIME25: ~90
* AIME24: ~60
* GPQA-D: ~40
* GAOKAO2023EN: ~20
* MATH500: ~30
* **Random (Green):** This line shows moderate performance, generally lower than "Top" but higher than "Bottom". It has a relatively smooth profile.
* AMC23: ~40
* AIME25: ~60
* AIME24: ~50
* GPQA-D: ~40
* GAOKAO2023EN: ~40
* MATH500: ~40
* **Bottom (Blue):** This line consistently shows the lowest performance across all datasets, remaining generally below 40. It has a relatively flat profile.
* AMC23: ~20
* AIME25: ~30
* AIME24: ~30
* GPQA-D: ~30
* GAOKAO2023EN: ~30
* MATH500: ~30
* **Full (Grey):** This line shows intermediate performance, generally between "Random" and "Bottom". It has a somewhat irregular profile.
* AMC23: ~40
* AIME25: ~50
* AIME24: ~40
* GPQA-D: ~40
* GAOKAO2023EN: ~40
* MATH500: ~40
### Key Observations
* The "Top" sampling strategy consistently outperforms the others on most datasets, particularly AIME25 and AMC23.
* The "Bottom" sampling strategy consistently underperforms across all datasets.
* GAOKAO2023EN appears to be the most challenging dataset for all sampling strategies, resulting in the lowest performance scores.
* AIME25 appears to be the easiest dataset, with the highest performance scores.
* The performance differences between "Random" and "Full" are relatively small.
### Interpretation
The radar chart suggests that selecting the "Top" performing samples yields the best results across the evaluated datasets. This could indicate that the most challenging or informative samples are crucial for achieving high performance. Conversely, selecting the "Bottom" performing samples consistently leads to the worst results, suggesting these samples are less representative or contain less valuable information. The consistent low performance on GAOKAO2023EN suggests this dataset possesses unique characteristics that make it difficult for all sampling strategies to effectively capture its underlying patterns. The relatively similar performance of "Random" and "Full" suggests that, for these datasets, a random sample provides comparable results to using the entire dataset, potentially offering a computational efficiency benefit. The chart highlights the importance of sample selection in influencing model performance and suggests that a targeted approach (e.g., "Top" sampling) can significantly improve results.
</details>
Figure 3: (Left) Reasoning performance trends as a function of thinking token retention ratio, where the $x$ -axis indicates the retention percentage and the $y$ -axis is the accuracy. (Right) Accuracy across all datasets when retaining $30\%$ of the thinking tokens.
#### Attention Mechanism.
Attention Mechanism is a core component of Transformer-based LRMs, such as Multi-Head Attention (Vaswani et al., 2017), Grouped-Query Attention (Ainslie et al., 2023), and their variants. To highlight the memory challenges in LRMs, we formulate the attention computation at the token level. Consider the decode step $t$ . Let $\mathbf{h}_{t}\in\mathbb{R}^{d}$ be the input hidden state of the current token. The model projects $\mathbf{h}_{t}$ into query, key, and value vectors:
$$
\mathbf{q}_{t}=\mathbf{W}_{Q}\mathbf{h}_{t},\quad\mathbf{k}_{t}=\mathbf{W}_{K}\mathbf{h}_{t},\quad\mathbf{v}_{t}=\mathbf{W}_{V}\mathbf{h}_{t}, \tag{3}
$$
where $\mathbf{W}_{Q},\mathbf{W}_{K},\mathbf{W}_{V}$ are learnable projection matrices. The query $\mathbf{q}_{t}$ attends to the keys of all preceding positions $j\in\{1,\dots,t\}$ . The attention weight $\alpha_{t,j}$ between the current token $t$ and a past token $j$ is:
$$
\alpha_{t,j}=\frac{\exp(e_{t,j})}{\sum_{i=1}^{t}\exp(e_{t,i})},\qquad e_{t,j}=\frac{\mathbf{q}_{t}^{\top}\mathbf{k}_{j}}{\sqrt{d_{k}}}. \tag{4}
$$
These scores represent the relevance of the current step to the $j$ -th token. Finally, the output of the attention head $\mathbf{o}_{t}$ is the weighted sum of all historical value vectors:
$$
\mathbf{o}_{t}=\sum_{j=1}^{t}\alpha_{t,j}\mathbf{v}_{j}. \tag{5}
$$
As Equation 5 implies, calculating $\mathbf{o}_{t}$ requires access to the entire sequence of past keys and values $\{\mathbf{k}_{j},\mathbf{v}_{j}\}_{j=1}^{t-1}$ . In standard implementations, these vectors are stored in the KV cache to avoid redundant computation (Vaswani et al., 2017; Pope et al., 2023). During LRM inference, the reasoning trace is exceptionally long, imposing significant memory bottlenecks and increasing computational overhead.
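Equations 3–5 for a single head at one decode step can be sketched as follows. This is a plain-Python illustration for clarity (real implementations use batched matrix operations); note how the cache grows by one `(k, v)` pair per step, which is why long reasoning traces are memory-costly.

```python
# Illustrative single-head decode step implementing Eqs. (3)-(5):
# project the hidden state, append (k, v) to the cache, and return the
# softmax-weighted sum over all cached values.
import math

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def decode_step(h_t, W_q, W_k, W_v, kv_cache):
    q = matvec(W_q, h_t)                                   # Eq. (3)
    kv_cache.append((matvec(W_k, h_t), matvec(W_v, h_t)))  # cache grows by 1
    d_k = len(q)
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k, _ in kv_cache]                        # e_{t,j}
    m = max(logits)                                        # for stability
    weights = [math.exp(e - m) for e in logits]
    z = sum(weights)
    alphas = [w / z for w in weights]                      # Eq. (4)
    return [sum(a * v[i] for a, (_, v) in zip(alphas, kv_cache))
            for i in range(d_k)]                           # Eq. (5)

# Two decode steps with identity projections: the second step attends over
# both cached entries, so its output mixes both value vectors.
I2 = [[1.0, 0.0], [0.0, 1.0]]
cache = []
o1 = decode_step([1.0, 0.0], I2, I2, I2, cache)
o2 = decode_step([0.0, 1.0], I2, I2, I2, cache)
```

Evicting an entry from `kv_cache` removes one term from the sums in Eqs. 4 and 5, which is precisely how cache eviction trades attention fidelity for memory and compute savings.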
## 3 Observations and Insight
This section presents the observed sparsity of thinking tokens and the Pareto Principle in LRMs, serving as the basis for DynTS. Detailed experimental settings and additional results are provided in Appendix § B.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: Reaching Budget for Critical Tokens in a Reasoning Model
### Overview
This diagram illustrates the training and inference processes of a reasoning model, focusing on how critical tokens are selected and managed within a reach budget. The left side depicts the training phase, while the right side details the inference process, broken down into multiple steps. The core concept revolves around an Importance Predictor (IP) and a Large Reasoning Model (LRM) working together to identify and retain relevant information.
### Components/Axes
The diagram is divided into two main sections: "Training" (left) and "Inference" (right).
**Training Section:**
* **Input Tokens:** Represented as a grid of small rectangles.
* **Thinking Tokens:** A grid of small rectangles above the Input Tokens.
* **Answer Tokens:** A grid of small rectangles below the Input Tokens.
* **Importance Predictor (IP):** A flame-shaped icon.
* **Large Reasoning Model (LRM):** A snowflake-shaped icon.
* **Mean Squared Error Loss:** Label above the Thinking Tokens.
* **Aggregate & Backward:** Arrows indicating the flow of information.
**Inference Section:**
* **Question:** Label at the top-left of the inference section.
* **Multi Step:** Label indicating multiple steps in the inference process.
* **Reach Budget:** Label above the token selection grids.
* **Select Critical Tokens:** Label above the token selection grids.
* **Keep/Evict/Retain:** Labels indicating actions taken on tokens.
* **KV Cache Budget:** Label on the right side of the token selection grids.
* **Current Token X:** Label at the bottom-left.
* **LRM with IP:** Label indicating the integration of the LRM and IP.
* **Input:** Label indicating the input to the LRM.
* **Output:** Label indicating the output of the LRM.
* **Y Next Token:** Label indicating the next token.
* **Predicted Score:** Label indicating the predicted score.
* **Steps:** Label at the bottom-right.
* **Tokens A-O:** Represented as rectangles within the grids. Each token has an associated numerical value.
### Detailed Analysis or Content Details
**Training Section:**
The training section shows a flow of information from Input Tokens to Thinking Tokens to Answer Tokens. The Mean Squared Error Loss is calculated and used to refine the Importance Predictor (IP) and Large Reasoning Model (LRM). The flow is indicated by "Aggregate" and "Backward" arrows.
**Inference Section:**
The inference section is broken down into multiple steps, each involving the selection of critical tokens.
* **Step 1:**
* Tokens A, B, C, D, E, F, G, H, I, J, K are present.
* Token A has a value of ∞ (infinity).
* Token B has a value of ∞ (infinity).
* Token C has a value of 0.2.
* Token D has a value of 0.1.
* Token E has a value of 0.5.
* Token F has a value of 0.1.
* Token G has a value of 0.4.
* Token H has a value of 0.2.
* Token I has a value of 0.1.
* Token J has a value of 0.3.
* Token K has a value of 0.3.
* "Evict" is indicated for tokens C and E.
* "Retain" is indicated for token I.
* **Step 2:**
* Tokens A, B, C, D, E, G, H, I, J, K, L, M, N, O are present.
* Token A has a value of ∞ (infinity).
* Token B has a value of ∞ (infinity).
* Token C has a value of 0.2.
* Token D has a value of 0.1.
* Token E has a value of 0.5.
* Token G has a value of 0.4.
* Token H has a value of 0.2.
* Token I has a value of 0.1.
* Token J has a value of 0.3.
* Token K has a value of 0.3.
* Token L has a value of 0.7.
* Token M has a value of 0.4.
* Token N has a value of 0.1.
* Token O has a value of 0.2.
* "Evict" is indicated for tokens D and H.
* "Keep" is indicated for token B.
* **Bottom Flow:**
* Current Token X is input into the LRM with IP, producing Output Y (Next Token) and a Predicted Score of 0.2.
* The next token is X, with a value of 0.2.
### Key Observations
* The values associated with the tokens appear to represent some form of importance or relevance score.
* The "Reach Budget" and "KV Cache Budget" suggest a constraint on the number of tokens that can be retained.
* The Importance Predictor (IP) plays a crucial role in guiding the selection of critical tokens.
* The process is iterative, with multiple steps involved in refining the selection of tokens.
* The infinity values (∞) likely represent tokens that are considered highly important and are always retained.
### Interpretation
This diagram illustrates a mechanism for efficient reasoning by selectively retaining critical information. The LRM, guided by the IP, prioritizes tokens based on their relevance to the question. The reach and cache budgets impose constraints, forcing the model to make choices about which tokens to keep and which to evict. The iterative process allows the model to refine its selection over multiple steps, ultimately leading to a more focused and efficient reasoning process. The numerical values associated with the tokens likely represent the IP's confidence in their importance. The "Keep," "Evict," and "Retain" actions demonstrate a dynamic memory management strategy. The diagram suggests a system designed to handle long-context reasoning tasks by intelligently managing the information available to the model. The flow from Input Tokens to Output Y represents the core reasoning process, with the IP acting as a filter to enhance performance. The Mean Squared Error Loss in the training phase indicates that the IP is being trained to accurately predict token importance.
</details>
Figure 4: Overview of DynTS. (Left) Importance Predictor Training. The upper heatmap visualizes attention weights, where orange intensity represents the importance of thinking tokens to the answer. The lower part shows an LRM integrated with an Importance Predictor (IP) to learn these importance scores. (Right) Inference with KV Cache Selection. The model outputs the next token and a predicted importance score of the current token. When the cache budget is reached, the selection strategy retains the KV cache of question tokens, local tokens, and top-k thinking tokens based on the predicted importance score.
### 3.1 Sparsity for Thinking Tokens
Previous works (Bogdan et al., 2025; Zhang et al., 2023; Singh et al., 2024) have shown that attention weights (Eq. 4) serve as a reliable proxy for token importance. Building on this insight, we calculate an importance score for each question and thinking token by accumulating the attention they receive from all answer tokens. Formally, the importance scores are defined as:
$$
I_{x_{j}}=\sum_{i=1}^{K}\alpha_{a_{i},x_{j}},\qquad I_{t_{j}}=\sum_{i=1}^{K}\alpha_{a_{i},t_{j}}, \tag{6}
$$
where $I_{x_{j}}$ and $I_{t_{j}}$ denote the importance scores of the $j$ -th question token $x_{j}$ and thinking token $t_{j}$ . Here, $\alpha_{a_{i},x_{j}}$ and $\alpha_{a_{i},t_{j}}$ represent the attention weights from the $i$ -th answer token $a_{i}$ to the corresponding question or thinking token, and $K$ is the total number of answer tokens. We perform full autoregressive inference on LRMs to extract attention weights and compute token-level importance scores for both question and thinking tokens.
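The accumulation in Equation 6 can be sketched directly from an attention matrix. The snippet below uses toy sizes and random weights in place of real model attention; `M`, `L`, and `K` play the roles of the question, thinking, and answer token counts.

```python
import numpy as np

# Rows: answer tokens a_1..a_K; columns: [x_1..x_M, t_1..t_L].
M, L, K = 3, 5, 2
rng = np.random.default_rng(0)
logits = rng.normal(size=(K, M + L))
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Eq. 6: importance = attention mass accumulated over all answer tokens.
importance = alpha.sum(axis=0)
I_question, I_thinking = importance[:M], importance[M:]
assert np.isclose(importance.sum(), K)  # each softmax row sums to 1
```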
Observation. As illustrated in Fig. 2, the question tokens (left panel) exhibit consistently significant and dense importance scores. In contrast, the thinking tokens (right panel) display a highly sparse distribution. Despite the extensive reasoning trace (exceeding 12k tokens), only $21.1\%$ of thinking tokens exceed the mean importance score. This indicates that the vast majority of reasoning steps exert only a marginal influence on the final answer.
Analysis. Following attention-based methods (Cai et al., 2025; Li et al., 2024; Cai et al., 2024), tokens with higher importance scores intuitively correspond to decision-critical reasoning steps, which are critical for the model to generate the final answer. The low-importance tokens serve as syntactic scaffolding or intermediate states that become redundant as reasoning progresses (we report the ratio of content words in Appendix B.2). Consequently, we hypothesize that the model maintains reasoning performance close to that of the full token sequence even when it selectively retains only these critical thinking tokens.
### 3.2 Pareto Principle in LRMs
To validate the aforementioned hypothesis, we retain all question tokens while preserving only the top-$p\%$ of thinking tokens ranked by importance score, and prompt the model to directly generate the final answer.
Observation. As illustrated in Fig. 3 (Left), the importance-based top-$p\%$ selection strategy substantially outperforms both the random- and bottom-selection baselines. Notably, the model recovers nearly its full performance (grey dashed line) when retaining only the $\sim 30\%$ of thinking tokens with the highest importance scores. Fig. 3 (Right) further confirms this trend across six diverse datasets, where the performance polygon under the top-$30\%$ retention strategy almost completely overlaps with that of the full thinking tokens.
Insights. These empirical results reveal the Pareto Principle in LRM reasoning: only a small subset of thinking tokens ($\sim 30\%$) with high importance scores serve as “pivotal nodes” critical for the model to output a final answer, while the remaining tokens contribute negligibly to the outcome. This finding provides strong empirical support for LRM KV cache compression, indicating that it is possible to reduce memory footprint and computational overhead without sacrificing performance.
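The top-$p\%$ retention used in this experiment can be sketched as follows (a minimal illustration with made-up scores, not the paper's evaluation harness); retained tokens are returned in their original temporal order.

```python
import numpy as np

def retain_top_p(importance, p):
    """Indices of the top-p fraction of thinking tokens by importance,
    returned in original (temporal) order."""
    k = max(1, int(round(p * len(importance))))
    top = np.argsort(importance)[-k:]   # k highest-scoring tokens
    return np.sort(top)                 # restore reading order

scores = np.array([0.2, 0.1, 0.5, 0.1, 0.4, 0.2, 0.1, 0.3, 0.3, 0.7])
keep = retain_top_p(scores, 0.3)        # retain ~30% (the Pareto point)
assert list(keep) == [2, 4, 9]          # the tokens scoring 0.5, 0.4, 0.7
```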
## 4 Dynamic Thinking-Token Selection
Building on the Pareto Principle in LRMs, critical thinking tokens can be identified via the importance score computed by Equation 6. However, this computation requires the attention weights from the answer to the thinking tokens, which are inaccessible until the model completes the entire decoding stage. To address this limitation, we introduce an Importance Predictor that dynamically estimates the importance score of each thinking token at inference time. Furthermore, we design a decoding-time KV Cache Selection Strategy that retains critical thinking tokens and evicts redundant ones. We refer to this approach as DynTS (Dynamic Thinking-Token Selection); an overview is illustrated in Fig. 4.
### 4.1 Importance Predictor
#### Integrate Importance Predictor in LRMs.
Transformer-based Large Language Models (LLMs) typically consist of stacked Transformer blocks followed by a language modeling head (Vaswani et al., 2017), where the output of the final block serves as a feature representation of the current token. Building on this architecture, we attach an additional lightweight MLP head, termed the Importance Predictor (Huang et al., 2024), to the final hidden state. During inference, it predicts the importance score of the current thinking token, capturing its contribution to the final answer. Formally, we define the modified LRM as a mapping function $\mathcal{M}$ that processes the input sequence $\mathbf{x}_{\leq t}$ to produce a dual-output tuple comprising the next token $x_{t+1}$ and the current importance score $s_{x_{t}}$ :
$$
\mathcal{M}(\mathbf{x}_{\leq t})\rightarrow(x_{t+1},s_{x_{t}}) \tag{7}
$$
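A minimal forward-pass sketch of such a head is given below, using the MLP shape $d\to 2d\to d/2\to 1$ stated in Section 5; the ReLU activations, sigmoid output, and random weights are illustrative assumptions, not the trained predictor.

```python
import numpy as np

class ImportancePredictor:
    """MLP head (d -> 2d -> d/2 -> 1) applied to the final hidden state."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, (d, 2 * d))
        self.W2 = rng.normal(0.0, 0.02, (2 * d, d // 2))
        self.W3 = rng.normal(0.0, 0.02, (d // 2, 1))

    def __call__(self, h):
        z = np.maximum(0.0, h @ self.W1)          # ReLU (assumed activation)
        z = np.maximum(0.0, z @ self.W2)
        s = 1.0 / (1.0 + np.exp(-(z @ self.W3)))  # squash score into (0, 1)
        return float(s[0])

d = 64
ip = ImportancePredictor(d)
h_t = np.random.default_rng(1).normal(size=d)  # final hidden state of token t
s_t = ip(h_t)                                  # predicted importance s_{x_t}
assert 0.0 < s_t < 1.0
```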
#### Predictor Training.
To obtain supervision signals for training, we prompt the LRMs based on the training dataset to generate complete sequences denoted as $\{x_{1\dots M},t_{1\dots L},a_{1\dots K}\}$ , filtering out incorrect or incomplete reasoning. Here, $x$ , $t$ , and $a$ represent the question, thinking, and answer tokens, respectively. Based on the observation in Section § 3, the thinking tokens significantly outnumber answer tokens ( $L\gg K$ ), and question tokens remain essential. Therefore, DynTS only focuses on predicting the importance of thinking tokens. By utilizing the attention weights from answer to thinking tokens, we derive the ground-truth importance score $I_{t_{i}}$ for each thinking token according to Equation 6. Finally, the Importance Predictor parameters can be optimized by minimizing the Mean Squared Error (MSE) loss (Wang and Bovik, 2009) as follows:
$$
\mathcal{L}_{\text{MSE}}=\frac{1}{L}\sum_{i=1}^{L}(I_{t_{i}}-s_{t_{i}})^{2}. \tag{8}
$$
To preserve the LRMs’ original performance, we freeze the backbone parameters and optimize the Importance Predictor exclusively. The trained model can predict the importance of thinking tokens to the answer. Since this paper focuses on mathematical reasoning tasks, we optimize the Importance Predictor only on the MATH training set and validate it across six other datasets (see Section § 6.1).
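As a worked toy example of the loss in Equation 8 (made-up scores, not real training data):

```python
def mse_loss(targets, preds):
    """Eq. 8: mean squared error between ground-truth importance scores
    (derived from answer-to-thinking attention, Eq. 6) and predictions."""
    assert len(targets) == len(preds)
    return sum((t - s) ** 2 for t, s in zip(targets, preds)) / len(targets)

I_t = [0.2, 0.1, 0.5, 0.1, 0.4]   # toy ground-truth importance scores
s_t = [0.25, 0.1, 0.4, 0.2, 0.4]  # toy predictor outputs
loss = mse_loss(I_t, s_t)
assert abs(loss - 0.0045) < 1e-9  # (0.05^2 + 0.1^2 + 0.1^2) / 5
```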
### 4.2 KV Cache Selection
During LRMs’ inference, we establish a maximum KV cache budget $B$ , which is composed of a question window $W_{q}$ , a selection window $W_{s}$ , and a local window $W_{l}$ , formulated as $B=W_{q}+W_{s}+W_{l}$ . Specifically, the question window stores the KV caches of question tokens generated during the prefilling phase, i.e., the window size $W_{q}$ is equal to the number of question tokens $M$ ( $W_{q}=M$ ). Since these tokens are critical for the final answer (see Section § 3), we assign an importance score of $+\infty$ to these tokens, ensuring their KV caches are immune to eviction throughout the inference process.
In the subsequent decoding phase, we maintain a sequential stream of tokens. Newly generated KV caches and their corresponding importance scores are sequentially appended to the selection window ( $W_{s}$ ) and the local window ( $W_{l}$ ). Once the total token count reaches the budget limit $B$ , the critical token selection process is triggered, as illustrated in Fig. 4 (Right). Within the selection window, we retain the KV caches of the top- $k$ tokens with the highest scores and evict the remainder. Simultaneously, drawing inspiration from (Chen et al., 2024; Zhang et al., 2023; Zhao et al., 2024), we maintain the KV caches within the local window to ensure the overall coherence of the subsequently generated sequence. This inference process continues until decoding terminates.
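A minimal sketch of one eviction event, assuming an importance score is tracked for every cached token (question tokens carry $+\infty$ as described above); the toy scores mirror the Fig. 4 example, and the `ratio` parameter stands in for the top-$k$ choice inside the selection window.

```python
def select_kv(scores, W_q, W_l, ratio):
    """Indices of KV entries to keep: all question tokens, the top-scoring
    fraction `ratio` of the selection window, and the local window."""
    n = len(scores)
    sel = list(range(W_q, n - W_l))           # selection-window indices
    k = max(1, int(round(ratio * len(sel))))
    top = sorted(sel, key=lambda i: scores[i], reverse=True)[:k]
    return list(range(W_q)) + sorted(top) + list(range(n - W_l, n))

inf = float("inf")
scores = [inf, inf, 0.2, 0.1, 0.5, 0.1, 0.4, 0.2, 0.1, 0.3, 0.3]
keep = select_kv(scores, W_q=2, W_l=2, ratio=0.3)
assert keep == [0, 1, 4, 6, 9, 10]  # question + top-2 thinking + local window
```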
## 5 Theoretical Overhead Analysis
In DynTS, the KV cache selection strategy reduces computational overhead by constraining cache length, while the importance predictor introduces a slight overhead. In this section, we theoretically analyze the trade-off between these two components and derive the Break-even Condition required to achieve net computational gains.
Notation. Let $\mathcal{M}_{\text{base}}$ be the vanilla LRM with $L$ layers and hidden dimension $d$, and $\mathcal{M}_{\text{opt}}$ be the LRM equipped with the Importance Predictor (MLP: $d\to 2d\to d/2\to 1$). We define the prefill length as $M$ and the current decoding step as $i\in\mathbb{Z}^{+}$. For vanilla decoding, the effective KV cache length grows linearly as $S_{i}^{\text{base}}=M+i$. DynTS instead evicts $K$ tokens via KV Cache Selection whenever the effective cache length reaches the budget $B$, yielding $S_{i}^{\text{opt}}=M+i-n_{i}\cdot K$, where $n_{i}=\max\left(0,\left\lfloor\frac{(M+i)-B}{K}\right\rfloor+1\right)$ denotes the number of cache eviction events up to step $i$. Quantifying computational overhead in Floating-Point Operations (FLOPs), we establish the following theorem. The detailed proof is provided in Appendix A.
**Theorem 5.1 (Computational Gain)**
*Let $\Delta\mathcal{C}(i)$ denote the reduction in FLOPs achieved by DynTS at decoding step $i$. The gain function is the difference between the eviction savings from KV Cache Selection and the overhead introduced by the predictor:
$$
\Delta\mathcal{C}(i)=\underbrace{n_{i}\cdot 4LdK}_{\text{Eviction Saving}}-\underbrace{(6d^{2}+d)}_{\text{Predictor Overhead}}, \tag{9}
$$*
Based on the formulation above, we derive a critical corollary regarding the net computational gain.
**Corollary 5.2 (Break-even Condition)**
*To achieve a net computational gain ( $\Delta\mathcal{C}(i)>0$ ) at the $n_{i}$ -th eviction event, the eviction volume $K$ must satisfy the following inequality:
$$
K>\frac{6d^{2}+d}{n_{i}\cdot 4Ld}\approx\frac{1.5d}{n_{i}L} \tag{10}
$$*
This inequality provides a theoretical lower bound on the eviction volume $K$, demonstrating that the break-even point is determined by the model’s architecture (hidden dimension $d$ and layer count $L$).
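The corollary can be checked numerically. The sketch below plugs in an R1-Llama-like shape (d = 4096, L = 32 are assumed values, consistent with the $\frac{1.5d}{n_{i}L}=192$ figure quoted in Section 6.2) together with the eviction-count formula from Section 5.

```python
def n_events(M, i, B, K):
    """n_i: number of eviction events by decoding step i (Section 5)."""
    return max(0, (M + i - B) // K + 1)

def delta_flops(n_i, L, d, K):
    """Eq. 9: eviction savings minus Importance Predictor overhead."""
    return n_i * 4 * L * d * K - (6 * d * d + d)

d, L = 4096, 32                          # assumed R1-Llama-like shape
K_min = 1.5 * d / L                      # Eq. 10 bound at the first eviction
assert K_min == 192.0
assert n_events(M=500, i=4500, B=5000, K=900) == 1  # budget just reached
assert delta_flops(1, L, d, K=900) > 0   # K = 900 clears the break-even point
assert delta_flops(1, L, d, K=100) < 0   # too small an eviction volume loses
```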
Table 1: Performance comparison of different methods on R1-Llama and R1-Qwen. We report the average Pass@1 and Throughput (TPS) across six benchmarks. “Transformers” denotes the full cache baseline, and “Window” represents the local window baseline.
| Method | AIME24 Pass@1 | TPS | AIME25 Pass@1 | TPS | AMC23 Pass@1 | TPS | GPQA-D Pass@1 | TPS | GK23EN Pass@1 | TPS | MATH500 Pass@1 | TPS | Avg. Pass@1 | TPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Llama | | | | | | | | | | | | | | |
| Transformers | 47.3 | 215.1 | 28.6 | 213.9 | 86.5 | 200.6 | 46.4 | 207.9 | 73.1 | 390.9 | 87.5 | 323.4 | 61.6 | 258.6 |
| Window | 18.6 | 447.9 | 14.6 | 441.3 | 59.5 | 409.4 | 37.6 | 408.8 | 47.0 | 622.6 | 58.1 | 590.5 | 39.2 | 486.7 |
| StreamingLLM | 20.6 | 445.8 | 16.6 | 445.7 | 65.0 | 410.9 | 37.8 | 407.4 | 53.4 | 624.6 | 66.1 | 592.1 | 43.3 | 487.7 |
| SepLLM | 30.0 | 448.2 | 20.0 | 445.1 | 71.0 | 414.1 | 39.7 | 406.6 | 61.4 | 635.0 | 74.5 | 600.4 | 49.4 | 491.6 |
| H2O | 38.6 | 426.2 | 22.6 | 423.4 | 82.5 | 396.1 | 41.6 | 381.5 | 67.5 | 601.8 | 82.7 | 573.4 | 55.9 | 467.1 |
| SnapKV | 39.3 | 438.2 | 24.6 | 436.3 | 80.5 | 406.9 | 41.9 | 394.1 | 68.7 | 615.7 | 83.1 | 584.5 | 56.3 | 479.3 |
| R-KV | 44.0 | 437.4 | 26.0 | 434.7 | 86.5 | 409.5 | 44.5 | 394.9 | 71.4 | 622.6 | 85.2 | 589.2 | 59.6 | 481.4 |
| DynTS (Ours) | 49.3 | 444.6 | 29.3 | 443.5 | 87.0 | 412.9 | 46.3 | 397.6 | 72.3 | 631.8 | 87.2 | 608.2 | 61.9 | 489.8 |
| R1-Qwen | | | | | | | | | | | | | | |
| Transformers | 52.0 | 357.2 | 35.3 | 354.3 | 87.5 | 376.2 | 49.0 | 349.4 | 77.9 | 593.7 | 91.3 | 517.3 | 65.5 | 424.7 |
| Window | 41.3 | 650.4 | 31.3 | 643.0 | 82.0 | 652.3 | 45.9 | 634.1 | 71.8 | 815.2 | 85.0 | 767.0 | 59.5 | 693.7 |
| StreamingLLM | 42.0 | 655.7 | 29.3 | 648.5 | 85.0 | 657.2 | 45.9 | 631.1 | 71.2 | 824.0 | 85.8 | 786.1 | 59.8 | 700.5 |
| SepLLM | 38.6 | 650.0 | 31.3 | 647.6 | 85.5 | 653.2 | 45.6 | 639.5 | 72.0 | 820.1 | 84.4 | 792.2 | 59.6 | 700.4 |
| H2O | 42.6 | 610.9 | 33.3 | 610.7 | 84.5 | 609.9 | 48.1 | 593.6 | 74.1 | 780.1 | 87.0 | 725.4 | 61.6 | 655.1 |
| SnapKV | 48.6 | 639.6 | 33.3 | 633.1 | 87.5 | 633.2 | 46.5 | 622.0 | 74.9 | 787.4 | 88.2 | 768.7 | 63.2 | 680.7 |
| R-KV | 44.0 | 639.5 | 32.6 | 634.7 | 85.0 | 636.8 | 47.2 | 615.1 | 75.8 | 792.8 | 88.8 | 765.5 | 62.2 | 680.7 |
| DynTS (Ours) | 52.0 | 645.6 | 36.6 | 643.0 | 88.5 | 646.0 | 48.1 | 625.7 | 76.4 | 788.5 | 90.0 | 779.5 | 65.3 | 688.1 |
## 6 Experiment
This section introduces the experimental settings, followed by the main results, ablation studies on retained tokens and hyperparameters, and an analysis of the Importance Predictor. For more detailed configurations and additional results, please refer to Appendix C and D.
### 6.1 Experimental Setup
Models and Datasets. We conduct experiments on two mainstream LRMs: R1-Qwen (DeepSeek-R1-Distill-Qwen-7B) and R1-Llama (DeepSeek-R1-Distill-Llama-8B) (Guo et al., 2025). To evaluate the performance and robustness of our method across diverse tasks, we select five mathematical reasoning datasets of varying difficulty levels, AIME24 (Zhang and Math-AI, 2024), AIME25 (Zhang and Math-AI, 2025), AMC23 (https://huggingface.co/datasets/math-ai/amc23), GK23EN (GAOKAO2023EN, https://huggingface.co/datasets/MARIO-Math-Reasoning/Gaokao2023-Math-En), and MATH500 (Hendrycks et al., 2021), along with the GPQA-D (GPQA-Diamond) (Rein et al., 2024) scientific question-answering dataset as evaluation benchmarks.
Implementation Details. (1) Training Settings: To train the Importance Predictor, we sample model-generated responses with correct answers from the MATH training set and calculate the importance scores of their thinking tokens. We freeze the model backbone and optimize only the predictor ($3$-layer MLP), setting the number of training epochs to 15, the learning rate to $5\text{e-}4$, and the maximum sequence length to 18,000. (2) Inference Settings: Following Guo et al. (2025), we set the maximum number of decoding steps to 16,384, the sampling temperature to 0.6, top-$p$ to 0.95, and top-$k$ to 20. We apply budget settings based on task difficulty. For challenging benchmarks (AIME24, AIME25, AMC23, and GPQA-D), we set the budget $B$ to 5,000 with a local window size of 2,000; for simpler tasks, the budget is set to 3,000 with a local window of 1,500 for R1-Qwen and 1,000 for R1-Llama. The token retention ratio in the selection window is set to 0.4 for R1-Qwen and 0.3 for R1-Llama. We generate 5 responses for each problem and report the average Pass@1 as the evaluation metric.
Baselines. Our approach compresses the KV cache by selecting critical tokens; we therefore compare it against state-of-the-art KV cache compression approaches. These include StreamingLLM (Xiao et al., 2024), H2O (Zhang et al., 2023), SepLLM (Chen et al., 2024), and SnapKV (Li et al., 2024) (decode-time variant (Liu et al., 2025a)) for LLMs, along with R-KV (Cai et al., 2025) for LRMs. To ensure a fair comparison, all methods use the same token overhead and maximum budget. We also report results for standard Transformers and the local window method as evaluation baselines.
### 6.2 Main Results
Reasoning Accuracy. As shown in Table 1, our proposed DynTS consistently outperforms all other KV cache eviction baselines. On R1-Llama and R1-Qwen, DynTS achieves average accuracies of $61.9\%$ and $65.3\%$, respectively, significantly surpassing the runner-up methods R-KV ($59.6\%$) and SnapKV ($63.2\%$). Notably, the overall reasoning capability of DynTS is on par with the full-cache Transformers baseline ($61.9\%$ vs. $61.6\%$ on R1-Llama, $65.3\%$ vs. $65.5\%$ on R1-Qwen). DynTS even outperforms Transformers on several challenging tasks, improving accuracy by $2.0\%$ on AIME24 with R1-Llama and by $1.3\%$ on AIME25 with R1-Qwen.
Table 2: Ablation study on different token retention strategies in DynTS, where w.o. Q / T / L denotes the removal of Question tokens (Q), critical Thinking tokens (T), and Local window tokens (L), respectively. T-Random and T-Bottom represent strategies that select thinking tokens randomly and the tokens with the bottom-k importance scores, respectively.
| Method | AIME24 | AIME25 | AMC23 | GPQA-D | GK23EN | MATH500 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R1-Llama | | | | | | | |
| DynTS | 49.3 | 29.3 | 87.0 | 46.3 | 72.3 | 87.2 | 61.9 |
| w.o. L | 40.6 | 23.3 | 86.5 | 46.3 | 72.0 | 85.5 | 59.0 |
| w.o. Q | 19.3 | 14.6 | 59.0 | 38.1 | 47.8 | 59.8 | 39.8 |
| w.o. T | 44.0 | 27.3 | 85.0 | 44.0 | 71.5 | 85.9 | 59.6 |
| T-Random | 24.6 | 16.0 | 59.5 | 37.4 | 51.7 | 63.9 | 42.2 |
| T-Bottom | 20.6 | 15.3 | 59.0 | 37.3 | 47.3 | 59.5 | 39.8 |
| R1-Qwen | | | | | | | |
| DynTS | 52.0 | 36.6 | 88.5 | 48.1 | 76.4 | 90.0 | 65.3 |
| w.o. L | 42.0 | 32.0 | 87.5 | 46.3 | 75.2 | 87.0 | 61.6 |
| w.o. Q | 46.0 | 36.0 | 86.0 | 43.9 | 75.1 | 89.0 | 62.6 |
| w.o. T | 47.3 | 34.6 | 85.5 | 49.1 | 75.1 | 89.2 | 63.5 |
| T-Random | 46.0 | 32.6 | 84.5 | 47.5 | 73.8 | 86.9 | 61.9 |
| T-Bottom | 38.0 | 30.0 | 80.0 | 44.3 | 69.8 | 83.3 | 57.6 |
Inference Efficiency. Referring to Table 1, DynTS achieves $1.9\times$ and $1.6\times$ speedups over standard Transformers on R1-Llama and R1-Qwen, respectively, across all benchmarks, while maintaining throughput comparable to other KV cache compression methods. Figure 5 further shows that, as the generated sequence grows, standard Transformers suffer linear accumulation of both memory footprint and compute overhead (GFLOPs), leading to continuous throughput degradation. In contrast, DynTS effectively bounds resource consumption. The distinctive sawtooth pattern reflects our periodic compression mechanism, with inflection points corresponding to KV Cache Selection events that evict the KV pairs of non-essential thinking tokens. Consequently, the efficiency advantage grows as decoding proceeds: DynTS achieves a peak speedup of $4.51\times$, compresses the memory footprint to $0.19\times$, and reduces the compute overhead to $0.52\times$ relative to the full-cache baseline. The zoom-in view reveals that the computational cost drops below the baseline immediately after the first KV cache eviction, confirming that our experimental settings satisfy the break-even condition ($K=900>\frac{1.5d}{n_{i}L}=192$) of Corollary 5.2.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of Transformers and DynTS
### Overview
This image presents a comparative performance analysis of two models, "Transformers" and "DynTS", across three metrics: Throughput (TPS), KV Memory (GB), and GFLOPs. The performance is evaluated as a function of "Decoding Steps", ranging from 0 to 15k. Each metric is displayed in a separate subplot. The charts include multipliers indicating the relative change in performance between the two models at specific decoding step intervals.
### Components/Axes
* **X-axis (all subplots):** Decoding Steps (0k to 15k, with markers at 2k, 5k, 7k, 10k, 12k, and 15k).
* **Top Subplot (Throughput):**
* Y-axis: Throughput (TPS) - Scale from 0 to 1250.
* Legend (top-right):
* Black Line: Transformers
* Red Line: DynTS
* **Middle Subplot (KV Memory):**
* Y-axis: KV Memory (GB) - Scale from 0 to 40.
* Legend (top-right): Same as above.
* **Bottom Subplot (GFLOPs):**
* Y-axis: GFLOPs - Scale from 15 to 35. A zoomed-in inset chart is present, showing a smaller range (approximately 20 to 20.5).
* Legend (top-right): Same as above.
* **Multipliers:** Red boxes with text indicating the performance ratio (e.g., "1.55x", "0.58x").
### Detailed Analysis or Content Details
**Top Subplot (Throughput):**
* The black line (Transformers) starts at approximately 1100 TPS and rapidly declines to around 250 TPS at 15k decoding steps. The decline appears exponential.
* The red line (DynTS) starts at approximately 200 TPS and decreases more slowly, reaching around 230 TPS at 15k decoding steps. The decline is more linear.
* Multipliers:
* 2k: 1.55x
* 5k: 2.18x
* 7k: 2.69x
* 10k: 3.33x
* 12k: 3.84x
* 15k: 4.51x
**Middle Subplot (KV Memory):**
* The black line (Transformers) starts at approximately 2 GB and increases linearly to around 36 GB at 15k decoding steps.
* The red line (DynTS) starts at approximately 0 GB and increases linearly to around 6 GB at 15k decoding steps.
* Multipliers:
* 2k: 0.58x
* 5k: 0.41x
* 7k: 0.32x
* 10k: 0.26x
* 12k: 0.22x
* 15k: 0.19x
**Bottom Subplot (GFLOPs):**
* The black line (Transformers) starts at approximately 16 GFLOPs and increases to around 31 GFLOPs at 15k decoding steps. The increase is roughly linear.
* The red line (DynTS) starts at approximately 15 GFLOPs and increases to around 23 GFLOPs at 15k decoding steps. The increase is roughly linear.
* Inset Chart: Shows a zoomed-in view of the Transformers line between approximately 4500 and 4900 decoding steps, where the GFLOPs value ranges from 20.0 to 20.5.
* Multipliers:
* 2k: 0.87x
* 5k: 0.77x
* 7k: 0.69x
* 10k: 0.62x
* 12k: 0.57x
* 15k: 0.52x
### Key Observations
* DynTS consistently outperforms Transformers in terms of throughput as decoding steps increase. The multiplier values demonstrate a growing advantage for DynTS.
* DynTS uses significantly less KV memory than Transformers. The memory usage ratio decreases as decoding steps increase.
* Transformers requires more GFLOPs than DynTS, but the difference decreases as decoding steps increase.
* The inset chart in the GFLOPs subplot highlights a small range of values for Transformers, suggesting a relatively stable performance within that decoding step interval.
### Interpretation
The data suggests that DynTS is a more efficient model than Transformers, particularly for longer decoding sequences. While Transformers initially exhibits higher throughput, its performance degrades rapidly with increasing decoding steps, accompanied by a substantial increase in memory usage and computational cost (GFLOPs). DynTS, on the other hand, maintains a more stable throughput and requires significantly less memory, making it more scalable and practical for tasks involving long sequences. The increasing multipliers across all three metrics confirm this trend. The consistent decrease in the memory usage ratio (DynTS/Transformers) indicates that the memory efficiency advantage of DynTS becomes more pronounced as the decoding sequence length grows. The GFLOPs data suggests that DynTS achieves comparable performance with lower computational resources. This could be due to architectural differences or optimization techniques employed in DynTS. The inset chart in the GFLOPs subplot doesn't reveal any significant anomalies but provides a closer look at the Transformers' performance in a specific range.
</details>
Figure 5: Real-time throughput, memory, and compute overhead tracking over total decoding step. The inflection points in the sawtooth correspond to the steps where DynTS executes KV Cache Selection.
### 6.3 Ablation Study
Impact of Retained Tokens. As shown in Tab. 2, the full DynTS method outperforms all ablated variants, achieving the highest average accuracy on both R1-Llama ($61.9\%$) and R1-Qwen ($65.3\%$). This demonstrates that every category of tokens retained by DynTS is critical for the model to output the correct final answer. Moreover, we observe that the strategy for selecting thinking tokens plays a critical role in reasoning performance. Retaining redundant tokens (the T-Random and T-Bottom strategies) causes a larger performance drop than removing thinking tokens entirely ($59.6\%$ on R1-Llama and $63.5\%$ on R1-Qwen). This finding demonstrates the effectiveness of our Importance Predictor in identifying critical tokens. It also explains why existing KV cache compression methods hurt model performance: they inadvertently retain redundant tokens. Finally, the local window is crucial for preserving local linguistic coherence, which contributes to stable model performance.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Heatmap: Pass@1 Performance Comparison - R1-Llama vs. R1-Qwen
### Overview
This image presents a heatmap comparing the Pass@1 performance of two models, R1-Llama and R1-Qwen, across varying combinations of 'Ratio' and 'Local Window Size'. The heatmap uses a color gradient to represent the Pass@1 scores, with warmer colors indicating higher performance.
### Components/Axes
* **X-axis:** Ratio, ranging from 0.1 to 0.5, with increments of 0.1.
* **Y-axis:** Local Window Size, with categories 500, 1000, 2000, and 3000.
* **Two Heatmaps:** One for R1-Llama (left) and one for R1-Qwen (right).
* **Colorbar:** Located on the right side, representing Pass@1 scores ranging from approximately 50 to 56.
* **Titles:** "R1-Llama" above the left heatmap and "R1-Qwen" above the right heatmap.
### Detailed Analysis or Content Details
**R1-Llama Heatmap:**
* **Trend:** Generally, performance increases with increasing Local Window Size and Ratio, but the effect is not uniform.
* **Data Points:**
* Ratio 0.1:
* Local Window Size 500: 49.8
* Local Window Size 1000: 49.9
* Local Window Size 2000: 49.5
* Local Window Size 3000: 49.1
* Ratio 0.2:
* Local Window Size 500: 52.1
* Local Window Size 1000: 51.7
* Local Window Size 2000: 50.9
* Local Window Size 3000: 50.1
* Ratio 0.3:
* Local Window Size 500: 50.7
* Local Window Size 1000: 52.7
* Local Window Size 2000: 52.8
* Local Window Size 3000: 50.6
* Ratio 0.4:
* Local Window Size 500: 50.8
* Local Window Size 1000: 51.9
* Local Window Size 2000: 52.5
* Local Window Size 3000: 50.7
* Ratio 0.5:
* Local Window Size 500: 51.7
* Local Window Size 1000: 51.7
* Local Window Size 2000: 51.4
* Local Window Size 3000: 51.4
**R1-Qwen Heatmap:**
* **Trend:** Similar to R1-Llama, performance generally increases with increasing Local Window Size and Ratio, but with some variations.
* **Data Points:**
* Ratio 0.1:
* Local Window Size 500: 51.5
* Local Window Size 1000: 52.2
* Local Window Size 2000: 52.4
* Local Window Size 3000: 53.9
* Ratio 0.2:
* Local Window Size 500: 51.8
* Local Window Size 1000: 54.4
* Local Window Size 2000: 51.9
* Local Window Size 3000: 53.9
* Ratio 0.3:
* Local Window Size 500: 52.0
* Local Window Size 1000: 53.8
* Local Window Size 2000: 54.6
* Local Window Size 3000: 53.2
* Ratio 0.4:
* Local Window Size 500: 54.3
* Local Window Size 1000: 53.3
* Local Window Size 2000: 56.3
* Local Window Size 3000: 54.4
* Ratio 0.5:
* Local Window Size 500: 54.6
* Local Window Size 1000: 53.0
* Local Window Size 2000: 53.7
* Local Window Size 3000: 53.8
### Key Observations
* R1-Qwen consistently outperforms R1-Llama across all combinations of Ratio and Local Window Size.
* For both models, increasing the Local Window Size from 500 to 2000 generally leads to performance improvements, but increasing it further to 3000 doesn't always yield the same benefit.
* The highest Pass@1 score for R1-Llama is 52.8, while the highest for R1-Qwen is 56.3.
* The performance difference between the models is most pronounced at higher Ratio values (0.4 and 0.5).
### Interpretation
The heatmap demonstrates the impact of 'Ratio' and 'Local Window Size' on the Pass@1 performance of two language models, R1-Llama and R1-Qwen. The 'Ratio' likely represents a parameter controlling the amount of context considered during evaluation, while 'Local Window Size' might relate to the size of the input sequence processed at a time.
The consistent outperformance of R1-Qwen suggests that it is more robust to variations in these parameters or benefits more from larger context windows. The non-linear relationship between Local Window Size and performance indicates that there's an optimal window size beyond which the benefits diminish, potentially due to computational constraints or the model's ability to effectively utilize the additional context.
The heatmap provides valuable insights for optimizing the configuration of these models for specific tasks. It suggests that for R1-Qwen, a Ratio of 0.4 or 0.5 and a Local Window Size of 2000 might be a good starting point for achieving high Pass@1 scores. Further investigation could explore the reasons behind the diminishing returns of larger Local Window Sizes and the specific mechanisms that contribute to R1-Qwen's superior performance.
</details>
Figure 6: The accuracy of R1-Llama and R1-Qwen across different local window sizes and selection window retention ratios.
Local Window & Retention Ratio. As shown in Fig. 6, we report the model’s reasoning performance across different configurations. Within a reasonable range, performance improves with a larger local window and a higher retention ratio; these two settings respectively ensure local contextual coherence and an adequate number of retained thinking tokens. Setting either to an overly small value causes pronounced performance degradation, while excessively large values admit a higher proportion of non-essential tokens, which in turn hurts performance. Empirically, a local window size of approximately 2,000 and a retention ratio of 0.3–0.4 yield the best results. We further observe that R1-Qwen is particularly sensitive to the local window size. This may stem from the Dual Chunk Attention introduced during its long-context pre-training stage (Yang et al., 2025), which biases attention toward tokens within the local window.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Training Metrics
### Overview
The image presents a line chart displaying training metrics over 400 steps. The chart consists of two subplots: the top subplot shows the MSE Loss and Kendall's Tau correlation, while the bottom subplot displays the Overlap Rate for different percentile thresholds (Top-20%, Top-30%, etc.).
### Components/Axes
* **X-axis (both subplots):** Step (ranging from 0 to 400)
* **Y-axis (top subplot):** Value (ranging from 0 to 3)
* **Y-axis (bottom subplot):** Overlap Rate (%) (ranging from 0 to 100)
* **Legend (top-right):**
* Blue Line: MSE Loss
* Orange Line: Kendall
* **Legend (bottom-right):**
* Dark Gray Line: Top-20%
* Purple Line: Top-30%
* Teal Line: Top-40%
* Green Line: Top-50%
* Light Blue Line: Top-60%
* Dark Teal Line: Top-70%
* Light Green Line: Top-80%
* Magenta Line: Top-90%
### Detailed Analysis
**Top Subplot (MSE Loss & Kendall's Tau):**
* **MSE Loss (Blue Line):** The MSE Loss starts at approximately 2.8 and rapidly decreases to around 0.1 within the first 50 steps. After that, it fluctuates around 0.05-0.1 with minor oscillations until step 400.
* **Kendall (Orange Line):** Kendall's Tau starts at approximately 0.9 and decreases to around 0.6 within the first 50 steps. It then plateaus around 0.65-0.75 with slight variations until step 400.
**Bottom Subplot (Overlap Rate):**
* **Top-20% (Dark Gray Line):** Starts at approximately 20% and increases to around 85% by step 50. It then fluctuates between 80% and 90% for the remainder of the steps.
* **Top-30% (Purple Line):** Starts at approximately 20% and increases to around 80% by step 50. It then fluctuates between 75% and 85% for the remainder of the steps.
* **Top-40% (Teal Line):** Starts at approximately 20% and increases to around 75% by step 50. It then fluctuates between 70% and 80% for the remainder of the steps.
* **Top-50% (Green Line):** Starts at approximately 20% and increases to around 70% by step 50. It then fluctuates between 65% and 75% for the remainder of the steps.
* **Top-60% (Light Blue Line):** Starts at approximately 20% and increases to around 65% by step 50. It then fluctuates between 60% and 70% for the remainder of the steps.
* **Top-70% (Dark Teal Line):** Starts at approximately 20% and increases to around 60% by step 50. It then fluctuates between 55% and 65% for the remainder of the steps.
* **Top-80% (Light Green Line):** Starts at approximately 20% and increases to around 55% by step 50. It then fluctuates between 50% and 60% for the remainder of the steps.
* **Top-90% (Magenta Line):** Starts at approximately 20% and increases to around 50% by step 50. It then fluctuates between 45% and 55% for the remainder of the steps.
### Key Observations
* Both MSE Loss and Kendall's Tau converge relatively quickly within the first 50 steps.
* The Overlap Rate increases rapidly for all percentile thresholds within the first 50 steps and then plateaus.
* Higher percentile thresholds (e.g., Top-20%) exhibit higher overlap rates compared to lower percentile thresholds (e.g., Top-90%).
* The overlap rate curves appear to be converging as the number of steps increases.
### Interpretation
The chart demonstrates the training progress of a model. The rapid decrease in MSE Loss and increase in Overlap Rate within the initial steps indicate that the model is learning quickly. The convergence of both metrics suggests that the training process is stabilizing. The different overlap rates for various percentile thresholds likely reflect the model's ability to accurately rank or predict outcomes for different levels of confidence. The higher overlap rates for the top percentiles suggest that the model is more confident and accurate in its predictions for the most likely outcomes. The plateauing of the metrics after 50 steps suggests that the model has reached a point of diminishing returns and further training may not significantly improve performance. The Kendall's Tau metric provides a measure of rank correlation, and its convergence indicates that the model is learning to correctly order the predicted outcomes.
</details>
Figure 7: The top panel illustrates the convergence of MSE Loss and the Kendall rank correlation coefficient over training steps. The bottom panel tracks the overlap rate of the top-$20\%$ ground-truth tokens within the top-$p\%$ ($p\in[20,90]$) predicted tokens.
Budget. We report the model’s reasoning performance and throughput under different budget settings in Fig. 8. As expected, as the KV budget increases, the accuracy of R1-Llama and R1-Qwen improves while the throughput decreases. At the maximum evaluated budget of 5,000, DynTS delivers its strongest reasoning results ($53.0\%$ for R1-Llama and $56.3\%$ for R1-Qwen), minimizing the performance gap with the full-cache baseline.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar and Line Chart: Performance vs. KV Budget for LLMs
### Overview
The image presents two comparative bar and line charts, side-by-side. Both charts illustrate the relationship between "KV Budget" (likely a computational resource allocation) and two performance metrics: "Pass@1" (a measure of accuracy) and "Throughput" (measured in Transactions Per Second - TPS). The left chart focuses on the "R1-Llama" model, while the right chart focuses on the "R1-Qwen" model. Both charts share the same x-axis (KV Budget) and y-axis scales, allowing for direct visual comparison.
### Components/Axes
* **X-axis:** "KV Budget" with values 2500, 3000, 3500, 4000, 4500, and 5000.
* **Left Y-axis:** "Pass@1" ranging from 30 to 80.
* **Right Y-axis:** "Throughput (TPS)" ranging from 400 to 800.
* **Legend (Top-Left of each chart):**
* Blue: "Pass@1" (represented by bars)
* Orange: "Throughput" (represented by a line)
* **Titles:**
* Left Chart: "R1-Llama"
* Right Chart: "R1-Qwen"
### Detailed Analysis or Content Details
**R1-Llama (Left Chart):**
* **Pass@1 (Blue Bars):**
* KV Budget 2500: Approximately 44.2
* KV Budget 3000: Approximately 50.4
* KV Budget 3500: Approximately 51.0
* KV Budget 4000: Approximately 50.8
* KV Budget 4500: Approximately 49.9
* KV Budget 5000: Approximately 53.0
* Trend: The Pass@1 metric initially increases from 2500 to 3500 KV Budget, then plateaus and slightly decreases before increasing again at 5000 KV Budget.
* **Throughput (Orange Line):**
* KV Budget 2500: Approximately 790 TPS
* KV Budget 3000: Approximately 710 TPS
* KV Budget 3500: Approximately 640 TPS
* KV Budget 4000: Approximately 570 TPS
* KV Budget 4500: Approximately 500 TPS
* KV Budget 5000: Approximately 430 TPS
* Trend: The Throughput metric consistently decreases as the KV Budget increases. The line slopes downward.
**R1-Qwen (Right Chart):**
* **Pass@1 (Blue Bars):**
* KV Budget 2500: Approximately 49.8
* KV Budget 3000: Approximately 52.6
* KV Budget 3500: Approximately 54.1
* KV Budget 4000: Approximately 54.3
* KV Budget 4500: Approximately 54.3
* KV Budget 5000: Approximately 56.3
* Trend: The Pass@1 metric generally increases with increasing KV Budget, with a plateau between 4000 and 4500.
* **Throughput (Orange Line):**
* KV Budget 2500: Approximately 770 TPS
* KV Budget 3000: Approximately 730 TPS
* KV Budget 3500: Approximately 680 TPS
* KV Budget 4000: Approximately 620 TPS
* KV Budget 4500: Approximately 570 TPS
* KV Budget 5000: Approximately 530 TPS
* Trend: The Throughput metric consistently decreases as the KV Budget increases, similar to the R1-Llama model. The line slopes downward.
### Key Observations
* **Trade-off:** Both models demonstrate a clear trade-off between Pass@1 and Throughput. Increasing the KV Budget generally improves accuracy (Pass@1) but reduces the number of transactions processed per second (Throughput).
* **Model Differences:** The R1-Qwen model exhibits a more consistent increase in Pass@1 with increasing KV Budget compared to the R1-Llama model, which shows an initial increase followed by a plateau and slight decrease.
* **Throughput Decline:** The decline in Throughput is more pronounced in the R1-Llama model than in the R1-Qwen model.
### Interpretation
The charts suggest that optimizing the KV Budget for these Large Language Models (LLMs) involves balancing accuracy and processing speed. A higher KV Budget allows for more complex computations, potentially leading to more accurate results (higher Pass@1), but at the cost of reduced throughput. The optimal KV Budget will depend on the specific application and its requirements.
The differences between the R1-Llama and R1-Qwen models indicate that they respond differently to changes in KV Budget. R1-Qwen appears to be more efficient in utilizing the increased computational resources to improve accuracy without a significant drop in throughput. This could be due to differences in model architecture, training data, or optimization techniques.
The consistent downward trend in Throughput for both models highlights a fundamental limitation: increasing model complexity (through higher KV Budget) often comes at the expense of processing speed. Further investigation could explore techniques to mitigate this trade-off, such as model quantization or pruning. The data suggests that the R1-Qwen model is more robust to this trade-off than the R1-Llama model.
</details>
Figure 8: Accuracy and throughput across varying KV budgets.
### 6.4 Analysis of Importance Predictor
To validate that the Importance Predictor effectively learns the ground-truth thinking-token importance scores, we report the MSE Loss and the Kendall rank correlation coefficient (Abdi, 2007) in the top panel of Fig. 7. As the number of training steps increases, both metrics exhibit clear convergence. The MSE loss demonstrates that the predictor can fit the true importance scores, while the Kendall coefficient measures the consistency between the ground-truth and predicted rankings. These results indicate that the predictor successfully captures each thinking token’s importance to the answer. Furthermore, we analyze the overlap rate of predicted critical thinking tokens, as shown in the bottom panel of Fig. 7. Notably, at the end of training, the overlap rate of critical tokens within the top-$30\%$ of the predicted tokens exceeds $80\%$. This confirms that the Importance Predictor in DynTS effectively identifies the most pivotal tokens, ensuring the retention of essential thinking tokens even at high compression rates.
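The two diagnostics above are straightforward to reproduce. Below is a minimal sketch (not the paper's code) of Kendall's rank correlation and the top-$p\%$ overlap rate between ground-truth and predicted importance scores; the toy score vectors are purely illustrative.

```python
# Sketch (not the paper's code): the two diagnostics used to evaluate the
# Importance Predictor — Kendall's tau and a top-p% overlap rate.
from itertools import combinations

def kendall_tau(truth, pred):
    """Naive O(n^2) Kendall rank correlation between two score lists (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(truth)), 2):
        s = (truth[i] - truth[j]) * (pred[i] - pred[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def overlap_rate(truth, pred, k_pct=20, p_pct=30):
    """Fraction of the top-k% ground-truth tokens found in the top-p% predictions."""
    n = len(truth)
    by_truth = sorted(range(n), key=lambda i: truth[i], reverse=True)
    by_pred = sorted(range(n), key=lambda i: pred[i], reverse=True)
    top_k = set(by_truth[: n * k_pct // 100])
    top_p = set(by_pred[: n * p_pct // 100])
    return len(top_k & top_p) / len(top_k)

# Toy importance scores for ten thinking tokens (illustrative only).
truth = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.5, 0.0]
pred = [0.8, 0.2, 0.9, 0.1, 0.7, 0.3, 0.5, 0.4, 0.6, 0.0]
```

A well-trained predictor pushes the tau toward 1 and the overlap rate toward 100%, which is exactly the trend the figure reports.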
## 7 Related Work
Recent works on KV cache compression have primarily focused on classical LLMs, applying eviction strategies based on attention scores or heuristic rules. One line of work addresses long-context pruning at the prefill stage, such as SnapKV (Li et al., 2024), PyramidKV (Cai et al., 2024), and AdaKV (Feng et al., 2024). However, these methods are ill-suited for LRM inference, which involves a short prefill followed by long decoding. Several strategies have been proposed specifically for the decoding phase: H2O (Zhang et al., 2023) leverages accumulated attention scores, StreamingLLM (Xiao et al., 2024) retains attention sinks and recent tokens, and SepLLM (Chen et al., 2024) preserves only the separator tokens. More recently, targeting LRMs, Cai et al. (2025) introduced RKV, which adds a similarity-based metric to evict redundant tokens, while RLKV (Du et al., 2025) uses reinforcement learning to retain critical reasoning heads. However, these methods fail to accurately assess the contribution of intermediate tokens to the final answer; consequently, they risk erroneously evicting decision-critical tokens, compromising the model’s reasoning performance.
## 8 Conclusion and Discussion
In this work, we investigated the relationship between reasoning traces and their final answers in LRMs. Our analysis revealed a Pareto principle in LRMs: only the decision-critical thinking tokens (roughly $20\%$ of a reasoning trace) steer the model toward the final answer. Building on this insight, we proposed DynTS, a novel KV cache compression method. Departing from current strategies that rely on local attention scores for eviction, DynTS introduces a learnable Importance Predictor that estimates the contribution of the current token to the final answer. Based on the predicted scores, DynTS retains only the pivotal KV cache entries. Empirical results on six datasets confirm that DynTS outperforms other SOTA baselines. We also discuss the limitations of DynTS and outline potential directions for future improvement; please refer to Appendix E for details.
## Impact Statement
This paper presents work aimed at advancing the field of KV cache compression. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. The primary impact of this research is to improve the memory and computational efficiency of LRM’s inference. By reducing memory requirements, our method helps lower the barrier to deploying powerful models on resource-constrained edge devices. We believe our work does not introduce specific ethical or societal risks beyond the general considerations inherent to advancing generative AI.
## References
- H. Abdi (2007) The kendall rank correlation coefficient. Encyclopedia of measurement and statistics 2, pp. 508–510. Cited by: §6.4.
- J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245. Cited by: §2.
- P. C. Bogdan, U. Macar, N. Nanda, and A. Conmy (2025) Thought anchors: which llm reasoning steps matter?. arXiv preprint arXiv:2506.19143. Cited by: §1, §1, §3.1.
- Z. Cai, W. Xiao, H. Sun, C. Luo, Y. Zhang, K. Wan, Y. Li, Y. Zhou, L. Chang, J. Gu, et al. (2025) R-kv: redundancy-aware kv cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: §3.1, §6.1, §7.
- Z. Cai, Y. Zhang, B. Gao, Y. Liu, Y. Li, T. Liu, K. Lu, W. Xiong, Y. Dong, J. Hu, et al. (2024) Pyramidkv: dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069. Cited by: §3.1, §7.
- G. Chen, H. Shi, J. Li, Y. Gao, X. Ren, Y. Chen, X. Jiang, Z. Li, W. Liu, and C. Huang (2024) Sepllm: accelerate large language models by compressing one segment into one separator. arXiv preprint arXiv:2412.12094. Cited by: §1, §1, §4.2, §6.1, §7.
- Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che (2025) Towards reasoning era: a survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567. Cited by: §1, §2.
- D. Choi, J. Lee, J. Tack, W. Song, S. Dingliwal, S. M. Jayanthi, B. Ganesh, J. Shin, A. Galstyan, and S. B. Bodapati (2025) Think clearly: improving reasoning via redundant token pruning. arXiv preprint arXiv:2507.08806. Cited by: §1.
- G. DeepMind (2025) A new era of intelligence with gemini 3. Note: https://blog.google/products/gemini/gemini-3/#gemini-3-deep-think Cited by: §1.
- A. Devoto, Y. Zhao, S. Scardapane, and P. Minervini (2024) A simple and effective $L_2$ norm-based strategy for kv cache compression. arXiv preprint arXiv:2406.11430. Cited by: §1.
- W. Du, L. Jiang, K. Tao, X. Liu, and H. Wang (2025) Which heads matter for reasoning? rl-guided kv cache compression. arXiv preprint arXiv:2510.08525. Cited by: §7.
- S. Feng, G. Fang, X. Ma, and X. Wang (2025) Efficient reasoning models: a survey. arXiv preprint arXiv:2504.10903. Cited by: §1.
- Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou (2024) Ada-kv: optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550. Cited by: §7.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §B.1, §1, §6.1, §6.1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: §1, §6.1.
- W. Huang, Z. Zhai, Y. Shen, S. Cao, F. Zhao, X. Xu, Z. Ye, Y. Hu, and S. Lin (2024) Dynamic-llava: efficient multimodal large language models via dynamic vision-language context sparsification. arXiv preprint arXiv:2412.00876. Cited by: §4.1.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626. Cited by: §B.1.
- Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen (2024) Snapkv: llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, pp. 22947–22970. Cited by: §1, §3.1, §6.1, §7.
- M. Liu, A. Palnitkar, T. Rabbani, H. Jae, K. R. Sang, D. Yao, S. Shabihi, F. Zhao, T. Li, C. Zhang, et al. (2025a) Hold onto that thought: assessing kv cache compression on reasoning. arXiv preprint arXiv:2512.12008. Cited by: §6.1.
- Y. Liu, J. Fu, S. Liu, Y. Zou, S. Zhang, and J. Zhou (2025b) KV cache compression for inference efficiency in llms: a review. In Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pp. 207–212. Cited by: §1.
- G. Minegishi, H. Furuta, T. Kojima, Y. Iwasawa, and Y. Matsuo (2025) Topology of reasoning: understanding large reasoning models through reasoning graph properties. arXiv preprint arXiv:2506.05744. Cited by: §1.
- OpenAI (2025) OpenAI. Cited by: §1.
- R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean (2023) Efficiently scaling transformer inference. Proceedings of machine learning and systems 5, pp. 606–624. Cited by: §2.
- Z. Qin, Y. Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li (2025) Cake: cascading and adaptive kv cache eviction with layer preferences. arXiv preprint arXiv:2503.12491. Cited by: §1.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: §6.1.
- L. Shi, H. Zhang, Y. Yao, Z. Li, and H. Zhao (2024) Keep the cost down: a review on methods to optimize llm’s kv-cache consumption. arXiv preprint arXiv:2407.18003. Cited by: §1.
- C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao (2024) Rethinking interpretability in the era of large language models. arXiv preprint arXiv:2402.01761. Cited by: §3.1.
- Y. Sui, Y. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, et al. (2025) Stop overthinking: a survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419. Cited by: §2.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §2, §2, §4.1.
- Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? a new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. Cited by: §4.1.
- G. Wei, X. Zhou, P. Sun, T. Zhang, and Y. Wen (2025) Rethinking key-value cache compression techniques for large language model serving. Proceedings of Machine Learning and Systems 7. Cited by: §1.
- S. Wiegreffe and Y. Pinter (2019) Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 11–20. Cited by: §1.
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations. Cited by: §C.3, §1, §6.1, §7.
- F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng, et al. (2025) Towards large reasoning models: a survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686. Cited by: §2.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §6.3.
- D. Zhang, Z. Li, M. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, X. Chen, Y. Zhang, et al. (2025a) From system 1 to system 2: a survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
- Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff, et al. (2025b) A survey on test-time scaling in large language models: what, how, where, and how well?. arXiv preprint arXiv:2503.24235. Cited by: §1.
- Y. Zhang and T. Math-AI (2024) American invitational mathematics examination (aime) 2024. Cited by: §6.1.
- Y. Zhang and T. Math-AI (2025) American invitational mathematics examination (aime) 2025. Cited by: §6.1.
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: §1, §1, §3.1, §4.2, §6.1, §7.
- J. Zhao, Z. Fang, S. Li, S. Yang, and S. He (2024) Buzz: beehive-structured sparse kv cache with segmented heavy hitters for efficient llm inference. arXiv preprint arXiv:2410.23079. Cited by: §4.2.
## Appendix A Proof of Cumulative Computational Gain
**Definition A.1 (Predictor Overhead)**
*Let $\mathcal{M}_{\text{base}}$ be the vanilla LRM with $L$ layers and a hidden dimension $d$ , and $\mathcal{M}_{\text{opt}}$ be the LRM with Importance Predictor. The predictor is defined as a three-layer linear MLP with dimensions $d\to m_{1}\to m_{2}\to m_{3}$ . The computational cost per decode step is:
$$
\mathcal{C}_{\text{mlp}}=2(d\cdot m_{1}+m_{1}\cdot m_{2}+m_{2}\cdot m_{3}). \tag{11}
$$
Setting $m_{1}=2d$ , $m_{2}=d/2$ , and $m_{3}=1$ yields:
$$
\mathcal{C}_{\text{mlp}}=6d^{2}+d \tag{12}
$$*
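As a sanity check, substituting $m_{1}=2d$, $m_{2}=d/2$, and $m_{3}=1$ into Eq. (11) can be verified numerically (a sketch; $d$ is assumed even so that $d/2$ is an integer):

```python
# Sanity check of Eq. (12): with m1 = 2d, m2 = d/2, m3 = 1,
# Eq. (11) reduces to 6d^2 + d.
def mlp_flops(d, m1, m2, m3):
    # Eq. (11): 2 * (d*m1 + m1*m2 + m2*m3)
    return 2 * (d * m1 + m1 * m2 + m2 * m3)

for d in (1024, 4096):
    assert mlp_flops(d, 2 * d, d // 2, 1) == 6 * d * d + d
```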
**Definition A.2 (Effective KV Cache Length)**
*Let $M$ denote the length of the prefill sequence. At decode step $i$ , when the effective cache length reaches the budget $B$ , DynTS performs KV Cache Selection to evict $K$ redundant tokens. The effective KV cache length $S_{i}$ for the base and optimized models is given by:
$$
S_{i}^{\text{base}}=M+i,\quad S_{i}^{\text{opt}}=M+i-n_{i}\cdot K, \tag{13}
$$
where $n_{i}=\max\left(0,\left\lfloor\frac{(M+i)-B}{K}\right\rfloor+1\right)$ denotes the number of cache-eviction events that have occurred by step $i$.*
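The eviction schedule in Eq. (13) can be sketched directly (illustrative values for $M$, $B$, and $K$; Python's floor division matches the $\lfloor\cdot\rfloor$ in the definition):

```python
# Sketch of Eq. (13): effective KV cache length under periodic eviction.
# M = prefill length, B = KV budget, K = tokens evicted per selection event.
def effective_cache_len(i, M, B, K):
    n_i = max(0, (M + i - B) // K + 1)  # eviction events up to decode step i
    return M + i - n_i * K

M, B, K = 100, 500, 50
assert effective_cache_len(0, M, B, K) == 100    # before any eviction: M + i
assert effective_cache_len(400, M, B, K) == 450  # first eviction fires at M + i == B
```

The effective length thus grows linearly until it hits the budget, then saw-tooths just below it, which is what keeps the attention cost bounded.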
**Definition A.3 (LLM Overhead)**
*The computational overhead per step for a decoder-only transformer is composed of a static component $\mathcal{C}_{\text{static}}$ (independent of sequence length) and a dynamic attention component $\mathcal{C}_{\text{attn}}$ (linearly dependent on effective cache length). The static cost $\mathcal{C}_{\text{static}}$ for the backbone remains identical for both models. The self-attention cost for a single layer is $4\cdot d\cdot S_{i}$ (counting $Q\cdot K^{\top}$ and $\text{Softmax}\cdot V$ ). Across $L$ layers:
$$
\mathcal{C}_{\text{attn}}(S_{i})=4\cdot L\cdot d\cdot S_{i} \tag{14}
$$*
*Proof: Computational Gain.*
Let $\Delta\mathcal{C}(i)$ be the reduction in FLOPs achieved by DynTS at decoding step $i$ , which is defined as $\text{FLOPs}(\mathcal{M}_{\text{base}}(i))-\text{FLOPs}(\mathcal{M}_{\text{opt}}(i))$ :
$$
\Delta\mathcal{C}(i)=\left[\mathcal{C}_{\text{static}}(i)+\mathcal{C}_{\text{attn}}(S_{i}^{\text{base}})\right]-\left[\mathcal{C}_{\text{static}}(i)+\mathcal{C}_{\text{mlp}}+\mathcal{C}_{\text{attn}}(S_{i}^{\text{opt}})\right]. \tag{15}
$$
Eliminating the static term $\mathcal{C}_{\text{static}}$:
$$
\Delta\mathcal{C}(i)=\mathcal{C}_{\text{attn}}(S_{i}^{\text{base}})-\mathcal{C}_{\text{attn}}(S_{i}^{\text{opt}})-\mathcal{C}_{\text{mlp}}. \tag{16}
$$
Substituting $S_{i}^{\text{base}}$ , $S_{i}^{\text{opt}}$ and $\mathcal{C}_{\text{mlp}}$ :
$$
\begin{aligned}
\Delta\mathcal{C}(i)&=4\cdot L\cdot d\cdot(M+i)-4\cdot L\cdot d\cdot(M+i-n_{i}\cdot K)-\mathcal{C}_{\text{mlp}}\\
&=\underbrace{n_{i}\cdot 4LdK}_{\text{Eviction Saving}}-\underbrace{(6d^{2}+d)}_{\text{Predictor Overhead}}.
\end{aligned}\tag{17}
$$
This completes the proof. $\square$
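To give Eq. (17) a concrete scale, the following sketch plugs in illustrative values for $L$, $d$, and $K$ (assumptions, not the specs of any evaluated model); ignoring the $+d$ term, the saving dominates once $n_{i}\cdot K>3d/(2L)$.

```python
# Illustrative instance of Eq. (17); L, d, K are assumptions, not model specs.
L, d, K = 32, 4096, 512

def delta_flops(n_i):
    eviction_saving = n_i * 4 * L * d * K
    predictor_overhead = 6 * d * d + d
    return eviction_saving - predictor_overhead

# After one eviction event the saving outweighs the predictor cost
# (here n_i * K = 512 > 3d / (2L) = 192):
assert delta_flops(1) > 0
```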
## Appendix B Empirical Analysis and Observations
### B.1 Implementation Details
To calculate the importance of each thinking token to the final answer sequence, we first utilized vLLM (Kwon et al., 2023) to generate the complete reasoning trace. Following (Guo et al., 2025), we set the temperature to 0.6, top-p to 0.95, top-k to 20, and the max length to 16,384. To ensure sequence completeness, we sampled 5 times for each question and filtered out samples with incomplete inference traces. Then, we fed the full sequence into the model in a single forward pass and extracted the attention-weight submatrices corresponding to the answer and thinking tokens. Finally, we aggregated these matrices across all layers and heads and summed along the answer dimension (rows); the resulting 1D vector gives the importance score of each thinking token.
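The aggregation step can be sketched as follows (an illustration with an assumed `[layers, heads, seq, seq]` tensor layout, not the released code):

```python
# Sketch of the B.1 aggregation (assumed layout, not the released code):
# sum attention weights over layers, heads, and the answer rows.
import numpy as np

def thinking_token_importance(attn, answer_slice, thinking_slice):
    """attn: [layers, heads, seq, seq] attention weights from one forward pass."""
    sub = attn[:, :, answer_slice, thinking_slice]  # answer rows attending to thinking cols
    return sub.sum(axis=(0, 1, 2))                  # aggregate layers, heads, answer rows

rng = np.random.default_rng(0)
attn = rng.random((2, 4, 10, 10))  # toy: 2 layers, 4 heads, sequence length 10
scores = thinking_token_importance(attn, slice(8, 10), slice(0, 8))
assert scores.shape == (8,)        # one importance score per thinking token
```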
Based on the calculated importance scores, we employ three selection strategies to retain critical thinking tokens: top-$p\%$, bottom-$p\%$, and random sampling, where $p\in\{2,4,6,8,10,20,30,40,50\}$. The retained tokens are concatenated with the original question to form the input sequence, which is processed by vLLM over 5 independent runs using the aforementioned configuration. We report the average Pass@1 across these runs as the final accuracy.
### B.2 Ratio of Content Words
To investigate the distinctions of thinking tokens with varying importance scores, we employed spaCy to analyze the Part-of-Speech (POS) tags of each token. Specifically, we heuristically categorized nouns, verbs, adjectives, adverbs, and proper nouns as Content Words carrying substantive meaning, while treating other POS tags as Function Words with limited semantic information. The thinking tokens were sorted by importance score and then partitioned into ten equal parts. We report the ratio of Content Words and Function Words within each part in Fig 9. The tokens with higher importance scores exhibit a significantly higher proportion of content words, suggesting that they encode the core semantic meaning. Conversely, tokens with lower scores are predominantly function words, which primarily serve as syntactic scaffolding or intermediate states to maintain sequence coherence. Consequently, once the full sentence is generated, removing these low-importance tokens has a negligible impact on overall comprehension.
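The decile analysis can be sketched as follows (POS tags are supplied directly in this toy example; the paper derives them with spaCy's `token.pos_`, and the content-word tag set follows the heuristic above):

```python
# Sketch of the B.2 analysis: sort tokens by importance, split into deciles,
# and compute the content-word ratio per decile. POS tags are given directly
# here; the paper obtains them with spaCy (token.pos_).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def content_ratio_by_decile(scores, pos_tags, n_bins=10):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = len(order) // n_bins
    ratios = []
    for b in range(n_bins):
        part = order[b * size:(b + 1) * size]
        content = sum(pos_tags[i] in CONTENT_POS for i in part)
        ratios.append(content / len(part))
    return ratios

# Toy example: the highest-scoring tokens happen to be content words.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
tags = ["NOUN", "VERB", "ADJ", "PROPN", "ADV", "DET", "ADP", "PRON", "CCONJ", "PART"]
ratios = content_ratio_by_decile(scores, tags)
```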
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | AIME24 - Content vs. Function Word Ratio
### Overview
This is a horizontal bar chart comparing the ratio of content words to function words in text generated by R1-Llama, as measured by AIME24. The chart displays the ratio across different percentile ranges of word frequency. The x-axis represents the ratio in percentage, and the y-axis represents the percentile ranges of words.
### Components/Axes
* **Title:** R1-Llama | AIME24
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Word Frequency Percentile Range (Top 10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%)
* **Legend:**
* Content Words (Dark Red)
* Function Words (Light Gray)
### Detailed Analysis
The chart consists of horizontal bars, each representing a percentile range. Each bar is divided into two sections: a dark red section representing the ratio of content words and a light gray section representing the ratio of function words.
Here's a breakdown of the data, reading from top to bottom (highest percentile to lowest):
* **90-100%:** Content Words: 31.0%, Function Words: ~69.0% (estimated)
* **80-90%:** Content Words: 30.8%, Function Words: ~69.2% (estimated)
* **70-80%:** Content Words: 30.7%, Function Words: ~69.3% (estimated)
* **60-70%:** Content Words: 30.8%, Function Words: ~69.2% (estimated)
* **50-60%:** Content Words: 31.1%, Function Words: ~68.9% (estimated)
* **40-50%:** Content Words: 32.2%, Function Words: ~67.8% (estimated)
* **30-40%:** Content Words: 34.1%, Function Words: ~65.9% (estimated)
* **20-30%:** Content Words: 36.9%, Function Words: ~63.1% (estimated)
* **10-20%:** Content Words: 40.3%, Function Words: ~59.7% (estimated)
* **Top 10%:** Content Words: 45.4%, Function Words: ~54.6% (estimated)
The content word ratio generally increases as we move down the percentile ranges (from 90-100% to Top 10%). The function word ratio correspondingly decreases.
### Key Observations
* The ratio of content words is lowest in the 90-100% percentile range and highest in the Top 10% percentile range.
* The difference in ratio between the highest and lowest percentile ranges is approximately 14.4% (45.4% - 31.0%).
* The function word ratio is consistently higher than the content word ratio across all percentile ranges.
* The increase in content word ratio is not perfectly linear, with some ranges showing smaller increases than others.
### Interpretation
This chart suggests that R1-Llama, as evaluated by AIME24, tends to use a higher proportion of function words compared to content words, especially when considering less frequent words. However, as the word frequency increases (moving towards the Top 10%), the proportion of content words increases significantly. This could indicate that the model relies more on common, content-bearing words when generating more frequent text. The consistent dominance of function words suggests the model is proficient in grammatical structure and coherence, even when using less frequent vocabulary. The relatively small differences in ratios between adjacent percentile ranges suggest a gradual shift in word usage rather than abrupt changes. This data could be used to fine-tune the model to achieve a desired balance between content and function word usage, potentially improving the clarity or creativity of the generated text.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Bar Chart: Ratio of Content Words to Function Words (R1-Llama | AIME25)
### Overview
This is a horizontal bar chart displaying the ratio (in percentage) of content words to function words across different percentile ranges. The chart is titled "R1-Llama | AIME25". The x-axis represents the ratio in percentage, ranging from 0 to 100. The y-axis represents percentile ranges, from "Top 10%" to "90-100%". Two data series are presented: "Content Words" (represented by dark red bars) and "Function Words" (represented by light gray bars).
### Components/Axes
* **Title:** R1-Llama | AIME25 (top-center)
* **X-axis Label:** Ratio (%) (bottom-center)
* **Y-axis:** Percentile Ranges (left side)
* Top 10%
* 10-20%
* 20-30%
* 30-40%
* 40-50%
* 50-60%
* 60-70%
* 70-80%
* 80-90%
* 90-100%
* **Legend:** (top-right)
* Content Words (dark red)
* Function Words (light gray)
### Detailed Analysis
The chart shows the percentage of content words for each percentile range. The function word percentage is implicitly represented by the remaining portion of each bar.
Here's a breakdown of the data points:
* **Top 10%:** Content Words: 44.3%
* **10-20%:** Content Words: 39.3%
* **20-30%:** Content Words: 35.5%
* **30-40%:** Content Words: 32.6%
* **40-50%:** Content Words: 31.5%
* **50-60%:** Content Words: 30.0%
* **60-70%:** Content Words: 30.1%
* **70-80%:** Content Words: 30.7%
* **80-90%:** Content Words: 29.6%
* **90-100%:** Content Words: 29.3%
The function word percentages can be calculated by subtracting the content word percentage from 100%. For example:
* **Top 10%:** Function Words: 100% - 44.3% = 55.7%
* **10-20%:** Function Words: 100% - 39.3% = 60.7%
* **20-30%:** Function Words: 100% - 35.5% = 64.5%
* **30-40%:** Function Words: 100% - 32.6% = 67.4%
* **40-50%:** Function Words: 100% - 31.5% = 68.5%
* **50-60%:** Function Words: 100% - 30.0% = 70.0%
* **60-70%:** Function Words: 100% - 30.1% = 69.9%
* **70-80%:** Function Words: 100% - 30.7% = 69.3%
* **80-90%:** Function Words: 100% - 29.6% = 70.4%
* **90-100%:** Function Words: 100% - 29.3% = 70.7%
The content word percentage generally decreases as the percentile range increases (with a small uptick across the 50-80% ranges), while the function word percentage correspondingly increases.
### Key Observations
* The highest proportion of content words is found in the "Top 10%" range (44.3%).
* The lowest proportion of content words is found in the "90-100%" range (29.3%).
* The function word percentage is consistently higher than the content word percentage across all percentile ranges.
* The difference between content and function word percentages is smallest in the "Top 10%" range and largest in the "90-100%" range.
### Interpretation
The data suggests that words in the Top 10% range carry a noticeably higher proportion of content words than words in the 90-100% range, where function words dominate. This matches the expected division of labor: content words carry the primary meaning of a text, while function words serve grammatical purposes. The growing share of function words down the percentile ranks indicates that lower-ranked words primarily provide structure and connection rather than core meaning. This chart characterizes the lexical distribution of R1-Llama's outputs on the AIME25 benchmark, with content words concentrated in the top-ranked vocabulary.
</details>
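The analysis these charts summarize can be sketched in a few lines: rank a trace's tokens by some per-token importance score, split the ranking into percentile buckets, and measure the content-word share in each bucket. This is a minimal illustration only; `FUNCTION_WORDS`, the token list, and the `scores` array are assumptions, not the paper's actual word lists or scoring.

```python
# Illustrative (tiny) function-word list; a real analysis would use a
# full stopword/POS-based list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "so"}

def content_ratio_by_percentile(tokens, scores, n_buckets=10):
    """Content-word ratio (%) per percentile bucket, ordered from the
    top bucket (highest scores) down to the bottom bucket."""
    # Sort tokens by score, highest first.
    ranked = [tok for _, tok in sorted(zip(scores, tokens), reverse=True)]
    size = len(ranked)
    ratios = []
    for b in range(n_buckets):
        bucket = ranked[b * size // n_buckets:(b + 1) * size // n_buckets]
        content = sum(tok.lower() not in FUNCTION_WORDS for tok in bucket)
        ratios.append(100.0 * content / max(len(bucket), 1))
    return ratios
```

For example, `content_ratio_by_percentile(["alpha", "the", "beta", "of"], [4, 3, 2, 1], n_buckets=2)` returns `[50.0, 50.0]`, since each half of the ranking contains one content word and one function word.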
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Ratio of Content Words to Function Words (R1-Llama | AMC23)
### Overview
This is a horizontal bar chart displaying the ratio (in percentage) of content words to function words, categorized by frequency ranges. The chart is titled "R1-Llama | AMC23". The x-axis represents the ratio in percentage, ranging from 0 to 100. The y-axis represents frequency ranges, from "Top 10%" to "90-100%".
### Components/Axes
* **Title:** R1-Llama | AMC23 (top-center)
* **X-axis Label:** Ratio (%) (bottom-center)
* **Y-axis Labels:**
* Top 10%
* 10-20%
* 20-30%
* 30-40%
* 40-50%
* 50-60%
* 60-70%
* 70-80%
* 80-90%
* 90-100%
* **Legend:**
* Content Words (dark red)
* Function Words (light gray)
### Detailed Analysis
The chart consists of ten horizontal bars, each representing a frequency range. Each bar is divided into two sections: a dark red section representing the "Content Words" ratio and a light gray section representing the "Function Words" ratio.
Here's a breakdown of the data points, reading from the bottom up:
* **Top 10%:** Content Words: 45.3%, Function Words: approximately 54.7% (total 100%)
* **10-20%:** Content Words: 39.1%, Function Words: approximately 60.9%
* **20-30%:** Content Words: 35.4%, Function Words: approximately 64.6%
* **30-40%:** Content Words: 32.3%, Function Words: approximately 67.7%
* **40-50%:** Content Words: 30.6%, Function Words: approximately 69.4%
* **50-60%:** Content Words: 29.6%, Function Words: approximately 70.4%
* **60-70%:** Content Words: 29.1%, Function Words: approximately 70.9%
* **70-80%:** Content Words: 28.7%, Function Words: approximately 71.3%
* **80-90%:** Content Words: 27.4%, Function Words: approximately 72.6%
* **90-100%:** Content Words: 25.7%, Function Words: approximately 74.3%
The "Content Words" ratio generally decreases as the frequency range increases, while the "Function Words" ratio increases.
### Key Observations
* The ratio of content words is highest in the "Top 10%" frequency range (45.3%).
* The ratio of content words is lowest in the "90-100%" frequency range (25.7%).
* There is a consistent, downward trend in the content word ratio as the frequency range increases.
* The function word ratio consistently increases as the frequency range increases.
### Interpretation
The chart demonstrates an inverse relationship between word frequency and the proportion of content words. More frequent words (those appearing in the 90-100% range) are more likely to be function words (articles, prepositions, conjunctions, etc.), while less frequent words (those in the top 10%) are more likely to be content words (nouns, verbs, adjectives, etc.). This is expected, as function words provide grammatical structure and are used repeatedly, while content words carry the primary meaning and are more diverse.
The data suggests that the language model (R1-Llama) or the corpus (AMC23) exhibits a typical distribution of word frequencies, where a small number of words account for a large proportion of the total word count, and these are predominantly function words. This information could be useful for optimizing language models, improving text compression, or analyzing linguistic characteristics of the corpus. The consistent trend indicates a robust pattern rather than random variation.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | GPQA-D Word Ratio
### Overview
This is a horizontal bar chart displaying the ratio of "Content Words" to "Function Words" across different percentage ranges. The chart appears to analyze the composition of language used by the R1-Llama model on the GPQA-D dataset. The x-axis represents the ratio in percentage, and the y-axis represents the percentage ranges of words.
### Components/Axes
* **Title:** R1-Llama | GPQA-D
* **X-axis Label:** Ratio (%)
* **Y-axis:** Percentage ranges: Top >10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%
* **Legend:**
* Content Words (Dark Red)
* Function Words (Light Gray)
### Detailed Analysis
The chart consists of two sets of horizontal bars for each percentage range. The dark red bars represent the ratio of Content Words, and the light gray bars represent the ratio of Function Words.
Here's a breakdown of the data points, reading from top to bottom:
* **90-100%:** Content Words: 36.3%, Function Words: ~63.7% (estimated)
* **80-90%:** Content Words: 37.1%, Function Words: ~62.9% (estimated)
* **70-80%:** Content Words: 38.4%, Function Words: ~61.6% (estimated)
* **60-70%:** Content Words: 39.1%, Function Words: ~60.9% (estimated)
* **50-60%:** Content Words: 39.6%, Function Words: ~60.4% (estimated)
* **40-50%:** Content Words: 40.8%, Function Words: ~59.2% (estimated)
* **30-40%:** Content Words: 42.9%, Function Words: ~57.1% (estimated)
* **20-30%:** Content Words: 45.7%, Function Words: ~54.3% (estimated)
* **10-20%:** Content Words: 49.6%, Function Words: ~50.4% (estimated)
* **Top >10%:** Content Words: 56.6%, Function Words: ~43.4% (estimated)
The Content Words bars generally increase in length as the percentage range decreases, indicating a higher ratio of content words in the top 10% of words. Conversely, the Function Words bars decrease in length as the percentage range decreases.
### Key Observations
* The ratio of Content Words is highest in the "Top >10%" range (56.6%) and lowest in the "90-100%" range (36.3%).
* The difference between the Content Word and Function Word ratios is most pronounced in the "Top >10%" range.
* The trend shows a gradual increase in the ratio of Content Words as you move from higher to lower percentage ranges.
### Interpretation
The data suggests that the R1-Llama model, when evaluated on the GPQA-D dataset, uses a much higher proportion of content words among the most highly ranked words (Top >10%, 56.6%) than among the lowest-ranked ones (90-100%, 36.3%). The consistent rise in the content word ratio toward the top ranges, mirrored by a corresponding decline in function words, suggests a hierarchical structure in word usage in which the highest-ranked words carry more semantic weight, while function words still make up a substantial share of every range to supply grammatical structure.
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Horizontal Stacked Bar Chart: Ratio of Content Words vs. Function Words
### Overview
This is a horizontal stacked bar chart displaying the ratio of "Content Words" to "Function Words" across different percentage ranges. The chart is titled "R1-Llama | GK23EN" at the top-center. The x-axis represents the ratio in percentage (0-100%), and the y-axis represents the percentage ranges of words. Each bar is divided into two sections: a dark red section representing "Content Words" and a light gray section representing "Function Words".
### Components/Axes
* **Title:** R1-Llama | GK23EN
* **X-axis Label:** Ratio (%)
* **Y-axis:** Percentage ranges: Top >10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%
* **Legend:**
* Dark Red: Content Words
* Light Gray: Function Words
### Detailed Analysis
The chart shows the percentage of content words and function words for different ranges of word frequency. The bars are stacked horizontally, with the content word percentage on the left and the function word percentage on the right.
Here's a breakdown of the data, from the "Top >10%" range down to the "90-100%" range:
* **Top >10%:** Content Words: ~46.0%, Function Words: ~54.0%
* **10-20%:** Content Words: ~42.0%, Function Words: ~58.0%
* **20-30%:** Content Words: ~39.3%, Function Words: ~60.7%
* **30-40%:** Content Words: ~36.8%, Function Words: ~63.2%
* **40-50%:** Content Words: ~34.2%, Function Words: ~65.8%
* **50-60%:** Content Words: ~32.7%, Function Words: ~67.3%
* **60-70%:** Content Words: ~31.4%, Function Words: ~68.6%
* **70-80%:** Content Words: ~30.1%, Function Words: ~69.9%
* **80-90%:** Content Words: ~28.1%, Function Words: ~71.9%
* **90-100%:** Content Words: ~26.8%, Function Words: ~73.2%
The trend is that as the percentile range increases from "Top >10%" to "90-100%", the ratio of content words decreases and the ratio of function words increases.
### Key Observations
* The highest proportion of content words is found in the "Top >10%" range.
* The lowest proportion of content words is found in the "90-100%" range.
* The gap between the function word and content word ratios is widest in the 90-100% range and narrowest in the Top >10% range.
* The function word ratio consistently exceeds the content word ratio across all ranges.
### Interpretation
The data suggests that in the analyzed outputs (R1-Llama | GK23EN), the top-ranked words carry the highest share of content words, while lower-ranked words are increasingly dominated by function words. This is a common linguistic pattern: content words carry the primary meaning of a text, while function words (articles, prepositions, conjunctions, etc.) serve grammatical roles, and their majority share in every range reflects their essential role in structuring language. The declining content word ratio down the ranges suggests that the core meaning of the text is concentrated in a relatively small set of highly ranked words. The chart thus provides a quantitative view of the balance between meaningful content and grammatical structure.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Bar Chart: R1-Llama | MATH500 - Content vs. Function Word Ratio
### Overview
This is a horizontal bar chart displaying the ratio of content words to function words within different percentile ranges of a language model (R1-Llama) evaluated on a MATH500 dataset. The chart shows how the proportion of content words changes as you move from the most frequent words (Top 10%) to the least frequent words (90-100%).
### Components/Axes
* **Title:** R1-Llama | MATH500
* **X-axis:** Ratio (%) - Scale ranges from 0 to 100.
* **Y-axis:** Percentile Ranges - Categories are: Top 10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%.
* **Legend:**
* Content Words: Represented by dark red bars.
* Function Words: Represented by light gray bars.
### Detailed Analysis
The chart consists of ten horizontal bars, each representing a percentile range. The length of the dark red bar indicates the ratio of content words for that range, while the light gray portion represents the ratio of function words.
Here's a breakdown of the data points:
* **Top 10%:** 45.8% Content Words
* **10-20%:** 40.1% Content Words
* **20-30%:** 37.0% Content Words
* **30-40%:** 34.1% Content Words
* **40-50%:** 32.0% Content Words
* **50-60%:** 30.8% Content Words
* **60-70%:** 30.2% Content Words
* **70-80%:** 29.1% Content Words
* **80-90%:** 27.2% Content Words
* **90-100%:** 25.7% Content Words
**Trend:** The ratio of content words decreases consistently as you move down the percentile ranges (from Top 10% to 90-100%). This indicates that less frequent words are more likely to be function words.
### Key Observations
* The most significant difference in content word ratio is between the Top 10% and the 90-100% range (a difference of approximately 20.1%).
* The decrease in content word ratio appears to be relatively linear across the percentile ranges.
* The function word ratio increases as the content word ratio decreases, maintaining a total of 100% for each percentile range.
### Interpretation
The data suggests that the R1-Llama model, when evaluated on the MATH500 dataset, exhibits a clear pattern in the distribution of content and function words. The most frequent words contain the highest share of content words (nouns, verbs, adjectives, etc.), while less frequent words skew increasingly toward function words (prepositions, articles, conjunctions, etc.). This is a common linguistic phenomenon, as function words are essential for grammatical structure but carry less semantic weight.
The consistent downward trend in content word ratio indicates that the model's vocabulary is structured in a way that reflects this linguistic principle. This could be due to the training data used to develop the model or inherent properties of the MATH500 dataset itself. The chart provides insight into the model's understanding and utilization of different word types, which could be relevant for tasks such as text generation, question answering, and language understanding.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Content vs. Function Word Ratio
### Overview
The image presents a horizontal bar chart comparing the ratio of content words to function words across different percentage ranges. The chart is titled "R1-Qwen | AIME24". Each bar represents a percentage range, with the length of the red segment indicating the ratio of content words and the grey segment representing the ratio of function words.
### Components/Axes
* **X-axis:** Ratio (%) - Scale ranges from 0 to 100.
* **Y-axis:** Percentage Ranges - Listed as: Top >10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%.
* **Legend:** Located in the top-right corner.
* Red: Content Words
* Grey: Function Words
### Detailed Analysis
The chart displays the following data points. Apart from a small rise from the "Top >10%" to the "10-20%" range, the ratio of content words decreases as the percentage range increases.
* **Top >10%:** Content Words: 38.2%, Function Words: ~61.8%
* **10-20%:** Content Words: 39.4%, Function Words: ~60.6%
* **20-30%:** Content Words: 37.5%, Function Words: ~62.5%
* **30-40%:** Content Words: 35.6%, Function Words: ~64.4%
* **40-50%:** Content Words: 33.9%, Function Words: ~66.1%
* **50-60%:** Content Words: 32.8%, Function Words: ~67.2%
* **60-70%:** Content Words: 31.0%, Function Words: ~69.0%
* **70-80%:** Content Words: 29.1%, Function Words: ~70.9%
* **80-90%:** Content Words: 24.7%, Function Words: ~75.3%
* **90-100%:** Content Words: 19.0%, Function Words: ~81.0%
### Key Observations
* The highest ratio of content words is observed in the "10-20%" range (39.4%), closely followed by the "Top >10%" range (38.2%).
* The lowest ratio of content words is observed in the "90-100%" range (19.0%).
* Beyond the "10-20%" range, there is a consistent downward trend in the content word ratio as the percentage range increases.
* The function word ratio correspondingly increases as the percentage range increases.
### Interpretation
The data suggests that in the analyzed text (R1-Qwen | AIME24), the proportion of content words decreases as one considers words appearing less frequently. This implies that the most common words in the text are predominantly function words (articles, prepositions, conjunctions, etc.), while less frequent words are more likely to be content words (nouns, verbs, adjectives, etc.). This is a common characteristic of natural language, where a small set of function words accounts for a large proportion of the total word count. The chart provides a quantitative view of this distribution, showing how the balance shifts as one moves from the most frequent to the least frequent words. The consistent trend suggests a relatively stable linguistic structure within the analyzed text.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart: Ratio of Content Words to Function Words
### Overview
The image presents a horizontal bar chart illustrating the ratio (in percentage) of content words to function words across different frequency ranges. The chart is titled "R1-Qwen | AIME25". The x-axis represents the ratio in percentage, ranging from 0 to 100. The y-axis represents frequency ranges, from "Top 10%" to "90-100%". Two data series are displayed: "Content Words" (represented by dark red bars) and "Function Words" (represented by light gray bars).
### Components/Axes
* **Title:** R1-Qwen | AIME25 (top-center)
* **X-axis Label:** Ratio (%) (bottom-center)
* **Y-axis Label:** Frequency Ranges (left-side)
* **Legend:** Located in the top-right corner.
* "Content Words" - Dark Red
* "Function Words" - Light Gray
* **Y-axis Markers (Frequency Ranges):**
* Top 10%
* 10-20%
* 20-30%
* 30-40%
* 40-50%
* 50-60%
* 60-70%
* 70-80%
* 80-90%
* 90-100%
### Detailed Analysis
The chart displays the percentage of content words and function words for each frequency range. The function word percentage is represented by the length of the gray bars, while the content word percentage is represented by the length of the red bars.
Here's a breakdown of the data points:
* **Top 10%:** Content Words: 37.5%, Function Words: ~62.5%
* **10-20%:** Content Words: 37.5%, Function Words: ~62.5%
* **20-30%:** Content Words: 35.9%, Function Words: ~64.1%
* **30-40%:** Content Words: 34.9%, Function Words: ~65.1%
* **40-50%:** Content Words: 33.5%, Function Words: ~66.5%
* **50-60%:** Content Words: 32.2%, Function Words: ~67.8%
* **60-70%:** Content Words: 31.3%, Function Words: ~68.7%
* **70-80%:** Content Words: 29.7%, Function Words: ~70.3%
* **80-90%:** Content Words: 26.2%, Function Words: ~73.8%
* **90-100%:** Content Words: 20.9%, Function Words: ~79.1%
**Trend Verification:** The content word percentage generally decreases as the frequency range increases (it is flat between the Top 10% and 10-20% ranges), while the function word percentage correspondingly increases. This is visually apparent as the red segments shorten and the gray segments lengthen toward the 90-100% range.
### Key Observations
* The ratio of function words to content words is significantly higher in lower frequency ranges (90-100%) compared to higher frequency ranges (Top 10%).
* The difference in percentage between content and function words is most pronounced in the 90-100% range.
* The content word percentage decreases steadily across all frequency ranges.
### Interpretation
The data suggests that function words (articles, prepositions, conjunctions, etc.) are more prevalent in less frequent words, while content words (nouns, verbs, adjectives, etc.) dominate the most frequent words. This is consistent with linguistic theory, as function words provide grammatical structure and are essential for sentence formation, while content words carry the primary meaning. The decreasing trend of content word ratio with increasing frequency suggests that the most common words in a corpus are primarily grammatical elements rather than semantic ones. The "R1-Qwen | AIME25" title suggests this data is related to a specific language model (R1-Qwen) and a dataset (AIME25), potentially indicating the model's vocabulary distribution or the characteristics of the training data. The consistent trend suggests a robust pattern within the dataset and model.
</details>
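Several of the breakdowns above include a "Trend Verification" note asserting a (near-)monotonic decrease. Such claims are easy to sanity-check mechanically; `trend` below is a hypothetical helper, not part of the paper's code, and the ratios passed to it are read directly from the charts.

```python
def trend(values, tolerance=0.0):
    """Classify a sequence of ratios as 'decreasing', 'increasing', or
    'mixed'; steps no larger than `tolerance` count as flat."""
    downs = sum(b < a - tolerance for a, b in zip(values, values[1:]))
    ups = sum(b > a + tolerance for a, b in zip(values, values[1:]))
    if ups == 0 and downs > 0:
        return "decreasing"
    if downs == 0 and ups > 0:
        return "increasing"
    return "mixed"
```

With the x18 content-word ratios read from the top bucket down, `trend([37.5, 37.5, 35.9, 34.9, 33.5, 32.2, 31.3, 29.7, 26.2, 20.9])` returns `"decreasing"`, while a sequence with an uptick (as in the x17 chart) returns `"mixed"`, matching the hedged "generally decreases" wording.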
<details>
<summary>x19.png Details</summary>

### Visual Description
## Bar Chart: Word Ratio Distribution
### Overview
This is a horizontal bar chart displaying the ratio of content words to function words across different percentage ranges. The chart is titled "R1-Qwen | AMC23" at the top-center. The x-axis represents the ratio in percentage, and the y-axis represents the percentage ranges of word occurrences.
### Components/Axes
* **Title:** R1-Qwen | AMC23 (Top-center)
* **X-axis Label:** Ratio (%) (Bottom-center)
* **Y-axis Labels:**
* Top 10%
* 10-20%
* 20-30%
* 30-40%
* 40-50%
* 50-60%
* 60-70%
* 70-80%
* 80-90%
* 90-100%
* **Legend:** (Top-right)
* Content Words (Dark Red)
* Function Words (Light Gray)
### Detailed Analysis
The chart consists of horizontal bars, each representing a percentage range. Each bar is divided into two sections: a dark red section representing "Content Words" and a light gray section representing "Function Words". The length of each section corresponds to the ratio of that word type within the given percentage range.
Here's a breakdown of the data points, reading from top to bottom:
* **90-100%:** Content Words: 21.1%, Function Words: Approximately 78.9% (calculated as 100% - 21.1%)
* **80-90%:** Content Words: 24.6%, Function Words: Approximately 75.4%
* **70-80%:** Content Words: 27.7%, Function Words: Approximately 72.3%
* **60-70%:** Content Words: 30.1%, Function Words: Approximately 69.9%
* **50-60%:** Content Words: 31.9%, Function Words: Approximately 68.1%
* **40-50%:** Content Words: 33.5%, Function Words: Approximately 66.5%
* **30-40%:** Content Words: 35.8%, Function Words: Approximately 64.2%
* **20-30%:** Content Words: 37.5%, Function Words: Approximately 62.5%
* **10-20%:** Content Words: 39.2%, Function Words: Approximately 60.8%
* **Top 10%:** Content Words: 39.6%, Function Words: Approximately 60.4%
**Trend Verification:** The dark red "Content Words" bars consistently increase in length as we move down the chart (from 90-100% to Top 10%), indicating a higher ratio of content words in the lower percentage ranges. Conversely, the light gray "Function Words" bars decrease in length, showing a lower ratio of function words in those ranges.
### Key Observations
* The ratio of content words is significantly lower in the higher percentage ranges (90-100%) compared to the lower percentage ranges (Top 10%).
* The increase in content word ratio is relatively consistent across the percentage ranges, suggesting a gradual shift in word type distribution.
* The difference between the highest and lowest content word ratios is approximately 18.5% (39.6% - 21.1%).
### Interpretation
The chart shows the distribution of content and function words within a text corpus (the outputs of the R1-Qwen model on the AMC23 dataset). The content word ratio rises steadily toward the Top 10% range, indicating that the most highly ranked words carry the largest share of meaning-bearing vocabulary, although function words remain the majority in every range.
This consistent, gradual trend points to a systematic pattern in the word distribution rather than random fluctuation. Such lexical profiles can be useful for natural language processing tasks such as text summarization, keyword extraction, and sentiment analysis.
</details>
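The function-word complements marked "~" throughout these breakdowns, and spread figures such as the "approximately 18.5%" cited for this chart, follow directly from the content-word shares in a 100% stacked bar. A minimal helper (illustrative only; the inputs below use the values read from this chart):

```python
def complement_breakdown(ranges, content_pct):
    """Pair each content-word share with its function-word complement
    (a 100% stacked bar) and report the max-min spread across ranges."""
    rows = [(label, c, round(100 - c, 1)) for label, c in zip(ranges, content_pct)]
    spread = round(max(content_pct) - min(content_pct), 1)
    return rows, spread
```

For example, `complement_breakdown(["Top 10%", "90-100%"], [39.6, 21.1])` yields the rows `("Top 10%", 39.6, 60.4)` and `("90-100%", 21.1, 78.9)` with a spread of `18.5`.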
<details>
<summary>x20.png Details</summary>

### Visual Description
## Bar Chart: Ratio of Content Words to Function Words
### Overview
This is a horizontal bar chart displaying the ratio (in percentage) of content words to function words across different frequency ranges. The chart is titled "R1-Qwen | GPQA-D". The x-axis represents the ratio in percentage, ranging from 0 to 100. The y-axis represents frequency ranges, starting from "Top 10%" and going up to "90-100%". Two data series are presented: "Content Words" (represented by dark red bars) and "Function Words" (represented by light gray bars).
### Components/Axes
* **Title:** R1-Qwen | GPQA-D
* **X-axis Label:** Ratio (%)
* **Y-axis Labels (Frequency Ranges):** Top 10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%
* **Legend:**
* Content Words (Dark Red)
* Function Words (Light Gray)
### Detailed Analysis
The chart shows the percentage of content words and function words for each frequency range. The data points are as follows:
* **Top 10%:** Content Words: 47.9%, Function Words: ~52.1% (estimated from the bar length)
* **10-20%:** Content Words: 48.2%, Function Words: ~51.8%
* **20-30%:** Content Words: 46.9%, Function Words: ~53.1%
* **30-40%:** Content Words: 46.1%, Function Words: ~53.9%
* **40-50%:** Content Words: 44.7%, Function Words: ~55.3%
* **50-60%:** Content Words: 43.6%, Function Words: ~56.4%
* **60-70%:** Content Words: 42.0%, Function Words: ~58.0%
* **70-80%:** Content Words: 39.8%, Function Words: ~60.2%
* **80-90%:** Content Words: 36.4%, Function Words: ~63.6%
* **90-100%:** Content Words: 30.4%, Function Words: ~69.6%
**Trend Verification:**
The "Content Words" bars generally decrease in length as the frequency range increases, indicating a decreasing ratio of content words in less frequent word occurrences. Conversely, the "Function Words" bars increase in length with increasing frequency range, showing a higher ratio of function words in less frequent word occurrences.
### Key Observations
* The ratio of content words peaks in the "10-20%" frequency range (48.2%), with the "Top 10%" range close behind (47.9%).
* The ratio of function words is highest in the "90-100%" frequency range (~69.6%).
* There is a consistent inverse relationship between the ratio of content words and function words across all frequency ranges.
* The difference between the two ratios widens as the frequency range increases.
### Interpretation
The data suggests that the most highly ranked words (the "Top 10%" and "10-20%" ranges) carry the largest share of content words, while the lowest-ranked words (the "90-100%" range) are predominantly function words. This is expected: content words carry the primary meaning of a text, while function words (articles, prepositions, etc.) serve grammatical purposes. The widening gap toward the "90-100%" range shows that the lowest-ranked words are dominated by grammatical elements. The title "R1-Qwen | GPQA-D" indicates this data comes from the R1-Qwen model on the GPQA-D question answering dataset.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Stacked Bar Chart: Word Ratio Distribution
### Overview
The image presents a stacked horizontal bar chart illustrating the ratio of "Content Words" to "Function Words" across different percentage ranges. The chart is titled "R1-Qwen | GK23EN" at the top-center. The x-axis represents the ratio in percentage, ranging from 0 to 100. The y-axis displays percentage ranges, from "Top >10%" to "90-100%". Each bar is divided into two sections: a dark red section representing "Content Words" and a light gray section representing "Function Words".
### Components/Axes
* **Title:** R1-Qwen | GK23EN
* **X-axis Label:** Ratio (%)
* **Y-axis Labels:** Top >10%, 10-20%, 20-30%, 30-40%, 40-50%, 50-60%, 60-70%, 70-80%, 80-90%, 90-100%
* **Legend:**
* Dark Red: Content Words
* Light Gray: Function Words
### Detailed Analysis
The chart displays the percentage of content words and function words within specified ratio ranges. The bars are stacked horizontally, with the content word percentage displayed on the left and the function word percentage filling the remainder of the bar.
Here's a breakdown of the data, reading from the bottom up:
* **Top >10%:** Content Words: ~38.6%, Function Words: ~61.4%
* **10-20%:** Content Words: ~40.0%, Function Words: ~60.0%
* **20-30%:** Content Words: ~39.2%, Function Words: ~60.8%
* **30-40%:** Content Words: ~37.4%, Function Words: ~62.6%
* **40-50%:** Content Words: ~35.6%, Function Words: ~64.4%
* **50-60%:** Content Words: ~33.6%, Function Words: ~66.4%
* **60-70%:** Content Words: ~31.6%, Function Words: ~68.4%
* **70-80%:** Content Words: ~29.1%, Function Words: ~70.9%
* **80-90%:** Content Words: ~26.2%, Function Words: ~73.8%
* **90-100%:** Content Words: ~23.1%, Function Words: ~76.9%
Apart from a small rise from the "Top >10%" range to the "10-20%" range, the ratio of Content Words decreases steadily as the percentage range increases, with a corresponding increase in the ratio of Function Words.
### Key Observations
* The proportion of Function Words consistently exceeds that of Content Words across all ratio ranges.
* The most significant difference between Content and Function Word ratios occurs in the highest ratio range (90-100%), where Function Words comprise approximately 76.9% of the total.
* The difference between the two word types is smallest in the 10-20% range, with Content Words at 40% and Function Words at 60%.
### Interpretation
This chart likely represents the linguistic composition of the reasoning traces labeled "R1-Qwen | GK23EN". The data suggests that the traces are dominated by function words (articles, prepositions, conjunctions, etc.) compared to content words (nouns, verbs, adjectives, etc.), and that this dominance grows among the less important tokens: as token importance decreases, the proportion of function words becomes even more pronounced. The labels "R1-Qwen" and "GK23EN" likely refer to the model and dataset used in the analysis.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: Content vs. Function Word Ratio in MATH500 (R1-Qwen)
### Overview
This is a horizontal bar chart displaying the ratio of content words to function words across importance-score deciles within a text corpus (MATH500) analyzed by R1-Qwen. The chart shows how the proportion of content words varies as increasingly important tokens are considered.
### Components/Axes
* **Title:** R1-Qwen | MATH500
* **X-axis:** Ratio (%) - Scale ranges from 0 to 100.
* **Y-axis:** Percentage ranges of words, labeled as follows (from top to bottom):
* 90-100%
* 80-90%
* 70-80%
* 60-70%
* 50-60%
* 40-50%
* 30-40%
* 20-30%
* 10-20%
* Top 10%
* **Legend:**
* Content Words (Dark Red)
* Function Words (Light Gray)
### Detailed Analysis
The chart consists of horizontal bars representing the ratio of content words for each percentage range. The function word portion is represented by the remaining space to 100%.
Here's a breakdown of the data points, reading from top to bottom:
* **90-100%:** Content Words: 20.6%
* **80-90%:** Content Words: 24.5%
* **70-80%:** Content Words: 27.5%
* **60-70%:** Content Words: 29.8%
* **50-60%:** Content Words: 31.9%
* **40-50%:** Content Words: 33.9%
* **30-40%:** Content Words: 36.0%
* **20-30%:** Content Words: 37.8%
* **10-20%:** Content Words: 38.9%
* **Top 10%:** Content Words: 39.8%
The trend is clearly upward. As the decile range decreases (i.e., considering more important tokens), the ratio of content words increases. The increase is not linear, and appears to slow as the top 10% is approached.
### Key Observations
* The ratio of content words is lowest in the 90-100% range (20.6%) and highest in the top 10% range (39.8%).
* The difference between the lowest and highest content word ratios is 19.2 percentage points.
* The increase in content word ratio is most pronounced between the 90-100% and 20-30% ranges.
### Interpretation
This chart demonstrates that the least important tokens in the MATH500 corpus (as analyzed by R1-Qwen) tend to be function words (articles, prepositions, conjunctions, etc.), while the most important tokens are more likely to be content words (nouns, verbs, adjectives, etc.).
The upward trend suggests that, focusing on the tokens the model treats as decision-critical, the proportion of meaningful content words increases. This could be useful for tasks like keyword extraction or token selection, where identifying the most important content words is crucial.
The slowing increase toward the top 10% indicates that even in the most important decile, function words still make up a majority of tokens; further analysis would be needed to determine the exact composition of that decile.
</details>
Figure 9: Proportion of content words versus function words in thinking tokens. Bars represent deciles sorted by importance score; e.g., the bottom bar indicates the ratio for the top-10% most important tokens.
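The decile analysis behind Figure 9 can be sketched as follows. This is an illustrative reconstruction, assuming per-token importance scores are already available; the function-word list here is a small hypothetical subset, not the lexicon used in the paper.

```python
# Hypothetical (illustrative) function-word list; the paper's actual
# lexicon of function words is not specified here.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "and", "so", "is", "that",
    "we", "it", "this", "if", "then", "for", "as", "be", "can",
}

def content_ratio_per_decile(tokens, scores):
    """Return the fraction of content words in each importance decile.

    Deciles are formed by sorting tokens by importance score in
    descending order, so decile 0 holds the top-10% most important
    tokens (the bottom bar in Figure 9).
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    ratios = []
    for d in range(10):
        bucket = order[d * n // 10 : (d + 1) * n // 10]
        content = sum(tokens[i].lower() not in FUNCTION_WORDS for i in bucket)
        ratios.append(content / max(len(bucket), 1))
    return ratios

# Toy example: content-bearing tokens carry higher importance scores.
tokens = ["compute", "the", "derivative", "of", "f", "so", "we", "apply",
          "chain", "rule"]
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.1, 0.1, 0.6, 0.85, 0.75]
print(content_ratio_per_decile(tokens, scores))
```

On real traces, plotting these ten ratios as stacked bars (content vs. 1 − content) reproduces the layout of Figure 9.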
<details>
<summary>x23.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" against "Reasoning Step". The chart appears to represent the fluctuation of importance during a reasoning process, divided into "Question" and "Thinking" phases. A horizontal dotted line indicates the mean importance score and ratio.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 7000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not numerically defined, but visually represents relative importance.
* **Chart Area:** Filled with a fluctuating blue line representing the importance score over reasoning steps.
* **Phase Labels:** Two labels are present above the chart: "Question" (covering steps 0-50) and "Thinking" (covering steps 50-7000).
* **Mean Score Indicator:** A horizontal dotted red line with the text "Mean Score: 0.228, Ratio: 0.239" positioned approximately at the y-value of 0.23.
* **Color Scheme:** Blue for the importance score line, red for the mean score indicator.
### Detailed Analysis
The chart shows a highly variable importance score over the reasoning steps.
* **Question Phase (0-50):** The importance score starts high and rapidly decreases, exhibiting significant fluctuations. The score appears to be generally higher in this phase compared to the initial part of the "Thinking" phase.
* **Thinking Phase (50-7000):** The importance score remains generally low, with frequent, smaller fluctuations. There are occasional spikes in importance, but they are less pronounced and less frequent than in the "Question" phase.
* **Mean Score:** The mean importance score is 0.228, and the ratio is 0.239. This represents the average importance score across all reasoning steps.
* **Data Points (Approximate):**
* At Reasoning Step 0, Importance Score is approximately "High".
* At Reasoning Step 25, Importance Score is approximately 0.6 (estimated visually).
* At Reasoning Step 50, Importance Score is approximately 0.2 (estimated visually).
* At Reasoning Step 1000, Importance Score is approximately 0.15 (estimated visually).
* At Reasoning Step 5000, Importance Score is approximately 0.3 (estimated visually).
* At Reasoning Step 6500, Importance Score is approximately 0.1 (estimated visually).
### Key Observations
* The importance score is significantly higher during the "Question" phase compared to the "Thinking" phase.
* The "Thinking" phase exhibits a generally low and stable importance score, with occasional spikes.
* The fluctuations in importance score suggest a dynamic reasoning process where the relevance of different steps varies considerably.
* The mean importance score is relatively low (0.228), indicating that, on average, most reasoning steps have a low importance score.
### Interpretation
The chart suggests that the initial "Question" phase is characterized by high cognitive activity and fluctuating relevance, while the "Thinking" phase involves a more sustained, but generally lower-intensity, reasoning process. The high variability in the importance score indicates that the reasoning process is not linear or uniform; some steps are more crucial than others. The relatively low mean importance score suggests that a significant portion of the reasoning process involves steps that are not directly contributing to the overall solution. The ratio of 0.239 could represent the proportion of steps exceeding the mean importance. This data could be used to analyze the efficiency of a reasoning process and identify areas for improvement. The chart implies that focusing on the initial question phase and identifying key reasoning steps could lead to more effective problem-solving.
</details>
<details>
<summary>x24.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" against "Reasoning Step". The chart appears to represent the importance of each reasoning step during a question-answering or problem-solving process. The chart is divided into two sections labeled "Question" and "Thinking".
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from approximately 0 to 12500.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not explicitly numerical, but visually represents relative importance.
* **Data Series:** A single fluctuating line representing the Importance Score over Reasoning Steps.
* **Annotations:**
* "Question" label positioned above the first section of the chart (Reasoning Steps 0-100).
* "Thinking" label positioned above the second section of the chart (Reasoning Steps 100-12500).
* A horizontal dashed red line labeled "Mean Score: 0.138; Ratio: 0.224".
### Detailed Analysis
The chart shows a high concentration of importance scores in the initial "Question" phase (Reasoning Steps 0-100). The Importance Score fluctuates significantly during this phase, with many peaks reaching towards the "High" end of the scale. After the initial phase, the Importance Score generally decreases and remains relatively low throughout the "Thinking" phase (Reasoning Steps 100-12500). There are occasional spikes in the Importance Score during the "Thinking" phase, but these are less frequent and generally lower in magnitude than those observed in the "Question" phase.
* **Question Phase (0-100 Reasoning Steps):** The Importance Score fluctuates rapidly between low and high values. The peaks are frequent and relatively high.
* **Thinking Phase (100-12500 Reasoning Steps):** The Importance Score remains mostly low, with occasional, smaller spikes. The line is generally closer to the "Low" end of the scale.
* **Mean Score:** The average Importance Score across all Reasoning Steps is 0.138.
* **Ratio:** A ratio of 0.224 is provided, but its meaning is not explicitly defined in the image.
### Key Observations
* The initial "Question" phase exhibits significantly higher Importance Scores compared to the "Thinking" phase.
* The Importance Score decreases substantially after the initial phase, suggesting that the early reasoning steps are more critical.
* The "Thinking" phase is characterized by relatively low and stable Importance Scores, with occasional minor fluctuations.
* The ratio of 0.224 may represent a metric related to the distribution of importance scores, but its exact meaning is unclear without further context.
### Interpretation
The data suggests that the initial steps of reasoning, likely related to understanding and framing the question, are significantly more important than the subsequent "Thinking" steps. This could indicate that the majority of the cognitive effort is focused on correctly interpreting the problem rather than the actual problem-solving process. The low mean Importance Score and the relatively flat line in the "Thinking" phase suggest that most reasoning steps during this phase contribute little to the overall importance. The ratio of 0.224 could potentially represent the proportion of reasoning steps that exceed a certain importance threshold, but this is speculative. The chart highlights a potential bottleneck in the reasoning process, where the initial question understanding is crucial, and subsequent steps are less impactful. This could be relevant for optimizing AI reasoning systems or understanding human cognitive processes.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Step". The chart appears to represent the fluctuation of importance during a reasoning process, divided into "Question" and "Thinking" phases. A horizontal dotted line indicates the mean importance score and ratio.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 7200.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not numerically defined, but visually represents a relative importance level.
* **Chart Area:** The main area displaying the importance score fluctuations.
* **Phase Labels:** Two labels indicating phases: "Question" (from 0 to approximately 100 Reasoning Steps) and "Thinking" (from approximately 100 to 7200 Reasoning Steps).
* **Mean Score Line:** A horizontal dotted red line labeled "Mean Score: 0.208, Ratio: 0.281".
* **Color Scheme:** A blue color gradient is used to represent the importance score, with darker shades indicating higher scores.
### Detailed Analysis
The chart shows a distinct pattern in the importance score over the reasoning steps.
* **Question Phase (0-100 Reasoning Steps):** The importance score starts at a low level and rapidly increases to a high level within the first 100 reasoning steps. The score fluctuates significantly during this phase.
* **Thinking Phase (100-7200 Reasoning Steps):** After the initial "Question" phase, the importance score drops to a consistently low level, with frequent, small fluctuations. The score remains relatively stable around the mean score.
* **Mean Score:** The mean importance score is 0.208, with a ratio of 0.281. This line serves as a baseline for comparison.
* **Fluctuations in Thinking Phase:** The fluctuations in the "Thinking" phase are generally small, with the importance score rarely exceeding the mean score. There are occasional spikes, but they are short-lived.
* **Visual Trend:** The overall trend shows a sharp increase in importance during the "Question" phase, followed by a sustained low level of importance during the "Thinking" phase.
### Key Observations
* The "Question" phase is characterized by high importance and significant fluctuations.
* The "Thinking" phase is characterized by low, relatively stable importance.
* The mean importance score is low, suggesting that the reasoning process generally involves low importance levels.
* The ratio of 0.281 is not clearly defined in the context of the chart, but it may represent a normalized measure of importance.
### Interpretation
The chart suggests a two-stage reasoning process. The initial "Question" phase involves a rapid assessment of importance, likely related to identifying relevant information or formulating a problem. The subsequent "Thinking" phase involves a more sustained, but less intense, level of reasoning. The low mean importance score in the "Thinking" phase could indicate that the reasoning process involves a lot of background processing or exploration of less critical information. The sharp contrast between the two phases suggests a shift in cognitive focus. The ratio value may be a metric related to the proportion of high-importance steps within the reasoning process. The data suggests that the initial question-framing stage is significantly more important than the subsequent thinking stage, at least as measured by this "Importance Score".
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Step". The chart appears to represent the importance of each reasoning step during a question-answering or problem-solving process, divided into "Question" and "Thinking" phases. The chart is a line-like representation with significant fluctuations in importance score.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 14000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High".
* **Chart Title/Sections:** The chart is divided into two sections labeled "Question" (from 0 to approximately 200 Reasoning Steps) and "Thinking" (from approximately 200 to 14000 Reasoning Steps).
* **Horizontal Line:** A dotted red horizontal line is present, labeled "Mean Score: 0.170, Ratio: 0.237".
* **Color Scheme:** The chart uses a blue color gradient to represent the importance score.
### Detailed Analysis
The chart displays a fluctuating importance score over the reasoning steps.
* **Question Phase (0-200 Reasoning Steps):** The importance score starts at a low value and rapidly increases to a high value within the first 200 reasoning steps. The score fluctuates significantly within this range.
* **Thinking Phase (200-14000 Reasoning Steps):** The importance score generally remains low, with frequent and substantial fluctuations. The score oscillates between low and moderate values, with occasional spikes reaching higher levels.
* **Mean Score:** The average importance score across all reasoning steps is 0.170.
* **Ratio:** The ratio associated with the mean score is 0.237.
### Key Observations
* The "Question" phase exhibits a much higher initial importance score compared to the "Thinking" phase.
* The "Thinking" phase is characterized by a consistently lower, but highly variable, importance score.
* The fluctuations in the "Thinking" phase suggest that the relevance of individual reasoning steps varies considerably.
* The mean score of 0.170 indicates that, on average, the importance score is relatively low throughout the entire reasoning process.
### Interpretation
The chart suggests that the initial steps involved in understanding a question ("Question" phase) are significantly more important than the subsequent reasoning steps ("Thinking" phase). However, the high variability in the "Thinking" phase indicates that some reasoning steps are crucial, while others are less relevant. The low mean score suggests that the overall reasoning process is not consistently focused on highly important steps.
The ratio of 0.237 could represent the proportion of reasoning steps that exceed the mean importance score. This implies that approximately 23.7% of the reasoning steps are considered more important than the average.
This data could be used to optimize reasoning algorithms or to identify areas where the reasoning process can be improved. For example, focusing on the initial question understanding phase or identifying and prioritizing the most important reasoning steps could lead to more efficient and accurate problem-solving. The chart highlights the dynamic nature of the reasoning process and the importance of considering the context of each reasoning step.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Step". The chart appears to represent the importance of each reasoning step during a question-answering or problem-solving process. The chart is divided into two phases: "Question" and "Thinking".
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to 5000, with major tick marks at 0, 1000, 2000, 3000, 4000, and 5000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The axis is not numerically scaled.
* **Chart Title:** Not explicitly present, but the chart represents the relationship between Reasoning Step and Importance Score.
* **Phase Labels:** "Question" (from 0 to approximately 40 Reasoning Steps) and "Thinking" (from approximately 40 to 5000 Reasoning Steps).
* **Mean Score Line:** A horizontal dashed red line indicating a "Mean Score: 0.325; Ratio: 0.257".
* **Data Series:** A single series of vertical bars representing the Importance Score at each Reasoning Step.
### Detailed Analysis
The chart shows a distinct difference in Importance Score between the "Question" and "Thinking" phases.
* **Question Phase (0-40 Reasoning Steps):** The Importance Score is initially high, with many bars reaching near the "High" end of the scale. The score rapidly decreases as the Reasoning Step increases, becoming relatively low by step 40. The bars are densely packed and vary significantly in height.
* **Thinking Phase (40-5000 Reasoning Steps):** The Importance Score fluctuates around the Mean Score line (0.325). The bars are generally shorter and more uniformly distributed, with occasional spikes reaching higher values. The fluctuations appear somewhat random.
**Approximate Data Points (based on visual estimation):**
* **Reasoning Step 0:** Importance Score is approximately 0.8-0.9.
* **Reasoning Step 10:** Importance Score is approximately 0.6-0.7.
* **Reasoning Step 20:** Importance Score is approximately 0.4-0.5.
* **Reasoning Step 30:** Importance Score is approximately 0.2-0.3.
* **Reasoning Step 40:** Importance Score is approximately 0.1-0.2.
* **Reasoning Step 500:** Importance Score is approximately 0.3-0.4.
* **Reasoning Step 1000:** Importance Score is approximately 0.2-0.3.
* **Reasoning Step 2000:** Importance Score is approximately 0.3-0.4.
* **Reasoning Step 3000:** Importance Score is approximately 0.2-0.3.
* **Reasoning Step 4000:** Importance Score is approximately 0.3-0.4.
* **Reasoning Step 5000:** Importance Score is approximately 0.2-0.3.
### Key Observations
* The initial "Question" phase exhibits a rapid decline in Importance Score, suggesting that the initial steps of question processing are the most crucial.
* The "Thinking" phase shows a relatively stable, but fluctuating, Importance Score around the mean. This indicates that many reasoning steps contribute somewhat to the overall process, but none are overwhelmingly important.
* The Mean Score of 0.325 and Ratio of 0.257 provide a baseline for evaluating the importance of individual reasoning steps.
* There are occasional spikes in the "Thinking" phase, suggesting that certain reasoning steps are more critical than others.
### Interpretation
This chart likely represents the attention weights or importance scores assigned to each reasoning step in a large language model (LLM) or similar AI system during a question-answering task. The rapid decline in importance during the "Question" phase suggests that the initial steps of understanding the question and formulating a plan are the most critical. The more stable, fluctuating importance in the "Thinking" phase indicates that the subsequent reasoning steps contribute incrementally to the solution.
The Mean Score and Ratio likely represent the average importance score and some measure of the distribution of importance scores, respectively. The spikes in the "Thinking" phase could correspond to key insights or critical reasoning steps that significantly contribute to the final answer.
The chart suggests that the model initially focuses heavily on understanding the question, and then distributes its attention more broadly across a range of reasoning steps during the problem-solving process. The relatively low overall importance scores in the "Thinking" phase could indicate that the model relies on a distributed representation of knowledge and reasoning, rather than a few dominant steps.
</details>
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a line chart illustrating the relationship between "Reasoning Step" and "Importance Score". The chart appears to represent the evolution of importance during a problem-solving or reasoning process, divided into "Question" and "Thinking" phases. The y-axis represents the "Importance Score" ranging from "Low" to "High", while the x-axis represents the "Reasoning Step" from 0 to approximately 7500. A horizontal dashed line indicates the "Mean Score" and "Ratio".
### Components/Axes
* **X-axis:** "Reasoning Step" - Scale from 0 to approximately 7500.
* **Y-axis:** "Importance Score" - Qualitative scale labeled "Low" at the bottom and "High" at the top.
* **Title:** Not explicitly present, but the chart depicts "Importance Score vs. Reasoning Step".
* **Phases:** The chart is divided into two phases labeled "Question" (from 0 to approximately 100) and "Thinking" (from approximately 100 to 7500).
* **Mean Score Line:** A horizontal dashed red line indicating the average importance score.
* **Mean Score Text:** "Mean Score: 0.252; Ratio: 0.223"
* **Data Series:** A single blue line representing the importance score over reasoning steps.
### Detailed Analysis
The blue line representing the "Importance Score" exhibits a distinct pattern.
* **Question Phase (0-100 Reasoning Steps):** The line starts at a low importance score and rapidly increases to a high importance score within the first 100 reasoning steps. The line fluctuates significantly within this range.
* **Thinking Phase (100-7500 Reasoning Steps):** After the initial spike, the importance score generally decreases and stabilizes around the mean score. The line continues to fluctuate, but the amplitude of the fluctuations is smaller than in the "Question" phase. There are occasional spikes in importance, but they are less frequent and less pronounced.
* **Mean Score:** The horizontal dashed red line is positioned at approximately 0.252 on the "Importance Score" axis.
* **Ratio:** The text indicates a "Ratio" of 0.223, but its meaning is not explicitly defined within the chart.
Approximate data points (reading from the chart):
* Step 0: Importance Score ~ 0.05
* Step 50: Importance Score ~ 0.6
* Step 100: Importance Score ~ 0.4
* Step 1000: Importance Score ~ 0.2
* Step 2000: Importance Score ~ 0.3
* Step 3000: Importance Score ~ 0.2
* Step 4000: Importance Score ~ 0.25
* Step 5000: Importance Score ~ 0.2
* Step 6000: Importance Score ~ 0.28
* Step 7000: Importance Score ~ 0.22
* Step 7500: Importance Score ~ 0.23
### Key Observations
* The "Question" phase is characterized by a rapid increase and high variability in importance.
* The "Thinking" phase exhibits a more stable, but still fluctuating, importance score around the mean.
* The importance score generally decreases after the initial "Question" phase.
* The "Ratio" value (0.223) is provided, but its significance is unclear without further context.
### Interpretation
The chart suggests a process where initial problem formulation ("Question" phase) involves a rapid assessment of importance, leading to a peak in relevance. As the reasoning process progresses into the "Thinking" phase, the importance score stabilizes, indicating a more focused and sustained level of relevance. The fluctuations within the "Thinking" phase likely represent the exploration of different ideas or aspects of the problem. The mean importance score provides a baseline for evaluating the overall relevance of the reasoning process.
The sharp transition between the "Question" and "Thinking" phases suggests a shift in cognitive strategy. The initial phase is exploratory and dynamic, while the subsequent phase is more deliberate and analytical. The decreasing trend in importance after the initial spike could indicate that the most critical aspects of the problem have been identified, and the subsequent reasoning steps focus on refining and elaborating on those aspects.
The "Ratio" value might represent a measure of efficiency or effectiveness, but its precise meaning requires additional information. It could be the ratio of important steps to total steps, or some other metric related to the reasoning process.
</details>
Figure 10: Visualization of token importance scores for question and thinking tokens within DeepSeek-R1-Distill-Llama-8B reasoning traces across six samples from AIME24, AIME25, AMC23, GPQA-D, GAOKAO2023EN, and MATH500.
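One plausible way to produce the per-token importance scores and the annotated statistics in Figure 10 is sketched below. This is an illustrative reduction over attention maps, not necessarily the paper's exact scoring rule, and reading "Ratio" as the fraction of tokens scoring above the mean is an assumption (the figure descriptions above note its meaning is not defined in the images).

```python
import numpy as np

def token_importance(attn):
    """Reduce causal attention weights to one importance score per token.

    attn: array of shape (heads, seq, seq), rows normalised over keys.
    Each token's score is the average attention it receives from all
    later query positions, averaged across heads.
    """
    heads, seq, _ = attn.shape
    received = attn.mean(axis=0)                 # (seq, seq), avg over heads
    mask = np.tril(np.ones((seq, seq)), k=-1)    # mask[q, k] = 1 iff k < q
    counts = mask.sum(axis=0)                    # queries that can see key k
    return (received * mask).sum(axis=0) / np.maximum(counts, 1)

# Synthetic causal attention map for demonstration.
rng = np.random.default_rng(0)
raw = np.tril(rng.random((4, 16, 16)))           # zero out future positions
attn = raw / raw.sum(axis=-1, keepdims=True)     # row-normalise per query

scores = token_importance(attn)
mean = scores.mean()
ratio = (scores > mean).mean()                   # assumed meaning of "Ratio"
print(f"Mean Score: {mean:.3f}, Ratio: {ratio:.3f}")
```

Plotting `scores` against token position, with the question span and thinking span labeled and a dashed line at `mean`, yields the layout of the panels in Figure 10.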
<details>
<summary>x29.png Details</summary>

### Visual Description
## Chart: Importance Score over Reasoning Steps
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Steps". The chart appears to represent the fluctuation of importance during a question-answering or problem-solving process, divided into "Question" and "Thinking" phases. The chart is a time series, with the x-axis representing the reasoning step and the y-axis representing the importance score. A horizontal dashed line indicates the mean importance score.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from approximately 0 to 12000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not explicitly numerical, but visually represents a gradient.
* **Title/Labels:** The chart is divided into two sections labeled "Question" (from 0 to approximately 50 Reasoning Steps) and "Thinking" (from approximately 50 to 12000 Reasoning Steps).
* **Horizontal Line:** A dashed red horizontal line is present, labeled "Mean Score: 0.416; Ratio: 0.201".
* **Color Scheme:** The chart uses a blue gradient to represent the importance score, with darker blues indicating higher scores and lighter blues indicating lower scores.
### Detailed Analysis
The chart displays a fluctuating importance score over the reasoning steps.
* **Question Phase (0-50 Reasoning Steps):** The importance score exhibits high variability, with frequent peaks and troughs. The score fluctuates rapidly between "Low" and "High".
* **Thinking Phase (50-12000 Reasoning Steps):** The importance score remains generally lower than in the "Question" phase, with more frequent values closer to the "Low" end of the scale. The fluctuations are still present, but the amplitude appears reduced compared to the "Question" phase. The score oscillates around the mean score.
* **Mean Score:** The mean importance score is 0.416, with a ratio of 0.201. The ratio is not defined in the image.
* **Visual Trend:** The "Question" phase shows a burst of activity, while the "Thinking" phase shows a more sustained, but lower-amplitude, oscillation.
### Key Observations
* The importance score is significantly higher and more volatile during the "Question" phase compared to the "Thinking" phase.
* The mean importance score provides a baseline for comparison, and the majority of the "Thinking" phase data points appear to fall around this value.
* The ratio of 0.201 is provided alongside the mean score, but its meaning is not explained in the image.
### Interpretation
The chart suggests that the initial stage of problem-solving ("Question") involves a higher degree of fluctuating importance, potentially reflecting the exploration of different potential approaches or the identification of key information. As the reasoning process progresses into the "Thinking" phase, the importance score stabilizes and generally decreases, indicating a more focused and consistent line of thought. The high variability in the "Question" phase could represent the initial uncertainty and exploration, while the lower variability in the "Thinking" phase suggests a more refined and directed reasoning process. The mean score and ratio provide quantitative measures of the overall importance and potentially the efficiency of the reasoning process, but the meaning of the ratio requires further context. The chart demonstrates a clear distinction between the initial exploratory phase and the subsequent focused reasoning phase.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" across "Reasoning Steps". The chart appears to represent the importance of each step in a reasoning process, potentially in the context of a question-answering or problem-solving system. The chart is divided into two sections labeled "Question" and "Thinking".
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from approximately 0 to 12500.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not explicitly numerical, but visually represents a continuous range.
* **Sections:** The chart is divided into two sections:
    * "Question" - spanning from Reasoning Step 0 to approximately 100.
    * "Thinking" - spanning from Reasoning Step 100 to 12500.
* **Horizontal Line:** A dashed red horizontal line is present, labeled "Mean Score: 0.346; Ratio: 0.205".
* **Color Scale:** A vertical color scale on the left side indicates that darker blue represents a "High" Importance Score, while lighter blue represents a "Low" Importance Score.
### Detailed Analysis
The chart displays a fluctuating Importance Score across Reasoning Steps.
* **Question Section (0-100):** The Importance Score exhibits high variability within the "Question" section. The score fluctuates rapidly between low and high values. There are several peaks of high importance, followed by rapid declines.
* **Thinking Section (100-12500):** The "Thinking" section shows a generally lower and more stable Importance Score compared to the "Question" section. The score fluctuates around the mean score, with many steps having a low importance score. The fluctuations are less dramatic than in the "Question" section.
* **Mean Score:** The mean Importance Score is 0.346, with a ratio of 0.205. The ratio is not defined in the image.
* **Visual Trend:** The overall trend in the "Thinking" section is relatively flat, indicating that most reasoning steps have a similar level of importance.
### Key Observations
* The "Question" section demonstrates significantly higher variability in Importance Score than the "Thinking" section.
* The mean Importance Score is relatively low (0.346), suggesting that most individual reasoning steps contribute only modestly to the overall trace.
* The "Thinking" section is much longer than the "Question" section, indicating that the reasoning process involves a significantly larger number of steps dedicated to "Thinking" compared to "Question".
### Interpretation
The data suggests that the initial "Question" phase involves a more dynamic and variable assessment of importance, potentially reflecting the process of understanding and framing the problem. The subsequent "Thinking" phase, however, is characterized by a more consistent, but generally lower, level of importance for each reasoning step. This could indicate that the initial question is crucial for setting the stage, but the subsequent reasoning steps are more incremental and less individually impactful.
The low mean Importance Score in the "Thinking" phase might suggest that the reasoning process is distributed across many steps, with no single step being overwhelmingly important. Alternatively, it could indicate that the Importance Score metric is not sensitive enough to capture the subtle contributions of individual reasoning steps. The ratio of 0.205 is not explained, but could be a measure of the variance or distribution of importance scores.
The chart provides insights into the dynamics of a reasoning process, highlighting the distinct characteristics of the initial question phase and the subsequent thinking phase. It suggests that understanding the question is critical, but the thinking process is more about consistent, incremental steps.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the relationship between "Importance Score" and "Reasoning Step". The chart appears to represent the evolution of importance assigned to different reasoning steps during a problem-solving process, divided into "Question" and "Thinking" phases. The chart uses a bar-like representation for the "Question" phase and a line-like representation for the "Thinking" phase.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 8500.
* **Y-axis:** "Importance Score", ranging from "Low" to "High".
* **Chart Sections:** The chart is divided into two sections labeled "Question" (from 0 to approximately 40 on the x-axis) and "Thinking" (from approximately 40 to 8500 on the x-axis).
* **Horizontal Line:** A dashed red horizontal line is present, labeled "Mean Score: 0.475, Ratio: 0.232".
### Detailed Analysis
The "Question" phase (x=0 to 40) is represented by a blue bar chart. The importance score starts at a low value around x=0 and rapidly increases to a peak around x=15-20, then gradually decreases to a lower value by x=40. The peak importance score appears to be approximately 0.8-0.9, while the score at x=40 is around 0.3-0.4.
The "Thinking" phase (x=40 to 8500) is represented by a fluctuating blue line. The line oscillates around the dashed red horizontal line, indicating that the importance score during the thinking phase remains relatively stable, with frequent fluctuations. The line's amplitude (variation around the mean) appears to be approximately +/- 0.2. The line generally stays between an importance score of 0.2 and 0.6.
The mean importance score is 0.475, and the ratio is 0.232.
### Key Observations
* The "Question" phase exhibits a clear peak in importance score, suggesting a focused period of initial analysis.
* The "Thinking" phase shows a more distributed and fluctuating importance score, indicating a more exploratory and iterative process.
* The mean importance score provides a baseline for evaluating the overall importance assigned to reasoning steps.
* The ratio value (0.232) is not explicitly defined in the chart, but it likely represents a calculated metric related to the importance scores.
### Interpretation
The chart suggests a two-stage reasoning process. The "Question" phase involves a rapid assessment of initial information, leading to a peak in importance score. The "Thinking" phase represents a more prolonged and dynamic exploration of the problem, with fluctuating importance scores. The relatively stable mean importance score during the "Thinking" phase indicates that, on average, reasoning steps contribute consistently to the overall solution. The ratio value might represent the proportion of important reasoning steps within the "Thinking" phase, or a measure of efficiency. The difference in visualization between the two phases (bar chart vs. line graph) highlights the distinct nature of each stage. The initial burst of importance in the "Question" phase could represent the identification of key constraints or objectives, while the fluctuating importance in the "Thinking" phase could reflect the iterative refinement of potential solutions.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Step". The chart appears to represent the relative importance of each reasoning step during a problem-solving process, divided into "Question" and "Thinking" phases. The chart consists of a dense series of vertical lines, representing the importance score at each step. A horizontal dashed line indicates the mean importance score.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 15000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High". The scale is not explicitly numerical, but visually represents a gradient.
* **Chart Title/Sections:** The chart is divided into two sections labeled "Question" (from 0 to approximately 100 Reasoning Steps) and "Thinking" (from approximately 100 to 15000 Reasoning Steps).
* **Mean Score Indicator:** A horizontal dashed red line is present, labeled "Mean Score: 0.347, Ratio: 0.226".
* **Color Gradient:** A blue color gradient is used to represent the "Importance Score", with darker blues indicating higher scores and lighter blues indicating lower scores.
### Detailed Analysis
The chart displays a fluctuating importance score across reasoning steps.
* **Question Phase (0-100 Reasoning Steps):** The importance score fluctuates rapidly and significantly. The score starts at a low value, quickly rises to high values, and then decreases again. The vertical lines are densely packed, indicating a high frequency of importance score changes.
* **Thinking Phase (100-15000 Reasoning Steps):** The importance score continues to fluctuate, but generally remains lower and more stable than in the "Question" phase. The fluctuations are less dramatic, and the score tends to hover around the mean score. There are periods of increased activity (higher importance scores) interspersed with periods of lower activity.
* **Mean Score:** The mean importance score is 0.347, with a ratio of 0.226. The ratio is not defined in the image.
* **Visual Trend:** The "Question" phase shows a rapid and volatile change in importance score. The "Thinking" phase shows a more gradual and sustained fluctuation around the mean.
### Key Observations
* The "Question" phase exhibits significantly higher variability in importance score compared to the "Thinking" phase.
* The mean importance score appears to be a reasonable representation of the typical importance score during the "Thinking" phase.
* There are no obvious outliers or anomalies in the data, but the high frequency of fluctuations suggests a complex and dynamic reasoning process.
### Interpretation
The chart suggests that the initial "Question" phase of problem-solving involves a rapid exploration of different ideas and considerations, leading to significant fluctuations in the perceived importance of each reasoning step. The "Thinking" phase, on the other hand, represents a more focused and deliberate process, with a more stable and consistent importance score. The mean score provides a baseline for evaluating the relative importance of individual reasoning steps. The ratio of 0.226 is not defined, but could represent the proportion of reasoning steps above the mean score.
The data suggests that the initial phase of problem-solving is characterized by uncertainty and exploration, while the subsequent phase is characterized by refinement and consolidation. The fluctuations in importance score likely reflect the iterative nature of the reasoning process, as new information is considered and existing assumptions are challenged.
</details>
<details>
<summary>x33.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" over "Reasoning Step". The chart appears to represent a process divided into two phases: "Question" and "Thinking". The Importance Score fluctuates significantly during the "Thinking" phase, while it is relatively high and stable during the "Question" phase. A horizontal dashed line indicates the mean Importance Score and Ratio.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from 0 to approximately 7500.
* **Y-axis:** "Importance Score", labeled with "High" at the top and "Low" at the bottom. The scale is not numerically defined, but appears to be a relative scale.
* **Chart Title:** Not explicitly present, but the chart represents the relationship between Reasoning Step and Importance Score.
* **Phases:** The chart is divided into two sections labeled "Question" (from 0 to approximately 50 Reasoning Steps) and "Thinking" (from approximately 50 to 7500 Reasoning Steps).
* **Mean Score Line:** A horizontal dashed red line spanning the entire chart, labeled "Mean Score: 0.604; Ratio: 0.198".
### Detailed Analysis
The "Question" phase (0-50 Reasoning Steps) shows a consistently high Importance Score, represented by a solid blue bar. The score appears to be near the "High" end of the Y-axis.
The "Thinking" phase (50-7500 Reasoning Steps) exhibits a highly variable Importance Score. The score fluctuates rapidly, crossing the mean score line (0.604) numerous times.
* **Initial Drop:** Immediately after the "Question" phase, the Importance Score drops sharply.
* **Fluctuations:** Throughout the "Thinking" phase, the score oscillates between low and moderate levels. There are periods where the score rises above the mean, but it generally remains below the level observed in the "Question" phase.
* **No Clear Trend:** There is no discernible upward or downward trend in the Importance Score during the "Thinking" phase. It appears largely random.
### Key Observations
* The Importance Score is significantly higher during the "Question" phase compared to the "Thinking" phase.
* The "Thinking" phase is characterized by high variability in the Importance Score.
* The mean Importance Score is 0.604, and the Ratio is 0.198.
* The transition from "Question" to "Thinking" is marked by a sudden decrease in Importance Score.
### Interpretation
This chart likely represents the cognitive process of problem-solving or decision-making. The "Question" phase represents the initial understanding of the problem, where the importance of information is high and focused. The "Thinking" phase represents the exploration of potential solutions, where the importance of individual reasoning steps fluctuates as different ideas are considered and evaluated.
The high variability in the "Thinking" phase suggests that the process is exploratory and involves a lot of trial and error. The lower average Importance Score during this phase indicates that many reasoning steps are not directly relevant to the final solution. The ratio of 0.198 could represent the proportion of important reasoning steps within the "Thinking" phase.
The sharp drop in Importance Score at the transition from "Question" to "Thinking" suggests a shift in cognitive focus from understanding the problem to generating potential solutions. This could be interpreted as a necessary step in the problem-solving process, where the initial clarity of the problem is replaced by the uncertainty of exploration.
</details>
<details>
<summary>x34.png Details</summary>

### Visual Description
## Chart: Importance Score vs. Reasoning Step
### Overview
The image presents a chart visualizing the "Importance Score" across "Reasoning Steps". The chart appears to represent a comparison between the initial "Question" phase and the subsequent "Thinking" phase of a reasoning process. The chart is a bar-like representation, with the x-axis representing the reasoning step and the y-axis representing the importance score.
### Components/Axes
* **X-axis:** "Reasoning Step", ranging from approximately 0 to 4000.
* **Y-axis:** "Importance Score", ranging from "Low" to "High".
* **Chart Sections:** Two distinct sections are labeled: "Question" (from 0 to approximately 40 Reasoning Steps) and "Thinking" (from approximately 40 to 4000 Reasoning Steps).
* **Horizontal Line:** A dashed horizontal line is present, labeled "Mean Score: 1.086; Ratio: 0.125".
* **Color Scheme:** The "Question" section is filled with a solid blue color. The "Thinking" section consists of numerous thin, vertical blue bars fluctuating around the mean score line.
### Detailed Analysis
The "Question" section is a single, solid blue bar. Its height indicates a relatively high importance score. The bar extends from approximately Reasoning Step 0 to 40. The importance score appears to be approximately 2.5, but this is an estimate based on the visual scale.
The "Thinking" section, spanning from Reasoning Step 40 to 4000, is composed of many narrow vertical bars. These bars fluctuate significantly above and below the horizontal "Mean Score" line. The bars appear to be randomly distributed around the mean.
* **Mean Score:** 1.086
* **Ratio:** 0.125
The distribution of the bars in the "Thinking" section suggests a high degree of variability in the importance score during the reasoning process.
### Key Observations
* The "Question" phase exhibits a consistently high importance score, significantly higher than the mean score observed in the "Thinking" phase.
* The "Thinking" phase demonstrates a highly variable importance score, with fluctuations above and below the mean.
* The ratio of 0.125 suggests a relationship between the mean score and some other value, but the context of this ratio is not provided in the image.
### Interpretation
The chart suggests that the initial "Question" phase of a reasoning process is characterized by a high degree of importance, while the subsequent "Thinking" phase involves a more fluctuating and variable level of importance. The high variability in the "Thinking" phase could indicate that different reasoning steps contribute varying degrees of importance to the overall process. The mean score and ratio provide quantitative measures of the overall importance and its relationship to another factor, but the meaning of the ratio is unclear without additional context. The chart could be illustrating the cognitive effort or attention allocated to different stages of problem-solving. The initial question is highly focused, while the thinking phase involves exploration and potentially less focused attention.
</details>
Figure 11: Visualization of token importance scores for question and thinking tokens within DeepSeek-R1-Distill-Qwen-7B reasoning traces across six samples from AIME24, AIME25, AMC23, GPQA-D, GAOKAO2023EN, and MATH500.
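The "Mean Score" and "Ratio" annotations in Figure 11 can be illustrated with a simple attention-based importance metric. The sketch below is an illustrative assumption, not the paper's exact scoring function: it treats a token's importance as the average attention it receives from later decoding steps, and reads "Ratio" as the fraction of thinking tokens scoring above the trace-wide mean.

```python
import numpy as np

def token_importance(attn: np.ndarray) -> np.ndarray:
    """attn[i, j] is the attention weight from decoding step i to token j
    (lower-triangular, rows sum to 1). A token's importance is the average
    attention it receives from the steps that can attend to it."""
    T = attn.shape[0]
    received = attn.sum(axis=0)          # total attention each token receives
    visible_steps = np.arange(T, 0, -1)  # token j is visible to T - j queries
    return received / visible_steps

def mean_and_ratio(scores: np.ndarray, question_len: int):
    """Mean score over the whole trace, and the fraction of thinking
    tokens scoring above that mean (one plausible reading of 'Ratio')."""
    mean_score = float(scores.mean())
    thinking = scores[question_len:]
    ratio = float((thinking > mean_score).mean())
    return mean_score, ratio
```

Applied per layer and head to a real attention map, such scores would reproduce the qualitative pattern in Figure 11: dense high-importance question tokens followed by a long, mostly low-importance thinking phase.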
### B.3 Additional Results
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AIME24 Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Random, Bottom, Top) for the R1-Llama model on the AIME24 dataset, as a function of the ratio of data used. The x-axis represents the ratio (in percentage), and the y-axis represents the accuracy (in percentage).
### Components/Axes
* **Title:** R1-Llama | AIME24
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:** Located in the top-right corner.
    * Full (represented by a black dashed line with 'x' markers)
    * Random (represented by a green solid line with triangle markers)
    * Bottom (represented by a blue solid line with square markers)
    * Top (represented by a red solid line with circle markers)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Markers:** 30, 35, 40, 45, 50, 55, 60, 65
### Detailed Analysis
Here's a breakdown of each data series and their trends:
* **Full (Black Dashed Line):** This line is nearly flat, hovering around 65% accuracy across all ratios. It starts at approximately 65% at a ratio of 2%, remains around 65% until a ratio of 40%, and then slightly decreases to approximately 64% at a ratio of 50%.
* **Random (Green Line):** This line shows an upward trend. It starts at approximately 32% accuracy at a ratio of 2%, dips to around 30% at a ratio of 6%, then steadily increases to approximately 48% accuracy at a ratio of 50%.
* **Bottom (Blue Line):** This line exhibits a more fluctuating pattern. It begins at approximately 30% accuracy at a ratio of 2%, rises to around 34% at a ratio of 8%, dips to approximately 28% at a ratio of 10%, then increases to around 37% at a ratio of 40%, and finally settles at approximately 35% at a ratio of 50%.
* **Top (Red Line):** This line demonstrates a clear upward trend. It starts at approximately 55% accuracy at a ratio of 2%, increases to around 59% at a ratio of 20%, continues to rise to approximately 62% at a ratio of 30%, and then slightly decreases to around 61% at a ratio of 50%.
Here's a table reconstructing the approximate data points:
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 65 | 32 | 30 | 55 |
| 4 | 65 | 33 | 32 | 56 |
| 6 | 65 | 30 | 33 | 58 |
| 8 | 65 | 31 | 34 | 59 |
| 10 | 65 | 30 | 28 | 60 |
| 20 | 65 | 38 | 32 | 61 |
| 30 | 65 | 42 | 35 | 62 |
| 40 | 65 | 45 | 37 | 62 |
| 50 | 64 | 48 | 35 | 61 |
### Key Observations
* The "Full" sampling method maintains a consistently high accuracy, regardless of the ratio.
* The "Top" sampling method shows the most significant improvement in accuracy as the ratio increases.
* The "Bottom" sampling method exhibits the most variability in accuracy.
* The "Random" sampling method shows a steady increase in accuracy with increasing ratio, but remains lower than "Top" and "Full".
### Interpretation
The data suggests that using the entire dataset ("Full") provides the most stable and consistently high accuracy. However, if only a limited portion of the data can be used, prioritizing the "Top" samples yields the best results, as accuracy increases substantially with a higher ratio of "Top" samples. The "Bottom" sampling method appears to be the least reliable, with fluctuating accuracy. The "Random" sampling method offers a moderate improvement in accuracy as the ratio increases, but it doesn't reach the levels achieved by "Top" or "Full".
The consistent high accuracy of the "Full" method indicates that the AIME24 dataset doesn't have significant redundancy or noise that would hinder performance. The effectiveness of the "Top" sampling method suggests that certain samples within the dataset are more informative or representative than others, and focusing on these samples can lead to improved accuracy even with a limited dataset size. The poor performance of the "Bottom" sampling method could indicate that these samples are less relevant or contain more noise.
</details>
<details>
<summary>x36.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AIME25 Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Random, Bottom, Top) for the R1-Llama model on the AIME25 dataset, as a function of the ratio of data used. The x-axis represents the ratio (in percentage), and the y-axis represents the accuracy (in percentage).
### Components/Axes
* **Title:** R1-Llama | AIME25
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:**
    * Full (Grey 'x' markers)
    * Random (Green triangle markers)
    * Bottom (Blue square markers)
    * Top (Red circle markers)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Markers:** 30.0, 32.5, 35.0, 37.5, 40.0, 42.5, 45.0, 47.5, 50.0
### Detailed Analysis
The chart contains four data series, each representing a different sampling method.
* **Full (Grey):** The line is relatively flat, starting at approximately 48.5% accuracy at a ratio of 2% and remaining around 49% accuracy up to a ratio of 50%.
* **Random (Green):** The line shows a more dynamic trend. It starts at approximately 32.5% accuracy at 2%, dips to around 30.5% at 8%, rises to approximately 40.0% at 40%, and then slightly decreases to around 38.0% at 50%.
* **Bottom (Blue):** The line generally increases with the ratio. It starts at approximately 33.0% accuracy at 2%, rises to around 37.0% by 8%, and then eases slightly to around 36.0% at 50%.
* **Top (Red):** The line shows a strong upward trend initially, then plateaus. It starts at approximately 43.5% accuracy at 2%, rises to around 48.5% at 6%, and remains relatively stable around 49.0% for the rest of the ratios.
Here's a table reconstructing the approximate data points:
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 48.5 | 32.5 | 33.0 | 43.5 |
| 4 | 49.0 | 31.0 | 34.0 | 46.0 |
| 6 | 49.2 | 32.0 | 36.0 | 48.5 |
| 8 | 49.2 | 30.5 | 37.0 | 49.0 |
| 10 | 49.2 | 32.0 | 37.0 | 49.0 |
| 20 | 49.0 | 30.0 | 37.0 | 48.5 |
| 30 | 48.5 | 36.0 | 37.0 | 48.0 |
| 40 | 49.0 | 40.0 | 36.0 | 49.0 |
| 50 | 49.2 | 38.0 | 36.0 | 49.0 |
### Key Observations
* The "Top" sampling method consistently achieves the highest accuracy across all ratios.
* The "Full" sampling method provides a stable, high level of accuracy.
* The "Random" sampling method exhibits the most variability in accuracy, with a significant increase between ratios 20% and 40%.
* The "Bottom" sampling method shows a moderate and relatively consistent increase in accuracy with increasing ratio.
### Interpretation
The data suggests that the "Top" sampling method is the most effective for the R1-Llama model on the AIME25 dataset, consistently delivering the highest accuracy. The "Full" method provides a reliable baseline performance. The "Random" method's performance is more sensitive to the ratio of data used, indicating that a larger sample size is needed to achieve optimal results. The "Bottom" method offers a moderate level of accuracy that improves with more data.
The plateauing of the "Top" and "Full" lines suggests that there is a diminishing return in accuracy beyond a certain ratio of data. This could indicate that the model reaches its performance limit with the available data or that the additional data does not contribute significantly to its learning process. The initial dip in the "Random" line could be due to the inherent variability of random sampling, where a small sample may not be representative of the overall dataset. The subsequent rise suggests that as the sample size increases, the random sampling becomes more representative and the accuracy improves.
</details>
<details>
<summary>x37.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | AMC23 Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Random, Bottom, Top) as a function of the ratio of data used. The chart appears to be evaluating the performance of a model (R1-Llama) on a dataset (AMC23). The x-axis represents the ratio of data used, expressed as a percentage, and the y-axis represents the accuracy, also expressed as a percentage.
### Components/Axes
* **Title:** R1-Llama | AMC23
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:**
    * Full (Grey, dashed line)
    * Random (Green, solid line)
    * Bottom (Blue, solid line)
    * Top (Red, solid line)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Scale:** Approximately 60% to 96%
### Detailed Analysis
The chart contains four data series, each representing a different sampling method.
* **Top (Red Line):** This line shows a generally increasing trend, from approximately 88% accuracy at a ratio of 2% to approximately 95% at ratios of 40-50%.
* **Bottom (Blue Line):** This line also generally increases, but starts lower and remains below the "Top" line: roughly 65% at 2%, dipping to about 62% at 6%, then climbing to approximately 85% at 50%.
* **Random (Green Line):** This line generally increases, starting low (about 63% at 2%, with a dip to 59% at 6%) and ending higher than "Bottom" but lower than "Top", at approximately 86%.
* **Full (Grey, Dashed Line):** This line is approximately flat and consistently high, at about 95% across all ratios.
Here's a table reconstructing the approximate data points:
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 95 | 63 | 65 | 88 |
| 4 | 95 | 61 | 64 | 91 |
| 6 | 95 | 59 | 62 | 92 |
| 8 | 95 | 60 | 63 | 92.5 |
| 10 | 95 | 64 | 66 | 93 |
| 20 | 95 | 71 | 70 | 94 |
| 30 | 95 | 76 | 73 | 94.5 |
| 40 | 95 | 84 | 80 | 95 |
| 50 | 95 | 86 | 85 | 95 |
### Key Observations
* The "Top" sampling method consistently achieves the highest accuracy across all ratios.
* The "Full" sampling method maintains a very high and stable accuracy, comparable to "Top".
* The "Bottom" sampling method consistently has the lowest accuracy.
* Accuracy generally increases with increasing ratio for all sampling methods except "Full".
* The "Random" sampling method performs better than "Bottom" but worse than "Top".
### Interpretation
The data suggests that selecting the "Top" data points or using the entire dataset ("Full") yields the best performance for the R1-Llama model on the AMC23 dataset. The "Bottom" sampling method is the least effective. The increasing accuracy with ratio for "Random", "Bottom", and "Top" indicates that more data generally improves model performance, up to a point where the "Top" and "Full" methods plateau. The flat line for "Full" suggests that adding more data beyond a certain point does not significantly improve accuracy. This could be due to diminishing returns or the dataset reaching its information capacity. The difference between "Top" and "Full" is minimal, suggesting that the most informative data is concentrated in the "Top" portion of the dataset. This could be useful for reducing computational costs by only using the most relevant data. The "Random" method provides a middle ground, offering some improvement over "Bottom" without the computational cost of "Full". The consistent high performance of "Full" suggests that the model benefits from a comprehensive view of the data, but the "Top" method offers a viable alternative with comparable results.
</details>
<details>
<summary>x38.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | GPQA-D Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Random, Bottom, Top) for the R1-Llama model on the GPQA-D dataset, as a function of the ratio of samples used. The x-axis represents the ratio (in percentage), and the y-axis represents the accuracy (in percentage).
### Components/Axes
* **Title:** R1-Llama | GPQA-D
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:**
    * Full (Grey 'x' markers)
    * Random (Green triangle markers)
    * Bottom (Blue triangle markers)
    * Top (Red circle markers)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Scale:** Approximately 36 to 46 (percentage)
### Detailed Analysis
The chart shows four lines representing the accuracy of each sampling method as the ratio increases.
* **Full (Grey):** The line is relatively flat, hovering around approximately 44.5% accuracy across all ratios.
* **Random (Green):** The line starts at approximately 36.5% at a ratio of 2%, dips to a minimum of around 35.5% at 8%, increases steadily to approximately 42.5% at 40%, and then falls back to roughly 38% at 50%.
* **Bottom (Blue):** The line fluctuates between approximately 36.5% and 38% with no clear trend.
* **Top (Red):** The line starts at approximately 42.5% at a ratio of 2%, rises to a peak of approximately 45% at 20%, and then settles around 44.5% through 50%.
Here's a table reconstructing the approximate data points:
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 44.5 | 36.5 | 37.5 | 42.5 |
| 4 | 44.5 | 37.5 | 37.0 | 43.5 |
| 6 | 44.5 | 38.0 | 38.0 | 44.0 |
| 8 | 44.5 | 35.5 | 37.0 | 44.5 |
| 10 | 44.5 | 37.0 | 36.5 | 44.5 |
| 20 | 44.5 | 39.0 | 37.0 | 45.0 |
| 30 | 44.5 | 40.5 | 37.0 | 44.8 |
| 40 | 44.5 | 42.5 | 38.0 | 44.5 |
| 50 | 44.5 | ~38.0 | 38.0 | 44.5 |
### Key Observations
* The "Full" sampling method maintains a consistently high accuracy across all ratios.
* The "Top" sampling method shows the highest accuracy at a ratio of 20%.
* The "Random" sampling method demonstrates a significant increase in accuracy as the ratio increases, starting from a lower baseline.
* The "Bottom" sampling method exhibits the least variation in accuracy.
### Interpretation
The data suggests that using all available samples ("Full") provides a stable and reasonably high level of accuracy. However, the "Top" sampling method can achieve even higher accuracy, particularly when using around 20% of the samples. The "Random" method shows that increasing the sample ratio can improve accuracy, but it starts from a lower point than the other methods. The "Bottom" method appears to be the least effective, with consistently lower and more stable accuracy.
The relationship between the sampling methods and accuracy likely reflects the distribution of information within the dataset. The "Top" method may be identifying the most relevant samples, leading to higher accuracy at certain ratios. The "Random" method's improvement with ratio suggests that more samples generally lead to better results, but the randomness introduces variability. The "Full" method's stability indicates that all samples contribute to a consistent level of performance. The "Bottom" method's poor performance suggests that the least relevant samples are detrimental to accuracy. The peak at 20% for the "Top" method could indicate a sweet spot where the most informative samples are captured without including too much noise.
</details>
<details>
<summary>x39.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | GK23EN Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Bottom, Random, Top) for the R1-Llama model (GK23EN) as a function of the ratio of data used. The x-axis represents the ratio, expressed as a percentage, and the y-axis represents the accuracy, also expressed as a percentage.
### Components/Axes
* **Title:** R1-Llama | GK23EN
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:**
* Full (represented by a dashed grey line with 'x' markers)
* Bottom (represented by a solid blue line with square markers)
* Random (represented by a solid green line with triangle markers)
* Top (represented by a solid red line with circle markers)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Scale:** Ranges from approximately 62% to 74%.
### Detailed Analysis
Here's a breakdown of each data series and their trends:
* **Top (Red Line):** This line generally slopes upward from 2% to 30% ratio, then plateaus and slightly declines.
* Ratio 2%: Accuracy ≈ 72.4%
* Ratio 4%: Accuracy ≈ 73.0%
* Ratio 6%: Accuracy ≈ 73.4%
* Ratio 8%: Accuracy ≈ 73.6%
* Ratio 10%: Accuracy ≈ 73.8%
* Ratio 20%: Accuracy ≈ 74.0%
* Ratio 30%: Accuracy ≈ 74.2%
* Ratio 40%: Accuracy ≈ 73.4%
* Ratio 50%: Accuracy ≈ 73.8%
* **Bottom (Blue Line):** This line shows a consistent upward trend throughout the entire range.
* Ratio 2%: Accuracy ≈ 62.8%
* Ratio 4%: Accuracy ≈ 63.6%
* Ratio 6%: Accuracy ≈ 64.0%
* Ratio 8%: Accuracy ≈ 64.4%
* Ratio 10%: Accuracy ≈ 64.6%
* Ratio 20%: Accuracy ≈ 65.2%
* Ratio 30%: Accuracy ≈ 65.8%
* Ratio 40%: Accuracy ≈ 67.2%
* Ratio 50%: Accuracy ≈ 67.8%
* **Random (Green Line):** This line exhibits a slow, steady increase up to 30%, then a steep increase from 30% to 50%.
* Ratio 2%: Accuracy ≈ 62.4%
* Ratio 4%: Accuracy ≈ 63.2%
* Ratio 6%: Accuracy ≈ 63.8%
* Ratio 8%: Accuracy ≈ 64.2%
* Ratio 10%: Accuracy ≈ 64.5%
* Ratio 20%: Accuracy ≈ 65.0%
* Ratio 30%: Accuracy ≈ 65.6%
* Ratio 40%: Accuracy ≈ 68.2%
* Ratio 50%: Accuracy ≈ 70.2%
* **Full (Grey Dashed Line):** This line remains relatively flat throughout the entire range, fluctuating around 72.5%.
* Ratio 2%: Accuracy ≈ 72.6%
* Ratio 4%: Accuracy ≈ 72.8%
* Ratio 6%: Accuracy ≈ 72.6%
* Ratio 8%: Accuracy ≈ 72.4%
* Ratio 10%: Accuracy ≈ 72.4%
* Ratio 20%: Accuracy ≈ 72.6%
* Ratio 30%: Accuracy ≈ 72.4%
* Ratio 40%: Accuracy ≈ 72.2%
* Ratio 50%: Accuracy ≈ 72.4%
### Key Observations
* The "Top" sampling method consistently achieves the highest accuracy, particularly at higher ratios.
* The "Bottom" sampling method shows a steady improvement in accuracy as the ratio increases, but remains significantly lower than the "Top" method.
* The "Random" sampling method demonstrates a delayed but substantial increase in accuracy at higher ratios (40% and 50%).
* The "Full" sampling method maintains a relatively constant accuracy, suggesting it is not significantly affected by the ratio of data used.
### Interpretation
The chart suggests that the "Top" sampling method is the most effective for the R1-Llama model (GK23EN) in terms of achieving high accuracy. The "Bottom" method provides a consistent, albeit lower, level of accuracy. The "Random" method's late surge in accuracy indicates that a larger sample size is crucial for its performance. The "Full" method's stability suggests it may be less sensitive to data quantity, potentially due to its comprehensive nature.
The differences in performance between the sampling methods likely stem from the distribution of information within the dataset. The "Top" method may be prioritizing the most informative data points, leading to higher accuracy. The "Random" method's improvement with increased ratio suggests it benefits from a larger, more representative sample. The "Full" method's consistent performance could be attributed to its inclusion of all available data, mitigating the impact of sample size.
The plateauing of the "Top" method's accuracy at higher ratios suggests a point of diminishing returns, where adding more data does not significantly improve performance. This could indicate that the model has reached its maximum potential with the available data and sampling strategy.
</details>
<details>
<summary>x40.png Details</summary>

### Visual Description
## Line Chart: R1-Llama | MATH500 Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of the R1-Llama model on the MATH500 dataset, plotted against different ratios. Four data series are presented: "Full", "Random", "Bottom", and "Top". The chart aims to demonstrate how the model's performance changes with varying proportions of data used, categorized by the selection method (Full, Random, Bottom, Top).
### Components/Axes
* **Title:** R1-Llama | MATH500
* **X-axis:** Ratio (%) - Scale ranges from approximately 2% to 50%. Markers are present at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
* **Y-axis:** Accuracy (%) - Scale ranges from approximately 76% to 92%.
* **Legend:** Located in the top-right corner.
* "Full" - Represented by a black 'x' marker.
* "Random" - Represented by a green triangle marker.
* "Bottom" - Represented by a blue diamond marker.
* "Top" - Represented by a red circle marker.
### Detailed Analysis
* **Top (Red Line):** The "Top" line exhibits a strong upward trend, starting at approximately 85% accuracy at a ratio of 2%, increasing to around 91% at a ratio of 20%, and plateauing around 91.5% from 20% to 50%.
* Ratio 2%: ~85%
* Ratio 4%: ~87.5%
* Ratio 6%: ~88.5%
* Ratio 8%: ~89%
* Ratio 10%: ~90%
* Ratio 20%: ~91%
* Ratio 30%: ~91.2%
* Ratio 40%: ~91.4%
* Ratio 50%: ~91.5%
* **Random (Green Line):** The "Random" line shows a moderate upward trend. It begins at approximately 77.5% accuracy at 2%, gradually increasing to around 85% at 50%.
* Ratio 2%: ~77.5%
* Ratio 4%: ~78%
* Ratio 6%: ~78.5%
* Ratio 8%: ~79%
* Ratio 10%: ~79.5%
* Ratio 20%: ~81%
* Ratio 30%: ~82.5%
* Ratio 40%: ~84%
* Ratio 50%: ~85%
* **Bottom (Blue Line):** The "Bottom" line demonstrates a slow, but consistent, upward trend. It starts at approximately 76.5% accuracy at 2% and reaches around 82.5% at 50%.
* Ratio 2%: ~76.5%
* Ratio 4%: ~77%
* Ratio 6%: ~77.5%
* Ratio 8%: ~77.7%
* Ratio 10%: ~78%
* Ratio 20%: ~79%
* Ratio 30%: ~81%
* Ratio 40%: ~82%
* Ratio 50%: ~82.5%
* **Full (Black Line):** The "Full" line is relatively flat, hovering around 77% accuracy across all ratios.
* Ratio 2%: ~77%
* Ratio 4%: ~77%
* Ratio 6%: ~77%
* Ratio 8%: ~77%
* Ratio 10%: ~77%
* Ratio 20%: ~77%
* Ratio 30%: ~77%
* Ratio 40%: ~77%
* Ratio 50%: ~77%
### Key Observations
* The "Top" data selection method consistently yields the highest accuracy across all ratios.
* The "Full" data selection method shows the lowest and most stable accuracy.
* The "Bottom" data selection method has the slowest rate of accuracy improvement with increasing ratio.
* The "Random" data selection method performs better than "Bottom" but significantly worse than "Top".
### Interpretation
The data suggests that retaining the "Top"-scored thinking tokens is the most effective strategy for R1-Llama on MATH500; notably, it even surpasses the "Full" baseline, which retains every thinking token. This implies that the full reasoning trace contains a substantial amount of redundant or distracting context, and focusing attention on the decision-critical subset ("Top") is crucial for high accuracy. The slow improvement of the "Bottom" strategy suggests that low-importance tokens are largely unhelpful, while the "Random" strategy provides an intermediate baseline, confirming that targeted selection matters far more than sheer quantity. The plateau of the "Top" line at higher ratios indicates diminishing returns: beyond a certain point, retaining additional tokens does not further improve accuracy.
</details>
<details>
<summary>x41.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio for Different Data Selection Strategies
### Overview
This line chart displays the accuracy of a model (likely a language model, given the context "R1-Qwen | AIME24") as a function of the ratio of data used, comparing four different data selection strategies: Full, Random, Bottom, and Top. The accuracy is measured in percentage (%), and the ratio is also expressed as a percentage (%).
### Components/Axes
* **Title:** R1-Qwen | AIME24 (top-center)
* **X-axis:** Ratio (%) - ranging from 2% to 50%. Markers are placed at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
* **Y-axis:** Accuracy (%) - ranging from approximately 30% to 75%.
* **Legend:** Located in the top-right corner.
* Full (gray dashed line with 'x' markers)
* Random (green solid line with triangle markers)
* Bottom (blue solid line with diamond markers)
* Top (red solid line with circle markers)
### Detailed Analysis
* **Full (Gray):** The line representing "Full" data starts at approximately 72% accuracy at 2% ratio and remains relatively constant at around 72-73% accuracy throughout the entire ratio range (up to 50%).
* **Random (Green):** The "Random" line starts at approximately 36% accuracy at 2% ratio. It initially decreases to around 32% at 6% ratio, then increases to approximately 35% at 50% ratio. The trend is generally flat with some fluctuations.
* **Bottom (Blue):** The "Bottom" line begins at approximately 38% accuracy at 2% ratio. It fluctuates between approximately 38% and 42% accuracy, with a slight upward trend towards the end, reaching around 43% at 50% ratio.
* **Top (Red):** The "Top" line shows a strong upward trend. It starts at approximately 54% accuracy at 2% ratio, increases rapidly to around 68% at 6% ratio, reaches a peak of approximately 72% at 8% ratio, and then plateaus around 72-73% accuracy for the remainder of the ratio range.
**Data Points (Approximate):**
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 72 | 36 | 38 | 54 |
| 4 | 72 | 34 | 39 | 64 |
| 6 | 72 | 32 | 40 | 68 |
| 8 | 72 | 33 | 41 | 72 |
| 10 | 72 | 34 | 40 | 72 |
| 20 | 72 | 33 | 40 | 72 |
| 30 | 72 | 34 | 41 | 72 |
| 40 | 72 | 33 | 42 | 72 |
| 50 | 72 | 35 | 43 | 72 |
### Key Observations
* The "Top" data selection strategy consistently outperforms all other strategies, especially at lower ratios (2% to 8%).
* The "Full" data strategy maintains a high and stable accuracy across all ratios.
* "Random" and "Bottom" strategies exhibit relatively low and fluctuating accuracy.
* The "Top" strategy shows diminishing returns after 8% ratio, as its accuracy plateaus.
### Interpretation
The chart demonstrates the effectiveness of retaining the "Top"-scored thinking tokens, particularly when only a small fraction can be kept. The rapid accuracy gains at ratios of 2% to 8% suggest that a small set of decision-critical tokens carries most of the information needed to reach the final answer. The "Full" strategy provides a stable upper baseline, while the low, fluctuating accuracy of "Random" and "Bottom" indicates that randomly chosen or low-importance tokens contribute little. The plateau of the "Top" strategy after an 8% ratio shows that once the decision-critical tokens are retained, keeping additional tokens yields no further gains. This supports prioritizing token importance over token quantity, especially under tight memory budgets.
</details>
<details>
<summary>x42.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio for Different Data Distributions
### Overview
This image presents a line chart comparing the accuracy of a model across different ratios of data distribution. Four data series are plotted: "Full", "Random", "Bottom", and "Top". The chart aims to demonstrate how the distribution of data impacts model performance, measured by accuracy. The title of the chart is "R1-Qwen | AIME25".
### Components/Axes
* **X-axis:** "Ratio (%)", ranging from 2 to 50, with markers at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
* **Y-axis:** "Accuracy (%)", ranging from 20 to 65, with markers at 20, 30, 40, 50, and 60.
* **Legend:** Located in the top-right corner, identifying the four data series with corresponding colors:
* "Full" - Grey 'x' markers
* "Random" - Green triangle markers
* "Bottom" - Blue diamond markers
* "Top" - Red circle markers
### Detailed Analysis
* **Full (Grey):** The "Full" line is approximately horizontal, maintaining an accuracy of around 61-63% across all ratios. The data points are:
* Ratio 2: ~62%
* Ratio 4: ~62%
* Ratio 6: ~62%
* Ratio 8: ~62%
* Ratio 10: ~62%
* Ratio 20: ~62%
* Ratio 30: ~62%
* Ratio 40: ~62%
* Ratio 50: ~62%
* **Random (Green):** The "Random" line shows a fluctuating trend. It starts at approximately 34% at a ratio of 2, increases to a peak of around 37% at a ratio of 6, then declines to approximately 22% at a ratio of 30, and finally rises to around 31% at a ratio of 50. The data points are:
* Ratio 2: ~34%
* Ratio 4: ~35%
* Ratio 6: ~37%
* Ratio 8: ~34%
* Ratio 10: ~32%
* Ratio 20: ~30%
* Ratio 30: ~22%
* Ratio 40: ~27%
* Ratio 50: ~31%
* **Bottom (Blue):** The "Bottom" line exhibits an upward trend. It begins at approximately 41% at a ratio of 2, dips slightly to around 39% at a ratio of 6, and then steadily increases to approximately 51% at a ratio of 50. The data points are:
* Ratio 2: ~41%
* Ratio 4: ~40%
* Ratio 6: ~39%
* Ratio 8: ~40%
* Ratio 10: ~41%
* Ratio 20: ~43%
* Ratio 30: ~44%
* Ratio 40: ~47%
* Ratio 50: ~51%
* **Top (Red):** The "Top" line demonstrates a strong upward trend, followed by a plateau. It starts at approximately 42% at a ratio of 2, rapidly increases to around 57% at a ratio of 6, continues to rise to approximately 61% at a ratio of 10, and then plateaus around 61-63% for the remaining ratios. The data points are:
* Ratio 2: ~42%
* Ratio 4: ~53%
* Ratio 6: ~57%
* Ratio 8: ~59%
* Ratio 10: ~61%
* Ratio 20: ~61%
* Ratio 30: ~61%
* Ratio 40: ~62%
* Ratio 50: ~62%
### Key Observations
* The "Top" data distribution consistently achieves the highest accuracy, particularly as the ratio increases.
* The "Full" data distribution maintains a stable, but relatively lower, accuracy compared to "Top".
* The "Random" data distribution exhibits the lowest and most volatile accuracy.
* The "Bottom" data distribution shows a positive correlation between ratio and accuracy, but remains lower than "Top".
### Interpretation
The chart suggests that which thinking tokens are retained significantly impacts accuracy. Keeping the "Top"-scored tokens (the most decision-critical ones) quickly recovers the "Full" baseline, matching it with only about 10% of the tokens. The "Random" strategy performs poorly and erratically, highlighting the importance of informed selection, while the rising accuracy of the "Bottom" strategy shows that even low-importance tokens help somewhat once enough of them are kept. The plateau in "Top" accuracy indicates diminishing returns: beyond roughly a 10% ratio, additional tokens yield no meaningful improvement. The title "R1-Qwen | AIME25" identifies the model (DeepSeek-R1-Distill-Qwen-7B) and the benchmark (AIME25) used in this experiment.
</details>
<details>
<summary>x43.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio for Different Data Selection Methods
### Overview
This line chart displays the accuracy of a model (likely a machine learning model) under different data selection ratios. Four data selection methods are compared: "Full", "Random", "Bottom", and "Top". The x-axis represents the "Ratio (%)" of data used, ranging from 2% to 50%, and the y-axis represents the "Accuracy (%)", ranging from approximately 60% to 98%. The chart title is "R1-Qwen | AMC23".
### Components/Axes
* **Title:** R1-Qwen | AMC23 (positioned at the top-center)
* **X-axis Label:** Ratio (%) (positioned at the bottom-center)
* Axis Markers: 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Label:** Accuracy (%) (positioned at the left-center)
* Axis Markers: 60, 65, 70, 75, 80, 85, 90, 95, 98
* **Legend:** Located in the top-right corner.
* "Full" - Grey dashed line with 'x' markers
* "Random" - Green line with triangle markers
* "Bottom" - Blue line with diamond markers
* "Top" - Red line with circle markers
### Detailed Analysis
* **Full (Grey):** The "Full" line starts at approximately 97% accuracy at 2% ratio and remains relatively flat, fluctuating around 96-97% accuracy until 50% ratio.
* **Random (Green):** The "Random" line begins at approximately 65% accuracy at 2% ratio. It generally increases with increasing ratio, showing a slight dip around 6-8% ratio, reaching approximately 82% accuracy at 50% ratio.
* **Bottom (Blue):** The "Bottom" line starts at approximately 73% accuracy at 2% ratio. It fluctuates between approximately 72% and 76% accuracy until 30% ratio, then increases to approximately 79% accuracy at 50% ratio.
* **Top (Red):** The "Top" line starts at approximately 84% accuracy at 2% ratio. It shows a strong upward trend, increasing to approximately 96% accuracy at 8% ratio, and then plateaus, remaining around 96-97% accuracy until 50% ratio.
Here's a table summarizing approximate data points:
| Ratio (%) | Full (%) | Random (%) | Bottom (%) | Top (%) |
|---|---|---|---|---|
| 2 | 97 | 65 | 73 | 84 |
| 4 | 97 | 68 | 74 | 93 |
| 6 | 97 | 66 | 72 | 95 |
| 8 | 97 | 65 | 71 | 96 |
| 10 | 97 | 67 | 73 | 96 |
| 20 | 97 | 72 | 74 | 96 |
| 30 | 97 | 75 | 75 | 96 |
| 40 | 97 | 80 | 75 | 96 |
| 50 | 97 | 82 | 79 | 96 |
### Key Observations
* The "Top" data selection method consistently achieves the highest accuracy, especially at lower ratios.
* The "Full" data selection method maintains high accuracy across all ratios.
* The "Random" and "Bottom" methods show lower accuracy compared to "Top" and "Full", with "Random" showing a more consistent increase with ratio.
* The "Bottom" method exhibits relatively stable accuracy, with minimal variation across ratios.
### Interpretation
The data suggests that retaining the "Top"-scored thinking tokens is the most effective strategy under a tight token budget: it recovers the accuracy of the "Full" baseline (which keeps every thinking token) with only about 8% of the tokens. The "Random" strategy improves as more tokens are retained but stays well below "Top" and "Full", while the "Bottom" strategy remains the least effective, with low and largely flat accuracy across ratios.
The chart reflects a setting where the KV cache budget is limited and the goal is to preserve reasoning performance by retaining only a subset of thinking tokens. The "Top" strategy keeps the most decision-critical tokens, enabling strong accuracy at small ratios; the "Bottom" strategy keeps the least informative ones, resulting in poor accuracy; "Random" serves as a selection-free baseline; and "Full" shows the accuracy attainable without any eviction.
</details>
<details>
<summary>x44.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | GPQA-D Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different strategies (Full, Random, Bottom, Top) for a model (R1-Qwen) on a dataset (GPQA-D) as a function of the ratio of thinking tokens retained. The chart shows how accuracy changes as the retention ratio increases from 2% to 50%.
### Components/Axes
* **Title:** R1-Qwen | GPQA-D
* **X-axis:** Ratio (%) - Scale ranges from 2 to 50, with markers at 2, 4, 6, 8, 10, 20, 30, 40, and 50.
* **Y-axis:** Accuracy (%) - Scale ranges from 36 to 50, with markers at 36, 38, 40, 42, 44, 46, 48, and 50.
* **Legend:** Located in the top-center of the chart.
* Full (Grey 'x' markers)
* Random (Green triangle markers)
* Bottom (Blue square markers)
* Top (Red circle markers)
### Detailed Analysis
* **Full (Grey):** The line representing "Full" is nearly horizontal, indicating a relatively constant accuracy across all ratios. It starts at approximately 48.2% at a ratio of 2% and increases slightly to approximately 50.2% at a ratio of 50%.
* **Random (Green):** The "Random" line shows a decreasing trend from 2% to 6% ratio, then a sharp increase from 20% to 40% ratio.
* At 2% Ratio: Approximately 38.5% accuracy.
* At 4% Ratio: Approximately 37.5% accuracy.
* At 6% Ratio: Approximately 36.5% accuracy.
* At 8% Ratio: Approximately 36.5% accuracy.
* At 10% Ratio: Approximately 37% accuracy.
* At 20% Ratio: Approximately 38.5% accuracy.
* At 30% Ratio: Approximately 42% accuracy.
* At 40% Ratio: Approximately 45.5% accuracy.
* At 50% Ratio: Approximately 46.5% accuracy.
* **Bottom (Blue):** The "Bottom" line fluctuates with a slight upward trend.
* At 2% Ratio: Approximately 40.5% accuracy.
* At 4% Ratio: Approximately 40.5% accuracy.
* At 6% Ratio: Approximately 41% accuracy.
* At 8% Ratio: Approximately 40% accuracy.
* At 10% Ratio: Approximately 39.5% accuracy.
* At 20% Ratio: Approximately 40% accuracy.
* At 30% Ratio: Approximately 41% accuracy.
* At 40% Ratio: Approximately 42% accuracy.
* At 50% Ratio: Approximately 42.5% accuracy.
* **Top (Red):** The "Top" line shows a slight increase, then a decrease, and then a slight increase again.
* At 2% Ratio: Approximately 48% accuracy.
* At 4% Ratio: Approximately 48.5% accuracy.
* At 6% Ratio: Approximately 49% accuracy.
* At 8% Ratio: Approximately 48.5% accuracy.
* At 10% Ratio: Approximately 48.5% accuracy.
* At 20% Ratio: Approximately 49% accuracy.
* At 30% Ratio: Approximately 49.5% accuracy.
* At 40% Ratio: Approximately 49% accuracy.
* At 50% Ratio: Approximately 49.5% accuracy.
### Key Observations
* The "Full" strategy maintains a consistently high accuracy, significantly higher than the other strategies at lower ratios.
* The "Random" strategy performs poorly at low ratios but shows a substantial improvement as the ratio increases, suggesting it benefits from more data.
* The "Bottom" strategy exhibits moderate and relatively stable accuracy.
* The "Top" strategy shows a slight initial increase in accuracy, followed by a dip, and then a slight recovery.
### Interpretation
The data suggests that the "Full" strategy is the most robust and reliable, providing consistently high accuracy regardless of the ratio. The "Random" strategy, while initially less accurate, demonstrates the potential to improve with increased data. This could indicate that the random selection process benefits from a larger sample size. The "Bottom" and "Top" strategies offer moderate performance, potentially representing the impact of selecting data from specific portions of the dataset. The differences in performance between these strategies highlight the importance of data selection and the potential benefits of utilizing the entire dataset ("Full" strategy) for optimal accuracy. The ratio likely represents the proportion of the dataset used for training, and the results suggest that increasing the training data generally improves performance, particularly for the "Random" strategy. The "Full" strategy's consistent performance suggests it is less sensitive to the amount of training data used.
</details>
<details>
<summary>x45.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Ratio for Different Data Selection Methods
### Overview
The image presents a line chart comparing the accuracy of a model under different data selection ratios, specifically for "Full", "Bottom", "Random", and "Top" data subsets. The chart is titled "R1-Qwen | GK23EN" at the top center. The x-axis represents the "Ratio (%)" and the y-axis represents "Accuracy (%)".
### Components/Axes
* **Title:** R1-Qwen | GK23EN
* **X-axis Label:** Ratio (%)
* Scale: 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Label:** Accuracy (%)
* Scale: 66, 68, 70, 72, 74, 76, 78, 80
* **Legend:** Located in the top-right corner.
* Full: Grey dashed line with 'x' markers
* Bottom: Blue line with diamond markers
* Random: Green line with triangle markers
* Top: Red line with circle markers
### Detailed Analysis
* **Full (Grey):** The "Full" data line is relatively flat, starting at approximately 79% accuracy at Ratio 2% and remaining around 79% up to Ratio 50%.
* **Bottom (Blue):** The "Bottom" data line shows a slight upward trend. It begins at approximately 66.5% accuracy at Ratio 2%, increases to around 68.5% at Ratio 20%, and then rises to approximately 70.5% at Ratio 50%.
* **Random (Green):** The "Random" data line exhibits a decreasing trend from Ratio 2% to Ratio 20%, starting at approximately 66.5% and decreasing to around 65%. It then increases sharply from Ratio 20% to Ratio 50%, reaching approximately 72% accuracy.
* **Top (Red):** The "Top" data line shows a strong upward trend from Ratio 2% to Ratio 8%, increasing from approximately 70% to 78.5%. It then plateaus, remaining around 78.5% to 79% from Ratio 8% to Ratio 50%.
**Data Points (Approximate):**
| Ratio (%) | Full (%) | Bottom (%) | Random (%) | Top (%) |
|---|---|---|---|---|
| 2 | 79 | 66.5 | 66.5 | 70 |
| 4 | 79 | 67 | 66 | 76 |
| 6 | 79 | 67.5 | 66 | 78 |
| 8 | 79 | 68 | 66.5 | 78.5 |
| 10 | 79 | 68 | 65.5 | 78.5 |
| 20 | 79 | 68.5 | 65 | 78.5 |
| 30 | 79 | 68.5 | 66 | 78.5 |
| 40 | 79 | 69 | 69 | 78.5 |
| 50 | 79 | 70.5 | 72 | 78.5 |
### Key Observations
* The "Top" data selection consistently yields the highest accuracy, especially at lower ratios.
* The "Full" data provides a stable, high level of accuracy.
* The "Random" data selection performs the worst initially but shows a significant increase in accuracy at higher ratios.
* The "Bottom" data selection shows a modest improvement in accuracy as the ratio increases.
### Interpretation
The chart demonstrates the impact of different thinking-token retention strategies on accuracy. Retaining the "Top"-scored tokens (the most decision-critical ones) matches the "Full" baseline with as little as 8% of the tokens. The "Full" line is flat by construction, since it always keeps every thinking token regardless of the ratio. The "Random" strategy is initially the weakest but improves once a large fraction of tokens is retained, while the "Bottom" strategy improves only modestly, suggesting that low-importance tokens contribute little to the final answer. Overall, prioritizing high-importance thinking tokens is highly effective, especially when memory or compute budgets are limited.
</details>
<details>
<summary>x46.png Details</summary>

### Visual Description
## Line Chart: R1-Qwen | MATH500 Accuracy vs. Ratio
### Overview
This line chart displays the accuracy of different sampling methods (Full, Random, Bottom, Top) on the MATH500 dataset, as a function of the ratio of data used. The x-axis represents the ratio of data used (in percentage), and the y-axis represents the accuracy (in percentage).
### Components/Axes
* **Title:** R1-Qwen | MATH500
* **X-axis Label:** Ratio (%)
* **Y-axis Label:** Accuracy (%)
* **Legend:**
* Full (represented by a grey dashed line with 'x' markers)
* Random (represented by a green solid line with triangle markers)
* Bottom (represented by a blue solid line with square markers)
* Top (represented by a red solid line with circle markers)
* **X-axis Markers:** 2, 4, 6, 8, 10, 20, 30, 40, 50
* **Y-axis Markers:** 80, 82, 84, 86, 88, 90, 92, 94
### Detailed Analysis
* **Top (Red Line):** The Top line starts at approximately 82% accuracy at a ratio of 2%, then rapidly increases to approximately 94% accuracy at a ratio of 10%. It plateaus around 94-95% accuracy from a ratio of 10% to 50%.
* **Bottom (Blue Line):** The Bottom line starts at approximately 82% accuracy at a ratio of 2%. It gradually increases to approximately 86% accuracy at a ratio of 50%, with a relatively linear trend.
* **Full (Grey Dashed Line):** The Full line remains relatively constant at approximately 94% accuracy across all ratios, from 2% to 50%.
* **Random (Green Line):** The Random line starts at approximately 82% accuracy at a ratio of 2%. It initially fluctuates around 82-83% until a ratio of 10%, then decreases to approximately 81% at a ratio of 20%. It then increases to approximately 85% at a ratio of 50%.
Here's a more detailed breakdown of the data points (approximate values):
| Ratio (%) | Top (Red) | Bottom (Blue) | Full (Grey) | Random (Green) |
|---|---|---|---|---|
| 2 | 82 | 82 | 94 | 82 |
| 4 | 88 | 83 | 94 | 82 |
| 6 | 91 | 83 | 94 | 82 |
| 8 | 92.5 | 83.5 | 94 | 82.5 |
| 10 | 94 | 84 | 94 | 82 |
| 20 | 94 | 84.5 | 94 | 81 |
| 30 | 94 | 85 | 94 | 82 |
| 40 | 94 | 85 | 94 | 84 |
| 50 | 94 | 86 | 94 | 85 |
### Key Observations
* The "Top" sampling method achieves the highest accuracy, especially at lower ratios.
* The "Full" method maintains a consistently high accuracy across all ratios.
* The "Random" method exhibits the most variability in accuracy.
* The "Bottom" method shows a steady, but relatively slow, increase in accuracy.
* The "Top" method demonstrates diminishing returns after a ratio of 10%, as accuracy plateaus.
### Interpretation
The data suggests that retaining the "Top"-scored thinking tokens is highly effective on MATH500, recovering the accuracy of the "Full" baseline with only about 10% of the tokens. The "Full" strategy sets a high reference level but offers no advantage over "Top" once enough high-importance tokens are kept. The "Random" strategy is the least consistent, indicating that uninformed selection is suboptimal, while the "Bottom" strategy improves only slowly and remains well below "Top" and "Full".
The plateau of the "Top" strategy after a 10% ratio suggests that the decision-critical tokens are concentrated in a small subset of the reasoning trace, and retaining additional tokens beyond that point yields no substantial gains. The gap between "Bottom" and "Top" further indicates that most of the trace consists of tokens that contribute little to the final answer.
</details>
Figure 12: Reasoning performance trends of Llama and Qwen models across six datasets as a function of thinking token retention ratio.
We present the evaluation results of R1-Llama (DeepSeek-R1-Distill-Llama-8B) and R1-Qwen (DeepSeek-R1-Distill-Qwen-7B) on the AIME24, AIME25, AMC23, GPQA-Diamond, GK23EN, and MATH-500 datasets. Figs. 10 and 11 illustrate how token importance scores vary over decoding steps for representative samples from each dataset, and Fig. 12 shows reasoning performance under different thinking-token retention strategies and retention ratios.
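The retention strategies compared in Fig. 12 can be sketched as follows. This is an illustrative reconstruction, not code from the DynTS repository; the function and variable names are our own:

```python
import random

def retain_tokens(scores, ratio, strategy="top", seed=0):
    """Select which thinking-token positions to retain, given per-token
    importance scores and a retention ratio (0 < ratio <= 1).

    Strategies mirror the ablation in Fig. 12:
      - "top":    keep the highest-scoring tokens
      - "bottom": keep the lowest-scoring tokens
      - "random": keep a uniform random subset
    """
    k = max(1, int(len(scores) * ratio))
    indices = list(range(len(scores)))
    if strategy == "top":
        indices.sort(key=lambda i: scores[i], reverse=True)
    elif strategy == "bottom":
        indices.sort(key=lambda i: scores[i])
    elif strategy == "random":
        random.Random(seed).shuffle(indices)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Restore positional order so the retained KV cache stays causally ordered.
    return sorted(indices[:k])

scores = [0.9, 0.1, 0.5, 0.7, 0.2]
print(retain_tokens(scores, 0.4, "top"))     # [0, 3]
print(retain_tokens(scores, 0.4, "bottom"))  # [1, 4]
```

Under this sketch, "Full" simply corresponds to retaining every position regardless of the ratio, which is why its curve is flat in Fig. 12.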
## Appendix C Detailed Experimental Setup
### C.1 Device and Environment
All experiments were conducted on a server equipped with 8 NVIDIA H800 GPUs. We utilize the Hugging Face transformers library and vLLM as the primary inference engine, and employ the veRL framework to optimize the parameters of the importance predictor.
### C.2 Training of Importance Predictor
We generate 5 diverse reasoning traces for each question in the MATH training dataset and retain only the traces that lead to the correct answer as training data. This process yields 7,142 and 7,064 valid samples for R1-Qwen and R1-Llama, respectively. The Importance Predictor is implemented as an MLP with dimensions of $(3584\to 7168\to 1792\to 1)$ for R1-Qwen and $(4096\to 8192\to 2048\to 1)$ for R1-Llama. We conduct the training using the veRL framework for a total of 15 epochs. The global batch size is set to 256, with a micro-batch size of 4 and gradient accumulation. Optimization is performed using the AdamW optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , and a weight decay of $0.01$ . The learning rate is initialized at $5\times 10^{-4}$ with a cosine decay schedule, and the gradient clipping threshold is set to $1.0$ .
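The predictor's forward pass can be sketched from the stated layer dimensions. The NumPy snippet below uses randomly initialized weights; the ReLU activations and the scalar output head are assumptions, since the paper specifies only the layer widths.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """Random He-scaled weights for an MLP with the given layer widths."""
    return [(rng.standard_normal((d_in, d_out)) * (2.0 / d_in) ** 0.5,
             np.zeros(d_out)) for d_in, d_out in zip(dims, dims[1:])]

def predict_importance(hidden_states, params):
    """Map per-token hidden states (n_tokens, d_model) to scalar scores.

    ReLU activations are an assumption; the paper states only the dimensions.
    """
    x = hidden_states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:      # no activation on the output layer
            x = np.maximum(x, 0.0)
    return x[:, 0]                   # one importance score per token

# R1-Qwen predictor dimensions from the paper: 3584 -> 7168 -> 1792 -> 1
params = init_mlp([3584, 7168, 1792, 1])
scores = predict_importance(rng.standard_normal((4, 3584)), params)
print(scores.shape)  # (4,)
```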
### C.3 Inference Setting
<details>
<summary>x47.png Details</summary>

### Visual Description
\n
## Bar Chart: KV Cache Length Comparison - Transformers vs. DynTS
### Overview
This bar chart compares the KV Cache Length (in 10^3 units) for two models, "Transformers" and "DynTS (Ours)", across six different datasets: AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, and MATH500. The chart visually represents the performance difference in terms of KV cache usage between the two models on each dataset. Above each DynTS bar is a multiplier indicating how much smaller the cache length is compared to the Transformers model.
### Components/Axes
* **X-axis:** Dataset names: AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, MATH500.
* **Y-axis:** KV Cache Length (10^3). Scale ranges from 0.0 to 20.0.
* **Legend:**
* Blue: Transformers
* Red: DynTS (Ours)
* **Labels:** Each DynTS bar has a label indicating the compression factor relative to the Transformers model (e.g., "3.4x").
### Detailed Analysis
The chart consists of paired bars for each dataset, representing the KV Cache Length for Transformers and DynTS.
* **AIME24:** Transformers: approximately 16.5. DynTS: approximately 4.8. Compression factor: 3.4x.
* **AIME25:** Transformers: approximately 17.0. DynTS: approximately 5.0. Compression factor: 3.4x.
* **AMC23:** Transformers: approximately 16.8. DynTS: approximately 5.0. Compression factor: 3.3x.
* **GaoKao2023En:** Transformers: approximately 19.0. DynTS: approximately 5.0. Compression factor: 3.8x.
* **GPQA-D:** Transformers: approximately 10.5. DynTS: approximately 1.9. Compression factor: 5.6x.
* **MATH500:** Transformers: approximately 17.0. DynTS: approximately 2.4. Compression factor: 5.7x.
The Transformers bars are consistently higher than the DynTS bars across all datasets, indicating a larger KV Cache Length for Transformers.
### Key Observations
* DynTS consistently uses significantly less KV cache than Transformers across all datasets.
* The compression factor varies between 3.3x and 5.7x, with the largest compression achieved on the GPQA-D and MATH500 datasets.
* The difference in KV Cache Length is most pronounced on the GPQA-D and MATH500 datasets.
* The KV Cache Length for Transformers ranges from approximately 10.5 (GPQA-D) to 19.0 (GaoKao2023En), with most datasets clustered around 16.5–19.0.
### Interpretation
The data indicates that DynTS is substantially more KV cache-efficient than the Transformers baseline. The effect is most pronounced on GPQA-D and MATH500, where DynTS achieves compression factors of 5.6x and 5.7x, respectively, while the consistent reduction across all six datasets shows that the benefit is not dataset-specific. The relatively stable cache length of the Transformers baseline indicates that its memory footprint is driven mainly by generation length rather than by the task. The multiplier labels directly quantify the reduction in memory usage, which is important because KV cache size is a significant factor in the computational cost and scalability of transformer-based models.
</details>
(a) R1-Llama
<details>
<summary>x48.png Details</summary>

### Visual Description
## Bar Chart: KV Cache Length Comparison - Transformers vs. DynTS
### Overview
This bar chart compares the KV Cache Length (in 10^3 units) achieved by two models, "Transformers" and "DynTS (Ours)", across six different datasets: AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, and MATH500. The chart visually represents the performance difference in terms of KV cache length, with "DynTS" consistently achieving significantly shorter cache lengths. Above each "DynTS" bar is a multiplier indicating how much shorter the cache length is compared to the "Transformers" model.
### Components/Axes
* **X-axis:** Dataset names: AIME24, AIME25, AMC23, GaoKao2023En, GPQA-D, MATH500.
* **Y-axis:** KV Cache Length (10^3). Scale ranges from 0.0 to 20.0, with increments of 2.5.
* **Legend:**
* Blue: Transformers
* Red: DynTS (Ours)
* **Labels:** Each "DynTS" bar has a label indicating the compression factor relative to the Transformers model (e.g., "3.4x", "3.8x").
### Detailed Analysis
The chart consists of paired bars for each dataset, representing the KV Cache Length for Transformers and DynTS.
* **AIME24:**
* Transformers: Approximately 16.5 (10^3)
* DynTS: Approximately 4.8 (10^3). Compression factor: 3.4x
* **AIME25:**
* Transformers: Approximately 17.5 (10^3)
* DynTS: Approximately 5.0 (10^3). Compression factor: 3.4x
* **AMC23:**
* Transformers: Approximately 16.8 (10^3)
* DynTS: Approximately 5.0 (10^3). Compression factor: 3.3x
* **GaoKao2023En:**
* Transformers: Approximately 19.0 (10^3)
* DynTS: Approximately 5.0 (10^3). Compression factor: 3.8x
* **GPQA-D:**
* Transformers: Approximately 16.5 (10^3)
* DynTS: Approximately 3.0 (10^3). Compression factor: 5.5x
* **MATH500:**
* Transformers: Approximately 17.0 (10^3)
* DynTS: Approximately 3.0 (10^3). Compression factor: 5.7x
The "Transformers" bars are consistently taller than the "DynTS" bars across all datasets. The speedup factors above the "DynTS" bars indicate the magnitude of the reduction in KV Cache Length.
### Key Observations
* "DynTS" consistently achieves a significantly shorter KV Cache Length compared to "Transformers" across all datasets.
* The compression factor varies between 3.3x and 5.7x.
* The largest compression is observed on the GPQA-D and MATH500 datasets (5.5x and 5.7x, respectively).
* The KV Cache Length for "Transformers" remains relatively stable across all datasets, fluctuating between approximately 16.5 and 19.0 (10^3).
### Interpretation
The data demonstrates that "DynTS (Ours)" is substantially more efficient in terms of KV Cache Length than the "Transformers" baseline. It requires far less memory to store the KV cache, a critical factor for large language models handling long sequences. The varying compression factors indicate that the benefit is dataset-dependent, with the largest gains on GPQA-D and MATH500, while the consistent reduction across all datasets reflects the token-selection mechanism rather than a dataset-specific quirk. The relatively stable cache length of the Transformers baseline suggests that its memory usage is driven mainly by generation length. In practice, this reduction translates to faster inference and the ability to handle longer sequences on the same hardware.
</details>
(b) R1-Qwen
Figure 13: Comparison of average KV Cache length between standard Transformers and DynTS across six benchmarks. The arrows and annotations indicate the compression ratio achieved by our method on each dataset.
<details>
<summary>x49.png Details</summary>

### Visual Description
## Line Chart: Performance Comparison of Transformers and DynTS
### Overview
This image presents a comparative performance analysis of two models, "Transformers" and "DynTS", across three metrics: Throughput (TPS), KV Memory (GB), and GFLOPs. The performance is evaluated as a function of "Decoding Steps", ranging from 0 to 15k. The chart consists of three sub-charts stacked vertically, each representing one of the performance metrics. Each sub-chart displays two lines representing the performance of each model. Numerical multipliers are displayed above each data point.
### Components/Axes
* **X-axis (all sub-charts):** Decoding Steps (Scale: 0k, 2k, 5k, 7k, 10k, 12k, 15k)
* **Top Sub-chart (Throughput):**
* Y-axis: Throughput (TPS) (Scale: 0, 250, 500, 750, 1000, 1250)
* **Middle Sub-chart (KV Memory):**
* Y-axis: KV Memory (GB) (Scale: 0, 5, 10, 15)
* **Bottom Sub-chart (GFLOPs):**
* Y-axis: GFLOPs (Scale: 0, 20, 40)
* **Legend (top-right of the entire chart):**
* Black Line: Transformers
* Red Line: DynTS
### Detailed Analysis or Content Details
**1. Throughput (TPS)**
* **Transformers (Black Line):** Throughput falls steadily as decoding progresses, because every step attends over an ever-growing full KV cache.
* **DynTS (Red Line):** Throughput declines far more gently, since periodic KV Cache Selection keeps the attended cache bounded; the DynTS curve stays above the Transformers curve throughout the annotated range.
* **Multipliers (DynTS speedup over Transformers at each annotated step):** 1.39x, 1.86x, 2.24x, 2.74x, 3.10x, 3.59x, 3.74x.
**2. KV Memory (GB)**
* **Transformers (Black Line):** Memory grows roughly linearly with decoding steps, from near zero to approximately 14 GB at 15k steps.
* **DynTS (Red Line):** Memory grows between selection steps and drops sharply at each periodic eviction, producing a bounded sawtooth that remains at a few GB throughout decoding.
* **Multipliers (DynTS memory relative to Transformers at each annotated step):** 0.64x, 0.47x, 0.37x, 0.31x, 0.26x, 0.23x, 0.20x.
**3. GFLOPs**
* **Transformers (Black Line) vs. DynTS (Red Line):** The DynTS curve sits below the Transformers curve, and the gap widens with the decoding step as eviction keeps the attended cache small; a zoomed inset over steps 4500 to 4900 shows the per-selection sawtooth drops in detail.
* **Multipliers (DynTS compute relative to Transformers at each annotated step):** 0.85x, 0.74x, 0.65x, 0.58x, 0.53x, 0.48x, 0.44x.
### Key Observations
* DynTS sustains higher throughput than Transformers as decoding progresses, with the advantage widening from 1.39x to a peak of 3.74x.
* DynTS requires significantly less KV memory than Transformers, ending at roughly 0.20x of the baseline footprint.
* DynTS likewise reduces compute, reaching roughly 0.44x of the baseline GFLOPs by the last annotated step.
* The sawtooth pattern in the DynTS curves reflects the periodic KV Cache Selection steps, which repeatedly bound resource accumulation.
### Interpretation
The data shows that the small overhead of importance prediction and periodic selection buys large, compounding savings: because its cache is repeatedly compressed, DynTS scales far better in throughput, memory, and compute with sequence length than the full-cache Transformers baseline. The widening gap at longer decoding steps indicates that the advantage grows exactly where the full-cache baseline becomes bottlenecked, consistent with attention cost growing with cache length. These trends mirror the R1-Llama results in the main text, supporting the claim that the efficiency gains carry over across model architectures.
</details>
Figure 14: Real-time throughput, memory, and compute overhead tracking for R1-Qwen over total decoding steps. The results exhibit a trend consistent with R1-Llama, confirming the scalability of DynTS across different model architectures.
We implement DynTS and all baseline methods using the Hugging Face transformers library for KV cache compression. To ensure fairness, we use the effective KV cache length as the compression signal: whenever the cache size reaches the predefined budget, all methods are restricted to retain an identical number of KV pairs. For SnapKV, H2O, and R-KV, we set the local window sizes and retention ratios identical to those of our method. For SepLLM, we preserve the separator tokens and evict the earliest generated non-separator tokens until the total cache length matches ours. For StreamingLLM, we set the same sink token size, following (Xiao et al., 2024). We set the number of parallel generated sequences to 20. The generation hyperparameters are configured as follows: temperature $T=0.6$ , top-$p=0.95$ , top-$k=20$ , and a maximum new token limit of 16,384. We conduct 5 independent sampling runs for all datasets. Ablation studies on the local window, retention ratio, and budget are conducted across four challenging benchmarks (AIME24, AIME25, AMC23, and GPQA-D) with all other configurations held constant. Fig. 6 and 8 report the mean results across these datasets.
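The budget rule described above (compress whenever the effective cache length reaches the budget, keeping a recent local window plus the highest-importance older entries) can be sketched as follows. This is a simplified illustration with hypothetical names, not the released implementation.

```python
def compress_cache(cache_positions, scores, budget, local_window, ratio):
    """Evict KV entries once the cache reaches `budget`.

    Keeps the most recent `local_window` positions untouched and, among the
    older positions, retains the top `ratio` fraction by importance score.
    A simplification of the setup described in C.3; names are illustrative.
    """
    if len(cache_positions) < budget:
        return cache_positions                      # under budget: no-op
    recent = cache_positions[-local_window:]
    older = cache_positions[:-local_window]
    k = int(len(older) * ratio)
    ranked = sorted(older, key=lambda p: scores[p], reverse=True)
    kept_older = sorted(ranked[:k])                 # preserve token order
    return kept_older + recent

positions = list(range(10))
scores = {p: (p * 37) % 10 for p in positions}      # toy importance scores
print(compress_cache(positions, scores, budget=10, local_window=4, ratio=0.5))
```

Under this rule, every method compared in the experiments ends up retaining the same number of KV pairs at each compression trigger, which is what makes the budget comparison fair.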
## Appendix D Additional Results
Inference Efficiency on R1-Qwen. Complementing the efficiency analysis of R1-Llama presented in the main text, Figure 14 illustrates the real-time throughput, memory footprint, and computational overhead for R1-Qwen. Consistent with previous observations, DynTS exhibits significant scalability advantages over the standard Transformer baseline as the sequence length increases, achieving a peak throughput speedup of $3.74\times$ while compressing the memory footprint to $0.20\times$ and reducing the cumulative computational cost (GFLOPs) to $0.44\times$ after the last KV cache selection step. The recurrence of the characteristic sawtooth pattern further validates the robustness of our periodic KV Cache Selection mechanism, demonstrating that it effectively bounds resource accumulation and delivers substantial efficiency gains across diverse LRM architectures by continuously evicting non-essential thinking tokens.
KV Cache Compression Ratio. Figure 13 explicitly visualizes the reduction in KV Cache length achieved by DynTS across diverse reasoning tasks. By dynamically filtering out non-essential thinking tokens, our method drastically reduces the memory footprint compared to the full-cache Transformers baseline. For instance, on the MATH500 benchmark, DynTS achieves an impressive compression ratio of up to $5.7\times$ , reducing the average cache length from over 17,000 tokens to the constrained budget of 3,000. These results directly explain the memory and throughput advantages reported in the efficiency analysis, confirming that DynTS successfully maintains high reasoning accuracy with a fraction of the memory cost.
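The annotated compression factors follow directly from the ratio of average cache lengths. As a quick check against the bar-chart annotations (the inputs below are approximate values read off Figure 13(a), not exact measurements):

```python
# Approximate average KV cache lengths (in tokens), read off Figure 13(a).
full_cache = {"AIME24": 16500, "GaoKao2023En": 19000, "MATH500": 17000}
dynts_cache = {"AIME24": 4800, "GaoKao2023En": 5000, "MATH500": 3000}

for name in full_cache:
    factor = full_cache[name] / dynts_cache[name]
    print(f"{name}: {factor:.1f}x")   # matches the 3.4x / 3.8x / 5.7x labels
```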
<details>
<summary>x50.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
The image presents a line chart displaying training performance metrics over 400 steps. The chart consists of two subplots: the top subplot shows the MSE Loss and Kendall's Tau correlation coefficient, while the bottom subplot displays the overlap rate for different percentile groups (Top-20%, Top-30%, etc.). The x-axis represents the training step, and the y-axes represent the respective metric values.
### Components/Axes
* **X-axis (Both Subplots):** Step (ranging from 0 to 400)
* **Y-axis (Top Subplot):** Value (ranging from 0 to 16, approximately)
* **Y-axis (Bottom Subplot):** Overlap Rate (%) (ranging from 0 to 100)
* **Legend (Top Subplot):**
* Blue Line: MSE Loss
* Orange Line: Kendall
* **Legend (Bottom Subplot):** Located in the bottom-right corner.
* Dark Gray Line: Top-20%
* Purple Line: Top-30%
* Dark Blue Line: Top-40%
* Teal Line: Top-50%
* Green Line: Top-60%
* Yellow Line: Top-70%
* Light Blue Line: Top-80%
* Gray Line: Top-90%
### Detailed Analysis or Content Details
**Top Subplot (MSE Loss & Kendall's Tau):**
* **MSE Loss (Blue Line):** Starts at approximately 16, drops rapidly within the first few steps, and then stabilizes at a low value for the remainder of training.
* **Kendall (Orange Line):** The Kendall rank correlation rises quickly during the early steps and then stabilizes, indicating that the predicted importance ranking increasingly agrees with the ground truth.
**Bottom Subplot (Overlap Rate):**
* **Top-20% (Dark Gray):** Starts at approximately 18%, increases to around 65% by step 100, and then fluctuates between 60% and 70% for the remainder of the training.
* **Top-30% (Purple):** Starts at approximately 20%, increases to around 68% by step 100, and then fluctuates between 62% and 72%.
* **Top-40% (Dark Blue):** Starts at approximately 22%, increases to around 70% by step 100, and then fluctuates between 64% and 74%.
* **Top-50% (Teal):** Starts at approximately 24%, increases to around 72% by step 100, and then fluctuates between 66% and 76%.
* **Top-60% (Green):** Starts at approximately 26%, increases to around 74% by step 100, and then fluctuates between 68% and 78%.
* **Top-70% (Yellow):** Starts at approximately 28%, increases to around 76% by step 100, and then fluctuates between 70% and 80%.
* **Top-80% (Light Blue):** Starts at approximately 30%, increases to around 78% by step 100, and then fluctuates between 72% and 82%.
* **Top-90% (Gray):** Starts at approximately 32%, increases to around 80% by step 100, and then fluctuates between 74% and 84%.
All overlap rate lines exhibit a similar upward trend initially, converging to a relatively stable range after step 100. The higher percentile groups (Top-90%) consistently show higher overlap rates than the lower percentile groups (Top-20%).
### Key Observations
* The MSE Loss decreases rapidly at first and then plateaus, indicating that the predictor is learning.
* Kendall's Tau rises and stabilizes, showing that the predicted ranking converges toward the ground-truth importance ranking.
* The overlap rates for all percentile groups increase with training steps and converge after approximately 100 steps, suggesting the predictor's performance has stabilized.
* The overlap rates are positively correlated with the percentile cutoff: a wider predicted top-p set naturally captures more of the ground-truth top tokens.
### Interpretation
The decreasing MSE Loss and rising Kendall correlation together indicate that the predictor learns a useful ranking, not merely a regression fit, and that it converges quickly. The stabilization of all metrics after roughly 100 steps suggests the model has converged. The monotone relationship between the percentile cutoff and the overlap rate is expected, since enlarging the predicted candidate set can only increase recall of the ground-truth top tokens; the key signal is that even the tighter cutoffs reach a high overlap, which supports using the predictor for KV cache selection.
</details>
Figure 15: Training dynamics of the Importance Predictor on R1-Qwen. The top panel displays the convergence of MSE Loss and Kendall correlation, while the bottom panel shows the overlap rate of the top-$20\%$ ground-truth tokens within the top-$p\%$ ($p\in[20,90]$) predicted tokens across training steps.
Importance Predictor Analysis for R1-Qwen. Complementing the findings on R1-Llama, Figure 15 depicts the learning trajectory of the Importance Predictor for the R1-Qwen model. The training process exhibits a similar convergence pattern: the MSE loss rapidly decreases and stabilizes, while the Kendall rank correlation coefficient steadily improves, indicating that the simple MLP architecture effectively captures the importance ranking of thinking tokens in R1-Qwen. Furthermore, the bottom panel highlights the high overlap rate between the predicted and ground-truth critical tokens; notably, the overlap rate for the top-40% of tokens exceeds 80% after approximately 200 steps. This high alignment confirms that the Importance Predictor can accurately identify pivotal tokens within R1-Qwen’s reasoning process, providing a reliable basis for the subsequent KV cache compression.
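The two diagnostics plotted in Figure 15, the top-k overlap rate and the Kendall rank correlation, can be computed as below. The exact definitions in the paper's code may differ slightly; this is a self-contained sketch.

```python
from itertools import combinations

def topk_overlap(pred_scores, true_scores, k_frac=0.2, p_frac=0.4):
    """Fraction of the ground-truth top-k tokens found in the predicted top-p set.

    Mirrors the overlap-rate metric plotted in Figure 15 (illustrative).
    """
    n = len(pred_scores)
    top_true = set(sorted(range(n), key=lambda i: true_scores[i], reverse=True)[: int(n * k_frac)])
    top_pred = set(sorted(range(n), key=lambda i: pred_scores[i], reverse=True)[: int(n * p_frac)])
    return len(top_true & top_pred) / max(1, len(top_true))

def kendall_tau(a, b):
    """Plain O(n^2) Kendall rank correlation (no tie correction)."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

true = [0.9, 0.8, 0.1, 0.4, 0.3, 0.7, 0.2, 0.6, 0.5, 0.0]
pred = [0.8, 0.9, 0.2, 0.3, 0.4, 0.6, 0.1, 0.7, 0.5, 0.05]
print(topk_overlap(pred, true))  # fraction of true top-20% inside predicted top-40%
print(round(kendall_tau(pred, true), 2))
```

A rising overlap rate together with a rising Kendall correlation is what distinguishes a predictor that learns the ranking from one that merely fits the regression targets.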
Budget Impact Analysis over Benchmarks. Figure 16 illustrates the granular impact of KV budget constraints on reasoning performance and system throughput. Focusing on R1-Llama, we observe a consistent trade-off across all datasets: increasing the KV budget significantly boosts reasoning accuracy at the cost of linearly decreasing throughput. Specifically, on the challenging AIME24 benchmark, expanding the budget from 2,500 to 5,000 tokens improves Pass@1 accuracy from $40.0\%$ to $49.3\%$ , while the throughput decreases from $\sim$ 600 to $\sim$ 445 tokens/s. This suggests that while a tighter budget accelerates inference, a larger budget is essential for solving complex problems requiring extensive context retention. Experimental results on R1-Qwen exhibit a highly similar trend, confirming that the performance characteristics of DynTS are model-agnostic. Overall, our method allows users to flexibly balance efficiency and accuracy based on specific deployment requirements.
<details>
<summary>x51.png Details</summary>

### Visual Description
\n
## Bar Chart with Line Overlay: R1-Llama Performance vs. KV Budget
### Overview
The image presents four bar charts, each displaying the relationship between "Pass@1" and "Throughput (TPS)" for the R1-Llama model across different KV Budgets. Each chart corresponds to a specific benchmark dataset: AIME24, AIME25, AMC23, or GPQA-D. The charts show how Pass@1 changes with increasing KV Budget, while a line graph overlays the corresponding Throughput values.
### Components/Axes
* **X-axis:** KV Budget (ranging from 2500 to 4500, with increments of 500).
* **Y-axis (Left):** Pass@1 (ranging from approximately 20 to 100, depending on the accelerator).
* **Y-axis (Right):** Throughput (TPS) (ranging from approximately 300 to 700).
* **Bar Color:** Light Blue, representing Pass@1.
* **Line Color:** Orange, representing Throughput.
* **Legend:**
* Pass@1 (Light Blue)
* Throughput (Orange)
* **Titles:** Each chart is titled "R1-Llama | [Dataset Name]".
### Detailed Analysis or Content Details
**1. R1-Llama | AIME24**
* **Trend (Pass@1):** The Pass@1 values generally increase with increasing KV Budget.
* **Data Points (Pass@1):**
* KV Budget 2500: Approximately 40.0
* KV Budget 3000: Approximately 44.7
* KV Budget 3500: Approximately 45.3
* KV Budget 4000: Approximately 39.3
* KV Budget 4500: Approximately 49.3
* **Trend (Throughput):** The Throughput values decrease with increasing KV Budget.
* **Data Points (Throughput):**
* KV Budget 2500: Approximately 650 TPS
* KV Budget 3000: Approximately 600 TPS
* KV Budget 3500: Approximately 550 TPS
* KV Budget 4000: Approximately 500 TPS
* KV Budget 4500: Approximately 450 TPS
**2. R1-Llama | AIME25**
* **Trend (Pass@1):** Pass@1 values show a slight increase initially, then plateau and slightly decrease with increasing KV Budget.
* **Data Points (Pass@1):**
* KV Budget 2500: Approximately 26.0
* KV Budget 3000: Approximately 29.3
* KV Budget 3500: Approximately 28.0
* KV Budget 4000: Approximately 26.0
* KV Budget 4500: Approximately 25.3
* **Trend (Throughput):** Throughput decreases with increasing KV Budget.
* **Data Points (Throughput):**
* KV Budget 2500: Approximately 600 TPS
* KV Budget 3000: Approximately 550 TPS
* KV Budget 3500: Approximately 500 TPS
* KV Budget 4000: Approximately 450 TPS
* KV Budget 4500: Approximately 400 TPS
**3. R1-Llama | AMC23**
* **Trend (Pass@1):** Pass@1 drops from its peak at the smallest budget, then partially recovers and plateaus at larger budgets.
* **Data Points (Pass@1):**
* KV Budget 2500: Approximately 90.3
* KV Budget 3000: Approximately 79.0
* KV Budget 3500: Approximately 84.0
* KV Budget 4000: Approximately 87.0
* KV Budget 4500: Approximately 87.0
* **Trend (Throughput):** Throughput decreases with increasing KV Budget.
* **Data Points (Throughput):**
* KV Budget 2500: Approximately 700 TPS
* KV Budget 3000: Approximately 600 TPS
* KV Budget 3500: Approximately 500 TPS
* KV Budget 4000: Approximately 400 TPS
* KV Budget 4500: Approximately 300 TPS
**4. R1-Llama | GPQA-D**
* **Trend (Pass@1):** Pass@1 values generally increase with increasing KV Budget.
* **Data Points (Pass@1):**
* KV Budget 2500: Approximately 37.9
* KV Budget 3000: Approximately 45.8
* KV Budget 3500: Approximately 45.1
* KV Budget 4000: Approximately 45.5
* KV Budget 4500: Approximately 46.4
* **Trend (Throughput):** Throughput decreases with increasing KV Budget.
* **Data Points (Throughput):**
* KV Budget 2500: Approximately 600 TPS
* KV Budget 3000: Approximately 550 TPS
* KV Budget 3500: Approximately 500 TPS
* KV Budget 4000: Approximately 450 TPS
* KV Budget 4500: Approximately 400 TPS
### Key Observations
* There's a consistent inverse relationship between Pass@1 and Throughput across all datasets. Increasing KV Budget generally improves Pass@1 but reduces Throughput.
* AMC23 starts with the highest Pass@1, dips at intermediate budgets, and partially recovers as the KV Budget increases further.
* AIME25 has the lowest overall Pass@1 values.
* AIME24 and GPQA-D show a more consistent increase in Pass@1 with increasing KV Budget.
### Interpretation
The data suggests a trade-off between accuracy (Pass@1) and speed (Throughput) when adjusting the KV Budget for the R1-Llama model. Higher KV Budgets prioritize accuracy at the expense of decoding speed. The varying behavior across the datasets (AIME24, AIME25, AMC23, GPQA-D) indicates that the optimal KV Budget depends on the difficulty and context requirements of each benchmark, together with the desired balance between accuracy and throughput. The dip in Pass@1 for AMC23 at intermediate budgets likely reflects sampling variance on a small dataset rather than a systematic effect, since accuracy partially recovers at larger budgets. The consistent inverse relationship between the two metrics reflects the growing attention cost of a larger retained cache.
</details>
<details>
<summary>x52.png Details</summary>

### Visual Description
\n
## Bar Chart with Line Overlay: Pass@1 vs. Throughput for Different KV Budgets and Models
### Overview
The image presents four bar charts, each representing a different benchmark dataset (AIME24, AIME25, AMC23, and GPQA-D) evaluated with R1-Qwen. Each chart displays the Pass@1 metric (left y-axis) as a bar graph and the Throughput (right y-axis) as a line graph, both plotted against varying KV Budget values on the x-axis.
### Components/Axes
* **X-axis:** KV Budget (ranging from 2500 to 5000 in increments of 500). Labelled "KV Budget".
* **Left Y-axis:** Pass@1 (ranging from 20 to 100). Labelled "Pass@1".
* **Right Y-axis:** Throughput (ranging from 600 to 800). Labelled "Throughput (TPS)".
* **Legend:**
* Blue bars: Pass@1
* Orange line: Throughput
* **Chart Titles:**
* R1-Qwen | AIME24
* R1-Qwen | AIME25
* R1-Qwen | AMC23
* R1-Qwen | GPQA-D
### Detailed Analysis or Content Details
**1. R1-Qwen | AIME24**
* **Pass@1:** The bars show an increasing trend. Approximate values: 27.0 (KV Budget 2500), 34.0 (3000), 42.0 (3500), 46.0 (4000), 48.0 (4500), 52.0 (5000).
* **Throughput:** The line slopes downward. Approximate values: 780 (KV Budget 2500), 760 (3000), 740 (3500), 720 (4000), 700 (4500), 680 (5000).
**2. R1-Qwen | AIME25**
* **Pass@1:** The bars show an increasing trend. Approximate values: 20.0 (KV Budget 2500), 26.0 (3000), 33.3 (3500), 36.7 (4000), 38.0 (4500), 34.0 (5000).
* **Throughput:** The line slopes downward. Approximate values: 790 (KV Budget 2500), 770 (3000), 750 (3500), 730 (4000), 710 (4500), 690 (5000).
**3. R1-Qwen | AMC23**
* **Pass@1:** The bars show an increasing trend. Approximate values: 70.0 (KV Budget 2500), 78.5 (3000), 84.5 (3500), 87.0 (4000), 85.5 (4500), 85.0 (5000).
* **Throughput:** The line slopes downward. Approximate values: 790 (KV Budget 2500), 770 (3000), 750 (3500), 730 (4000), 710 (4500), 690 (5000).
**4. R1-Qwen | GPQA-D**
* **Pass@1:** The bars show an increasing trend. Approximate values: 40.6 (KV Budget 2500), 45.0 (3000), 48.8 (3500), 46.7 (4000), 48.4 (4500), 48.2 (5000).
* **Throughput:** The line slopes downward. Approximate values: 780 (KV Budget 2500), 760 (3000), 740 (3500), 720 (4000), 700 (4500), 680 (5000).
### Key Observations
* All four datasets exhibit a negative correlation between KV Budget and Throughput. As the KV Budget increases, the Throughput decreases.
* All four datasets exhibit a positive correlation between KV Budget and Pass@1. As the KV Budget increases, the Pass@1 increases.
* AMC23 consistently demonstrates the highest Pass@1 values across all KV Budget levels.
* AIME25 has the lowest Pass@1 values.
* The rate of increase in Pass@1 appears to slow down at higher KV Budget values for all models.
### Interpretation
The data suggests a trade-off between accuracy (Pass@1) and speed (Throughput) when adjusting the KV Budget. Increasing the KV Budget improves the model's ability to answer correctly (higher Pass@1), but at the cost of decoding speed (lower Throughput), a common pattern when retaining more context requires more attention computation per step.
The large difference in Pass@1 between AMC23 and AIME25 indicates that AMC23 is an easier benchmark for R1-Qwen, not that a different model is being evaluated. The consistent downward trend in Throughput across all datasets reflects the growing cost of attending over a larger retained cache.
The slowing rate of increase in Pass@1 at higher KV Budget values suggests diminishing returns: beyond a certain point, a larger budget yields only marginal accuracy gains while continuing to reduce Throughput. This makes the budget a practical knob for balancing accuracy and speed against the application's requirements.
</details>
Figure 16: Impact of budget on Pass@1 and throughput for R1-Llama (top) and R1-Qwen (bottom) across AIME24, AIME25, AMC23, and GPQA-D datasets. The blue bars represent accuracy (left y-axis), and the orange lines represent throughput (right y-axis).
<details>
<summary>x53.png Details</summary>

### Visual Description
## Heatmap: Performance Comparison of Models
### Overview
The image presents a heatmap comparing the performance of the R1-Llama model on four benchmark datasets (AIME24, AIME25, AMC23, and GPQA-D) across varying retention ratios and local window sizes. The performance metric is "Pass@1", represented by the color intensity.
### Components/Axes
* **X-axis:** Ratio, ranging from 0.1 to 0.5 with increments of 0.1.
* **Y-axis:** Local Window Size, with values of 500, 1000, 2000, and 3000.
* **Color Scale (Right):** Pass@1, ranging from approximately 24 (dark blue) to 89 (dark green).
* **Titles:** Each heatmap is labeled with the model name (e.g., "R1-Llama | AIME24").
* **Layout:** Four heatmaps are arranged horizontally, each representing a different model.
### Detailed Analysis
Each heatmap displays a grid of values corresponding to the combination of Ratio and Local Window Size. The values are color-coded based on the Pass@1 score.
**R1-Llama | AIME24:**
* **Trend:** Generally, performance is relatively stable across ratios for each window size. There's a slight tendency for performance to decrease with increasing ratio at window size 2000.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 43.7
* Ratio 0.1, Window 1000: 49.3
* Ratio 0.1, Window 2000: 47.3
* Ratio 0.1, Window 3000: 45.3
* Ratio 0.5, Window 500: 42.7
* Ratio 0.5, Window 1000: 46.0
* Ratio 0.5, Window 2000: 43.3
* Ratio 0.5, Window 3000: 46.7
* The lowest value is approximately 42.7, and the highest is approximately 49.3.
**R1-Llama | AIME25:**
* **Trend:** Similar to AIME24, performance is relatively stable. There's a slight dip in performance at Ratio 0.3 and 0.4 for all window sizes.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 25.3
* Ratio 0.1, Window 1000: 26.7
* Ratio 0.1, Window 2000: 24.0
* Ratio 0.1, Window 3000: 27.3
* Ratio 0.5, Window 500: 26.7
* Ratio 0.5, Window 1000: 30.7
* Ratio 0.5, Window 2000: 26.7
* Ratio 0.5, Window 3000: 28.7
* The lowest value is approximately 24.0, and the highest is approximately 30.7.
**R1-Llama | AMC23:**
* **Trend:** Performance is consistently high across all ratios and window sizes. There's a slight increase in performance with increasing ratio up to 0.4, then a slight decrease.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 85.5
* Ratio 0.1, Window 1000: 87.0
* Ratio 0.1, Window 2000: 84.0
* Ratio 0.1, Window 3000: 87.5
* Ratio 0.5, Window 500: 86.0
* Ratio 0.5, Window 1000: 86.0
* Ratio 0.5, Window 2000: 86.0
* Ratio 0.5, Window 3000: 86.5
* The lowest value is approximately 84.0, and the highest is approximately 89.0.
**R1-Llama | GPQA D:**
* **Trend:** Performance is generally good, but lower than AMC23. There's a slight increase in performance with increasing ratio up to 0.3, then a slight decrease.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 44.1
* Ratio 0.1, Window 1000: 45.8
* Ratio 0.1, Window 2000: 45.1
* Ratio 0.1, Window 3000: 44.2
* Ratio 0.5, Window 500: 44.9
* Ratio 0.5, Window 1000: 45.4
* Ratio 0.5, Window 2000: 47.4
* Ratio 0.5, Window 3000: 46.5
* The lowest value is approximately 44.1, and the highest is approximately 47.4.
### Key Observations
* R1-Llama achieves its highest Pass@1 on AMC23, consistently across all conditions.
* R1-Llama's performance is lowest on AIME25.
* The impact of Local Window Size on performance varies between models.
* The Ratio has a relatively small impact on performance for most models.
### Interpretation
The heatmaps show a clear difficulty hierarchy among the four benchmarks. R1-Llama is most robust on AMC23, achieving high Pass@1 scores regardless of the ratio or local window size, and least effective on AIME25. The relatively stable performance across different ratios suggests that, within the tested range, the retention ratio does not drastically change the model's ability to solve the problems. The varying impact of local window size indicates that performance is sensitive to the recent-context window, though the specific relationship differs between benchmarks. These results can inform hyperparameter selection for specific applications; the performance gaps across panels mainly reflect differences in benchmark difficulty rather than in the underlying model.
</details>
<details>
<summary>x54.png Details</summary>

### Visual Description
## Heatmap: Pass@1 Performance Across Models and Parameters
### Overview
The image presents a heatmap comparing the Pass@1 performance of the R1-Qwen model on four benchmarks (R1-Qwen | AIME24, R1-Qwen | AIME25, R1-Qwen | AMC23, and R1-Qwen | GPQA D) across varying combinations of 'Ratio' and 'Local Window Size'. The heatmap uses a color gradient to represent the Pass@1 values, with cooler colors (blues) indicating lower performance and warmer colors (yellows/greens) indicating higher performance.
### Components/Axes
* **X-axis:** 'Ratio', ranging from 0.1 to 0.5, with markers at 0.1, 0.2, 0.3, 0.4, and 0.5.
* **Y-axis:** 'Local Window Size', ranging from 500 to 2000, with markers at 500, 1000, 1500, and 2000.
* **Color Scale:** Represents Pass@1 values, ranging from approximately 32 (dark blue) to 88 (yellow/green).
* **Titles:** Each heatmap is labeled with the model and benchmark name (e.g., "R1-Qwen | AIME24").
* **Four Heatmaps:** Arranged horizontally, each representing a different model.
### Detailed Analysis
Here's a breakdown of the data within each heatmap, noting trends and approximate values.
**1. R1-Qwen | AIME24**
* **Trend:** Performance generally improves with increasing 'Ratio' and peaks at a moderate 'Local Window Size' (around 1000).
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 40.0
* Ratio 0.1, Window 1000: 47.3
* Ratio 0.1, Window 1500: 46.0
* Ratio 0.1, Window 2000: 42.7
* Ratio 0.5, Window 500: 48.7
* Ratio 0.5, Window 1000: 52.0
* Ratio 0.5, Window 1500: 47.3
* Ratio 0.5, Window 2000: 46.7
* Ratio 0.3, Window 1000: 45.3
* Ratio 0.4, Window 1000: 46.7
**2. R1-Qwen | AIME25**
* **Trend:** Similar to AIME24, performance generally improves with increasing 'Ratio' and peaks at a moderate 'Local Window Size'.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 34.0
* Ratio 0.1, Window 1000: 35.7
* Ratio 0.1, Window 1500: 34.7
* Ratio 0.1, Window 2000: 32.7
* Ratio 0.5, Window 500: 35.3
* Ratio 0.5, Window 1000: 36.7
* Ratio 0.5, Window 1500: 36.0
* Ratio 0.5, Window 2000: 34.7
* Ratio 0.3, Window 1000: 34.3
* Ratio 0.4, Window 1000: 33.3
**3. R1-Qwen | AMC23**
* **Trend:** Performance is consistently high, well above AIME24 and AIME25, and relatively insensitive to both 'Ratio' and 'Local Window Size'.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 85.0
* Ratio 0.1, Window 1000: 87.5
* Ratio 0.1, Window 1500: 88.0
* Ratio 0.1, Window 2000: 86.5
* Ratio 0.5, Window 500: 88.5
* Ratio 0.5, Window 1000: 87.0
* Ratio 0.5, Window 1500: 85.0
* Ratio 0.5, Window 2000: 85.5
* Ratio 0.3, Window 1000: 85.0
* Ratio 0.4, Window 1000: 86.5
**4. R1-Qwen | GPQA D**
* **Trend:** Performance is generally high, though slightly lower than AMC23, and remains relatively stable across both parameters, peaking at a moderate 'Local Window Size'.
* **Data Points (approximate):**
* Ratio 0.1, Window 500: 47.6
* Ratio 0.1, Window 1000: 48.1
* Ratio 0.1, Window 1500: 49.7
* Ratio 0.1, Window 2000: 46.7
* Ratio 0.5, Window 500: 46.1
* Ratio 0.5, Window 1000: 47.2
* Ratio 0.5, Window 1500: 48.0
* Ratio 0.5, Window 2000: 47.6
* Ratio 0.3, Window 1000: 47.5
* Ratio 0.4, Window 1000: 46.4
### Key Observations
* **Benchmark Performance:** R1-Qwen | AMC23 consistently exhibits the highest Pass@1 values across all parameter combinations, while R1-Qwen | AIME25 shows the lowest.
* **Parameter Interaction:** On the harder benchmarks (AIME24 and AIME25), higher retention ratios and moderate local window sizes generally improve performance, whereas AMC23 and GPQA D are comparatively insensitive to both parameters.
* **Performance Range:** The Pass@1 values vary significantly across models, ranging from approximately 32 to 88.
### Interpretation
The heatmaps demonstrate the impact of 'Ratio' and 'Local Window Size' on R1-Qwen's Pass@1 performance. The consistent trends suggest that these hyperparameters meaningfully affect how well the model answers questions under KV-cache compression. The relative stability on AMC23 and GPQA D indicates that the model is robust to parameter variation on those benchmarks, while the gaps between benchmarks mainly reflect differences in task difficulty. These results provide useful guidance for selecting the retention ratio and local window size for a given task.
</details>
Figure 17: Impact of different local window sizes and retention ratios of the selection window.
Local Window and Retention Ratio Analysis over Benchmarks. Figure 17 illustrates the sensitivity of model performance to variations in the local window size and the retention ratio of the selection window. A moderate local window (e.g., 1000–2000) typically yields optimal results, suggesting that the benefit of retaining recent context saturates relatively quickly. Furthermore, we observe that a retention ratio between $0.3$ and $0.4$ performs well across most benchmarks (e.g., AIME24, GPQA), where the model effectively balances compression and reasoning performance, whereas lower ratios (e.g., $0.1$) consistently degrade accuracy due to excessive information loss.
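To make the interaction between these two hyperparameters concrete, the sketch below (the function and variable names are hypothetical, not the actual DynTS implementation) estimates how many KV-cache entries would survive eviction for a given local window size and selection-window retention ratio, assuming the most recent tokens are always kept:

```python
def retained_kv_entries(seq_len: int, local_window: int, ratio: float) -> int:
    """Hypothetical KV budget estimate: the most recent `local_window`
    tokens are always retained, while only a `ratio` fraction of the
    older selection-window tokens is kept."""
    local = min(seq_len, local_window)        # recent tokens, kept verbatim
    selection = max(seq_len - local, 0)       # older tokens subject to eviction
    return local + int(ratio * selection)

# e.g. a 10,000-token trace with a 1,000-token local window and ratio 0.3
# keeps 1,000 + 0.3 * 9,000 = 3,700 KV entries instead of 10,000.
print(retained_kv_entries(10_000, 1_000, 0.3))  # 3700
```

Under this accounting, moving from ratio 0.1 to 0.4 roughly quadruples the number of retained selection-window entries, which is consistent with the accuracy recovery seen in Figure 17.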
## Appendix E Limitations and Future Work
Currently, DynTS is implemented on top of the transformers library, and we are actively working on deploying it to other inference frameworks such as vLLM and SGLang. Additionally, our current training data focuses on mathematical reasoning, which may limit performance in other domains such as coding or abstract reasoning. In the future, we plan to expand data diversity to adapt to a broader range of reasoning tasks. Moreover, constrained by computational resources, we used a relatively small dataset ($\sim 7{,}000$ samples) for training. This scale limits us to optimizing only the importance predictor's parameters, since optimizing all parameters on a small dataset may compromise the model's original generalization capabilities. This constraint may hinder the full potential of DynTS. Future work can focus on scaling up the dataset and jointly optimizing both the backbone and the predictor to elicit stronger capabilities.