2505.17813
# Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
## Abstract
Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers, up to $34.5\%$ more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes $k$ independent generations in parallel and halts computation once the first $m$ thinking processes are done. The final answer is chosen using majority voting among these $m$ chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings, using up to $40\%$ fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to $33\%$ wall-time reduction). To further validate our findings, we finetune LLMs using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
## 1 Introduction
Scaling test-time compute has been shown to be an effective strategy for improving the performance of reasoning LLMs on complex reasoning tasks (OpenAI, 2024; 2025; Team, 2025b). This method involves generating extensive thinking: very long sequences of tokens that contain enhanced reasoning trajectories, ultimately yielding more accurate solutions. Prior work has argued that longer model responses result in enhanced reasoning capabilities (Guo et al., 2025; Muennighoff et al., 2025; Anthropic, 2025). However, generating such long sequences also leads to high computational cost and slow decoding time due to the autoregressive nature of LLMs.
In this work, we demonstrate that scaling test-time compute does not necessarily improve model performance as previously thought. We start with a somewhat surprising observation. We take four leading reasoning LLMs, and for each generate multiple answers to each question in four complex reasoning benchmarks. We then observe that taking the shortest answer for each question strongly and consistently outperforms both a strategy that selects a random answer (up to $18.8\%$ gap) and one that selects the longest answer (up to $34.5\%$ gap). These performance gaps come on top of the natural reduction in sequence length: the shortest chains are $50\%$ and $67\%$ shorter than the random and longest chains, respectively.
Figure 1: Visual comparison between majority voting and our proposed method short-m@k with $m=1$ ("…" represents thinking time). Given $k$ parallel attempts for the same question, majority@$k$ waits until all attempts are done and performs majority voting among them. In contrast, our short-m@k method halts computation for all attempts as soon as the first $m$ attempts finish "thinking", which saves compute and time, and surprisingly also boosts performance in most cases.
Building on these findings, we propose short-m@k, a novel inference method for reasoning LLMs. short-m@k executes $k$ generations in parallel and terminates computation for all generations as soon as the first $m$ thinking processes are completed. The final answer is then selected via majority voting among those shortest chains, where ties are broken by taking the shortest answer among the tied candidates. See Figure 1 for a visualization.
We evaluate short-m@k using six reasoning LLMs, and compare it to majority voting, the most common aggregation method for evaluating reasoning LLMs on complex benchmarks (Wang et al., 2022; Abdin et al., 2025). We show that in low-compute regimes, short-1@k, i.e., taking the single shortest chain, outperforms majority voting, while significantly reducing the time and compute needed to generate the final answer. For example, using LN-Super-49B (Bercovich et al., 2025), short-1@k can reduce up to $40\%$ of the compute while giving the same performance as majority voting. Moreover, for high-compute regimes, short-3@k, which halts generation after three thinking chains are completed, consistently outperforms majority voting across all compute budgets, while running up to $33\%$ faster.
To gain further insight into the underlying mechanism of why shorter thinking is preferable, we analyze the generated reasoning chains. We first show that while taking the shorter reasoning chain is beneficial within an individual question, longer reasoning is still needed to solve harder questions, as claimed in recent studies (Anthropic, 2025; OpenAI, 2024; Muennighoff et al., 2025). Next, we analyze the backtracking and re-thinking behaviors of reasoning chains. We find that shorter reasoning paths are more effective, as they involve fewer backtracks with a longer average backtrack length. This finding holds both generally and when controlling for overall trajectory length.
To further strengthen our findings, we study whether training on short reasoning chains can lead to more accurate models. To do so, we finetune two Qwen-2.5 (Yang et al., 2024) models (7B and 32B) on three variants of the S1 dataset (Muennighoff et al., 2025): S1-short, S1-long, and S1-random, consisting of examples with the shortest, longest, and randomly sampled reasoning trajectories among several generations, respectively. Our experiments demonstrate that finetuning using S1-short not only yields shorter thinking lengths, but also improves model performance. Conversely, finetuning on S1-long increases reasoning time with no significant performance gains.
This work rethinks the test-time compute paradigm for reasoning LLMs, showing that longer thinking not only does not ensure better reasoning, but also leads to worse reasoning in most cases. Our short-m@k methods prioritize shorter reasoning, yielding improved performance and reduced computational costs for current reasoning LLMs. We also show that training reasoning LLMs with shorter reasoning trajectories can enhance performance and reduce costs. Our results pave the way towards a new era of efficient and high-performing reasoning LLMs.
## 2 Related work
Reasoning LLMs and test-time scaling.
Reasoning LLMs tackle complex tasks by employing extensive reasoning processes, often involving detailed, step-by-step trajectories (OpenAI, 2024; 2025; Team, 2025a;b; Abdin et al., 2025; Anthropic, 2025; Bercovich et al., 2025; Guo et al., 2025; DeepMind, 2025). This capability is fundamentally based on techniques like chain-of-thought (CoT; Wei et al., 2022), which encourage models to generate intermediate reasoning steps before arriving at a final answer. Modern LLMs use a large number of tokens, often referred to as "thinking tokens", to explore multiple problem-solving approaches, to employ self-reflection, and to perform verification. This thinking capability has allowed them to achieve superior performance on challenging tasks such as mathematical problem-solving and code generation (Ke et al., 2025).
The LLM thinking capability is typically achieved through post-training methods applied to a strong base model. The two primary approaches to instilling or improving this reasoning ability are using reinforcement learning (RL) (Guo et al., 2025; Team, 2025b) and supervised fine-tuning (Muennighoff et al., 2025; Ye et al., 2025). Guo et al. (2025) have demonstrated that as training progresses the model tends to generate longer thinking trajectories, which results in improved performance on complex tasks. Similarly, Anthropic (2025) and Muennighoff et al. (2025) have shown a correlation between increased average thinking length during inference and improved performance. We challenge this assumption, demonstrating that shorter sequences are more likely to yield an accurate answer.
Efficiency in reasoning LLMs.
While shortening the length of CoT is beneficial for non-reasoning models (Nayab et al., 2024; Kang et al., 2025), it is highly important for reasoning LLMs, as they require a very large number of tokens to perform the thinking process. As a result, recent studies have tried to make the process more efficient, e.g., by using early-exit techniques for reasoning trajectories (Pu et al., 2025; Yang et al., 2025), by suppressing backtracks (Wang et al., 2025a), or by training reasoning models that enable control over the thinking length (Yu et al., 2025).
Several recent works studied the relationship between reasoning trajectory length and correctness. Lu et al. (2025) proposed a method for reducing the length of thinking trajectories in reasoning training datasets. Their method employs a reasoning LLM several times over an existing trajectory in order to make it shorter. As this approach eventually trains a model over shorter trajectories, it is similar to the method we employ in Section 6. However, our method is simpler, as it does not require an LLM to explicitly shorten the sequence. Fatemi et al. (2025); Qi et al. (2025) and Arora and Zanette (2025) proposed RL methods to shorten reasoning in language models. Fatemi et al. (2025) also observed, by averaging lengths across examples, that correct answers typically require shorter thinking trajectories, suggesting that lengthy responses might inherently stem from RL-based optimization during training. In Section 5.1 we show that correct answers indeed usually use shorter thinking trajectories, but also highlight that averaging across all examples might obscure this effect, as easier questions require substantially fewer reasoning tokens than harder ones.
More relevant to our work, Wu et al. (2025) showed that there is an optimal thinking-length range for correct answers according to the difficulty of the question, while Wang et al. (2025b) found that for a specific question, correct responses from reasoning models are usually shorter than incorrect ones. We provide further analysis supporting these observations in Sections 3 and 5. Finally, our proposed inference method short-m@k is designed to enhance the efficiency of reasoning LLMs by leveraging this property; it can be seen as a generalization of the FFS method (Agarwal et al., 2025), which selects the shortest answer among several candidates, as in our short-1@k.
## 3 Shorter thinking is preferable
As mentioned above, the common wisdom in reasoning LLMs suggests that increased test-time computation enhances model performance. Specifically, it is widely assumed that a longer reasoning process, entailing extensive thinking chains, correlates with improved task performance (OpenAI, 2024; Anthropic, 2025; Muennighoff et al., 2025). We challenge this assumption and ask whether generating more tokens per trajectory actually leads to better performance. To that end, we generate multiple answers per question and compare performance based solely on the shortest, longest, and randomly sampled thinking chains among the generated samples.
### 3.1 Experimental details
We consider four leading, high-performing, open reasoning LLMs: Llama-3.3-Nemotron-Super-49B-v1 [LN-Super-49B; Bercovich et al., 2025], a reasoning RL-enhanced version of Llama-3.3-70B (Grattafiori et al., 2024); R1-Distill-Qwen-32B [R1-32B; Guo et al., 2025], an SFT-finetuned version of Qwen-2.5-32B-Instruct (Yang et al., 2024) derived from R1 trajectories; QwQ-32B, a reasoning RL-enhanced version of Qwen-2.5-32B-Instruct (Team, 2025b); and R1-0528, a 670B RL-trained flagship reasoning model (R1-670B; Guo et al., 2025). We also include results for smaller models in Appendix D.
We evaluate all models using four competitive reasoning benchmarks. We use AIME 2024 (of America, 2024), AIME 2025 (of America, 2025), and HMMT February 2025, from the Math Arena benchmark (Balunović et al., 2025). These three benchmarks are derived from math competitions and involve solving problems that cover a broad range of mathematics topics. Each dataset consists of $30$ examples of varied difficulty. We also evaluate the models using the GPQA-diamond benchmark [GPQA-D; Rein et al., 2024], which consists of $198$ multiple-choice scientific questions and is considered challenging for reasoning LLMs (DeepMind, 2025).
For each question, we generate $20$ responses per model, yielding a total of about $36$k generations. For all models we use a temperature of $0.7$, top-p $=0.95$, and a maximum of $32{,}768$ generated tokens. When measuring the thinking chain length, we count the tokens between the <think> and </think> tokens. We run inference for all models using paged attention via the vLLM framework (Kwon et al., 2023).
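The length measurement can be sketched as follows. This is a minimal illustration, not the authors' code: it uses whitespace splitting as a stand-in for the models' actual tokenizers, and assumes the <think>…</think> tag convention described above.

```python
def thinking_length(generation: str) -> int:
    """Count tokens between <think> and </think>.

    Whitespace tokenization is a stand-in for the real model tokenizer.
    Returns 0 if the thinking block is absent or unterminated.
    """
    start = generation.find("<think>")
    end = generation.find("</think>")
    if start == -1 or end == -1 or end < start:
        return 0
    chain = generation[start + len("<think>"):end]
    return len(chain.split())
```

In the paper's setup, unterminated thinking blocks (which hit the generation cap) are handled separately, as discussed in Section 3.2.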
### 3.2 The shorter the better
Table 1: Shorter thinking performs better. Comparison between taking the shortest/longest/random generation per example.
| | GPQA-D | | AIME 2024 | | AIME 2025 | | HMMT | | Math Average | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Tokens $\downarrow$ | Acc. $\uparrow$ | Tokens $\downarrow$ | Acc. $\uparrow$ | Tokens $\downarrow$ | Acc. $\uparrow$ | Tokens $\downarrow$ | Acc. $\uparrow$ | Tokens $\downarrow$ | Acc. $\uparrow$ |
| **LN-Super-49B** | | | | | | | | | | |
| random | 5357 | 65.1 | 11258 | 58.8 | 12105 | 51.3 | 13445 | 33.0 | 12270 | 47.7 |
| longest | 8763 $(+64\%)$ | 57.6 | 18566 | 33.3 | 18937 | 30.0 | 19790 | 23.3 | 19098 $(+56\%)$ | 28.9 |
| shortest | 2790 $(-48\%)$ | 69.1 | 6276 | 76.7 | 7036 | 66.7 | 7938 | 46.7 | 7083 $(-42\%)$ | 63.4 |
| **R1-32B** | | | | | | | | | | |
| random | 5851 | 62.5 | 9614 | 71.8 | 11558 | 56.4 | 12482 | 38.3 | 11218 | 55.5 |
| longest | 9601 $(+64\%)$ | 57.1 | 17689 | 53.3 | 19883 | 36.7 | 20126 | 23.3 | 19233 $(+71\%)$ | 37.8 |
| shortest | 3245 $(-45\%)$ | 64.7 | 4562 | 80.0 | 6253 | 63.3 | 6557 | 36.7 | 5791 $(-48\%)$ | 60.0 |
| **QwQ-32B** | | | | | | | | | | |
| random | 8532 | 63.7 | 13093 | 82.0 | 14495 | 72.3 | 16466 | 52.5 | 14685 | 68.9 |
| longest | 12881 $(+51\%)$ | 54.5 | 20059 | 70.0 | 21278 | 63.3 | 24265 | 36.7 | 21867 $(+49\%)$ | 56.7 |
| shortest | 5173 $(-39\%)$ | 64.7 | 8655 | 86.7 | 10303 | 66.7 | 11370 | 60.0 | 10109 $(-31\%)$ | 71.1 |
| **R1-670B** | | | | | | | | | | |
| random | 11843 | 76.2 | 16862 | 83.8 | 18557 | 82.5 | 21444 | 68.2 | 18954 | 78.2 |
| longest | 17963 $(+52\%)$ | 63.1 | 22603 | 70.0 | 23570 | 66.7 | 27670 | 40.0 | 24615 $(+30\%)$ | 58.9 |
| shortest | 8116 $(-31\%)$ | 75.8 | 11229 | 83.3 | 13244 | 83.3 | 13777 | 83.3 | 12750 $(-33\%)$ | 83.3 |
For all generated answers, we compare short vs. long thinking chains for the same question, along with a random chain. Results are presented in Table 1. In this section we exclude generations where thinking is not completed within the maximum generation length, as these often result in an infinite thinking loop. First, as expected, the shortest answers are $25\%$–$50\%$ shorter than randomly sampled responses. However, we also note that across almost all models and benchmarks, taking the answer with the shortest thinking chain actually boosts performance, yielding an average absolute improvement of $2.2\%$–$15.7\%$ on the math benchmarks compared to randomly selected generations. When considering the longest thinking answers among the generations, we further observe an increase in thinking chain length, with up to $75\%$ more tokens per chain. These extended reasoning trajectories substantially degrade performance, resulting in average absolute reductions of $5.4\%$–$18.8\%$ compared to random generations over all benchmarks. These trends are most noticeable when comparing the shortest generations with the longest ones, with an absolute performance gain of up to $34.5\%$ in average accuracy and a substantial drop in the number of thinking tokens.
The above results suggest that long generations might come with a significant price tag, not only in running time, but also in performance. That is, within an individual example, shorter thinking trajectories are much more likely to be correct. In Section 5.1 we examine how these results relate to the common assumption that longer trajectories lead to better LLM performance. Next, we propose strategies to leverage these findings to improve the efficiency and effectiveness of reasoning LLMs.
## 4 short-m@k : faster and better inference of reasoning LLMs
Based on the results presented in Section 3, we suggest a novel inference method for reasoning LLMs. Our method, short-m@k, leverages batch inference of LLMs per question, using multiple parallel decoding runs for the same query. We begin by introducing our method in Section 4.1. We then describe our evaluation methodology, which takes into account inference compute and running time (Section 4.2). Finally, we present our results (Section 4.3).
### 4.1 The short-m@k method
The short-m@k method, visualized in Figure 1, performs parallel decoding of $k$ generations for a given question, halting computation across all generations as soon as the $m\leq k$ shortest thinking trajectories are completed. It then conducts majority voting among those shortest answers, resolving ties by selecting the answer with the shortest thinking chain. Given that thinking trajectories can be computationally intensive, terminating all generations once the $m$ shortest trajectories are completed not only saves computational resources but also significantly reduces wall time due to the parallel decoding approach, as shown in Section 4.3.
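The aggregation step can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it receives already-finished (thinking-length, answer) pairs, whereas in real inference the $m$ shortest chains are simply the first $m$ to finish and the remaining generations are terminated early.

```python
from collections import Counter


def short_m_at_k(generations, m):
    """Aggregate one question's parallel generations with short-m@k.

    `generations` is a list of (thinking_tokens, answer) pairs.
    Majority vote over the m shortest chains; ties go to the answer
    backed by the shortest thinking chain among the tied candidates.
    """
    finished = sorted(generations, key=lambda g: g[0])[:m]
    votes = Counter(ans for _, ans in finished)
    top = max(votes.values())
    tied = {ans for ans, count in votes.items() if count == top}
    # `finished` is sorted by length, so the first tied answer wins.
    for _, ans in finished:
        if ans in tied:
            return ans
```

For $m=1$ this reduces to taking the single shortest chain (short-1@k); for $m=k$ it degenerates to plain majority voting with a shortest-chain tie-break.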
Below we focus on short-1@k and short-3@k, with short-1@k being the most efficient variant of short-m@k and short-3@k providing the best balance of performance and efficiency (see Section 4.3). Ablation studies on $m$ and other design choices are presented in Appendix C, while results for smaller models are presented in Appendix D.
### 4.2 Evaluation setup
We evaluate all methods under the same setup as described in Section 3.1. We report the averaged results across the math benchmarks, while the results for GPQA-D are presented in Appendix A. The per-benchmark results for the math benchmarks are in Appendix B. We report results using our method (short-m@k) with $m\in\{1,3\}$. We compare the proposed method to standard majority voting (majority@$k$), arguably the most common method for aggregating multiple outputs (Wang et al., 2022), which was recently adapted for reasoning LLMs (Guo et al., 2025; Abdin et al., 2025; Wang et al., 2025b). As an oracle, we consider pass@$k$ (Kulal et al., 2019; Chen et al., 2021), which measures the probability of including the correct solution within $k$ generated responses.
We benchmark the different methods with sample sizes of $k\in\{1,2,...,10\}$, assuming a standard parallel decoding setup, i.e., all samples are generated in parallel. Section 5.3 presents a sequential analysis where parallel decoding is not assumed. For the oracle (pass@$k$) approach, we use the unbiased estimator presented in Chen et al. (2021), with our $20$ generations per question ($n=20$). For the short-1@k method, we use the rank-score@$k$ metric (Hassid et al., 2024), where we sort the different generations according to thinking length. For majority@$k$ and short-m@k with $m>1$, we run over all $k$-sized subsets out of the $20$ generations per example.
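The unbiased pass@$k$ estimator of Chen et al. (2021) is $1-\binom{n-c}{k}/\binom{n}{k}$, where $n$ is the number of samples per question and $c$ the number of correct ones. A direct sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total samples per question, c: correct samples, k: budget.
    If fewer than k samples are incorrect, every k-subset contains
    at least one correct answer, so the estimate is exactly 1.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The early return avoids evaluating `comb` with a first argument smaller than `k`, which would otherwise make the ratio zero rather than the intended probability of one.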
We evaluate the different methods considering three main criteria: (a) Sample-size (i.e., $k$), where we compare methods while controlling for the number of generated samples; (b) Thinking-compute, where we measure the total number of thinking tokens used across all generations in the batch; and (c) Time-to-answer, which measures the wall time of running inference using each method. In this parallel framework, our method (short-m@k) terminates all other generations after the first $m$ decoding thinking processes terminate. Thus, the overall thinking compute is the total number of thinking tokens generated by each of the $k$ generations at that point. Similarly, the overall time is that of the $m$-th shortest generation process. Conversely, for majority@$k$, the method's design necessitates waiting for all generations to complete before proceeding. Hence, we consider the compute as the total number of thinking tokens in all generations and the running time as that of the longest thinking chain. As for the oracle approach, we terminate all thinking trajectories once the shortest correct one is finished, and consider the compute and time accordingly.
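Under the accounting above (token counts as a proxy for wall time, perfectly parallel decoding), the thinking-compute and time-to-answer budgets can be sketched as follows; this is an illustrative helper, not the authors' evaluation code.

```python
def budget(lengths, m=None):
    """Return (thinking_compute, time_to_answer) for a batch of k
    parallel chains with per-chain thinking lengths `lengths`.

    m=None models majority@k: every chain runs to completion, so
    compute is the sum of all lengths and time is the longest chain.
    Otherwise short-m@k: all chains halt once the m-th shortest
    finishes, so each chain contributes at most that many tokens.
    """
    if m is None:
        return sum(lengths), max(lengths)
    cutoff = sorted(lengths)[m - 1]  # length of the m-th shortest chain
    return sum(min(length, cutoff) for length in lengths), cutoff
```

For example, with chain lengths 100, 200, 300, and 400 tokens, majority@4 spends 1000 thinking tokens and waits 400 steps, while short-1@4 spends 400 tokens and waits only 100.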
### 4.3 Results
(a) LN-Super-49B
<details>
<summary>x3.png Details</summary>

Line chart: Accuracy vs. Sample Size ($k$), $k$ from 1 to 10, y-axis 0.55 to 0.75. All four methods start at roughly 0.54 accuracy at $k=1$. The black dotted line (triangles) rises steepest, reaching about 0.75 at $k=10$; the red line (circles) climbs nearly linearly and overtakes the cyan line (diamonds) around $k=7$, ending near 0.655 versus 0.645; the blue line (squares) plateaus near 0.61 after $k=5$.
</details>
(b) R1-32B
<details>
<summary>x4.png Details</summary>

Line chart: Accuracy vs. Sample Size ($k$), $k$ from 1 to 10, y-axis 0.66 to 0.82. All four methods start at roughly 0.665 at $k=1$ and hold the ranking black (triangles) > cyan (diamonds) > red (circles) > blue (squares) from $k=3$ onward. The black dotted line reaches about 0.825 at $k=10$; the cyan and red lines end near 0.753 and 0.748; the blue line saturates around 0.715 after $k=5$ and declines slightly thereafter.
</details>
(c) QwQ-32B
<details>
<summary>x5.png Details</summary>

Line chart: Accuracy vs. Sample Size ($k$), $k$ from 1 to 10, y-axis 0.78 to 0.90. Legend: pass@k (Oracle, black dotted triangles), majority@k (dark red circles), short-1@k (blue squares), short-3@k (cyan diamonds). All methods start at about 0.78 at $k=1$. pass@k rises steepest to roughly 0.90 at $k=10$; short-3@k tracks about 0.01 to 0.015 below it throughout; majority@k improves steadily and overtakes short-1@k between $k=5$ and $k=6$; short-1@k plateaus near 0.85 after $k=6$.
</details>
(d) R1-670B
Figure 2: Comparing different inference methods under a controlled sample size ($k$). All methods improve with larger sample sizes. Interestingly, this trend also holds for the short-m@k methods.
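The short-m@k rule compared above can be sketched as follows. This is a minimal illustration of the selection rule described in the abstract (run $k$ generations, keep the first $m$ whose thinking finishes, majority-vote their answers), not the authors' implementation; the function name, the tie-breaking choice, and the use of thinking-token counts as a proxy for finish order are assumptions made for the sketch:

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Pick an answer with the short-m@k rule: among k sampled chains,
    keep only the m whose thinking finished first, then majority-vote
    their final answers.

    `chains` is a list of (thinking_tokens, answer) pairs; fewer thinking
    tokens serves here as a proxy for finishing earlier. Ties in the vote
    are broken in favor of the earliest-finishing chain's answer
    (an assumption of this sketch).
    """
    first_done = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_done)
    best_count = max(votes.values())
    for _, answer in first_done:
        if votes[answer] == best_count:
            return answer

# Example: 5 sampled chains as (thinking tokens, extracted answer).
chains = [(1200, "42"), (800, "42"), (4000, "17"), (950, "42"), (2500, "17")]
print(short_m_at_k(chains, m=1))  # prints 42 (answer of the shortest chain)
print(short_m_at_k(chains, m=3))  # prints 42 (majority of the 3 shortest)
```

With $m=1$ the method reduces to "answer with the first chain to finish," which is what makes short-1@k so cheap; larger $m$ trades some of that speedup for the robustness of majority voting.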
<details>
<summary>x6.png Details</summary>

Line chart: Accuracy vs. Thinking Compute (thinking tokens in thousands, 0 to 120), y-axis roughly 0.45 to 0.75. All curves start together at about 0.47 accuracy near 10k tokens. The black dotted line (triangles) gains accuracy fastest per token, reaching about 0.75 at 80k; the two cyan lines (diamonds and squares) run nearly together, reaching about 0.61 by 70-80k; the red line (circles) rises most slowly, reaching about 0.61 only near 100k tokens.
</details>
(a) LN-Super-49B
<details>
<summary>x7.png Details</summary>

Line chart: Accuracy vs. Thinking Compute (thinking tokens in thousands, 0 to 120), y-axis 0.55 to 0.75. All curves start at about 0.54 near 10k tokens. The black dotted line (triangles) is the most compute-efficient, reaching about 0.75 at 80k; the red line (circles) rises nearly linearly, overtakes both cyan lines around 40-50k, and ends near 0.655 at 120k; the cyan diamond line plateaus around 0.645 by 80k, and the cyan square line around 0.61 by 40k.
</details>
(b) R1-32B
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in thinking tokens. It compares the performance of four distinct methods or models, showing how their accuracy scales with increased compute. The chart demonstrates a clear positive correlation between thinking compute and accuracy for all series, though with varying rates of improvement and saturation points.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from approximately 15 to 160, with major tick marks at 25, 50, 75, 100, 125, and 150.
* **Y-Axis (Vertical):** Labeled "Accuracy". The scale runs from 0.66 to 0.82, with major tick marks at 0.02 intervals (0.66, 0.68, 0.70, ..., 0.82).
* **Legend:** Positioned in the top-left corner of the chart area. It contains four entries, each with a colored line segment and a marker symbol:
1. **Black dotted line with upward-pointing triangle markers (β²)**
2. **Cyan solid line with diamond markers (β)**
3. **Red solid line with circle markers (β)**
4. **Cyan solid line with square markers (β )**
* **Grid:** A light gray grid is present, with vertical lines at each major x-axis tick and horizontal lines at each major y-axis tick.
### Detailed Analysis
The chart displays four data series. Below is an analysis of each, including approximate data points extracted from the visual plot.
**1. Black Dotted Line (β²)**
* **Trend:** Shows the steepest and most consistent upward slope across the entire range. It demonstrates the highest accuracy at every compute level beyond the initial point.
* **Approximate Data Points:**
* ~15k tokens: Accuracy β 0.665
* ~25k tokens: Accuracy β 0.73
* ~50k tokens: Accuracy β 0.775
* ~75k tokens: Accuracy β 0.80
* ~100k tokens: Accuracy β 0.815
* ~115k tokens: Accuracy β 0.825 (highest point on the chart)
**2. Cyan Solid Line (β)**
* **Trend:** Increases steadily but begins to show signs of diminishing returns (flattening) after approximately 75k tokens. It is the second-highest performing series.
* **Approximate Data Points:**
* ~15k tokens: Accuracy β 0.665
* ~25k tokens: Accuracy β 0.70
* ~50k tokens: Accuracy β 0.735
* ~75k tokens: Accuracy β 0.745
* ~100k tokens: Accuracy β 0.75
* ~120k tokens: Accuracy β 0.752
**3. Red Solid Line (β)**
* **Trend:** Shows a steady, nearly linear increase throughout the plotted range. Its slope is less steep than the black or cyan (β) lines but remains consistently positive.
* **Approximate Data Points:**
* ~15k tokens: Accuracy β 0.665
* ~50k tokens: Accuracy β 0.70
* ~75k tokens: Accuracy β 0.725
* ~100k tokens: Accuracy β 0.735
* ~125k tokens: Accuracy β 0.74
* ~155k tokens: Accuracy β 0.748
**4. Cyan Solid Line (β )**
* **Trend:** Increases initially but plateaus very early, showing almost no improvement after approximately 50k tokens. It has the lowest performance ceiling.
* **Approximate Data Points:**
* ~15k tokens: Accuracy β 0.665
* ~25k tokens: Accuracy β 0.70
* ~50k tokens: Accuracy β 0.715
* ~75k tokens: Accuracy β 0.718
* ~100k tokens: Accuracy β 0.715 (slight decline visible)
* ~110k tokens: Accuracy β 0.714
### Summary of Approximate Data Points
| Thinking Compute (k tokens) | Black (β²) Accuracy | Cyan (β) Accuracy | Red (β) Accuracy | Cyan (β ) Accuracy |
| :--- | :--- | :--- | :--- | :--- |
| ~15 | 0.665 | 0.665 | 0.665 | 0.665 |
| ~25 | 0.73 | 0.70 | - | 0.70 |
| ~50 | 0.775 | 0.735 | 0.70 | 0.715 |
| ~75 | 0.80 | 0.745 | 0.725 | 0.718 |
| ~100 | 0.815 | 0.75 | 0.735 | 0.715 |
| ~115 | 0.825 | - | - | - |
| ~120 | - | 0.752 | - | - |
| ~125 | - | - | 0.74 | - |
| ~155 | - | - | 0.748 | - |
### Key Observations
1. **Common Starting Point:** All four methods begin at nearly the same accuracy (~0.665) at the lowest compute level (~15k tokens).
2. **Performance Hierarchy:** A clear performance hierarchy is established quickly and maintained: Black (β²) > Cyan (β) > Red (β) > Cyan (β ).
3. **Saturation Points:** The methods exhibit different saturation behaviors. The Cyan (β ) line saturates earliest (~50k tokens). The Cyan (β) line begins to saturate around 75-100k tokens. The Black (β²) and Red (β) lines show no clear signs of saturation within the plotted range.
4. **Efficiency:** The Black (β²) method is the most compute-efficient, achieving higher accuracy with fewer tokens than any other method. For example, it reaches 0.75 accuracy at ~40k tokens, a level the Cyan (β) method only approaches at ~100k tokens.
### Interpretation
This chart illustrates the scaling law relationship between computational resources ("thinking tokens") and model performance (accuracy) for different approaches. The data suggests:
* **Investment Pays Off:** For all tested methods, allocating more compute for "thinking" leads to better accuracy, validating the core premise of scaling inference-time computation.
* **Methodological Superiority:** The method represented by the black dotted line (▲) is fundamentally more efficient and effective. It extracts more accuracy per thinking token and has a higher performance ceiling. This could indicate a superior architecture, training technique, or reasoning algorithm.
* **The Plateau Problem:** The early plateau of the Cyan (■) line indicates a method that hits a fundamental limit; throwing more compute at it yields negligible gains. This is a critical insight for resource allocation.
* **Strategic Implications:** The choice of method involves a trade-off. If maximum accuracy is the goal and compute is available, the Black (▲) method is the clear choice. If compute is constrained, the relative efficiency of each method at different budget levels (e.g., Red (●) may be preferable to Cyan (◆) at very high compute budgets due to its continued, albeit slower, improvement) becomes the key decision factor. The chart provides the empirical data needed to make that cost-benefit analysis.
</details>
(c) QwQ-32B
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Methods
### Overview
The image displays a line chart comparing the performance of four different methods or models. The chart plots "Accuracy" on the vertical axis against "Thinking Compute" (measured in thousands of thinking tokens) on the horizontal axis. The primary trend for all series is that accuracy increases with increased thinking compute, but the rate of improvement and the final accuracy achieved vary significantly between methods.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear scale ranging from approximately 0 to 180 (in thousands of tokens).
* **Major Ticks:** Labeled at 50, 100, and 150.
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale ranging from 0.78 to 0.90.
* **Major Ticks:** Labeled at 0.78, 0.80, 0.82, 0.84, 0.86, 0.88, and 0.90.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries:
1. `pass@k (Oracle)`: Represented by a black, dotted line with upward-pointing triangle markers.
2. `majority@k`: Represented by a solid, dark red (maroon) line with circular markers.
3. `short-1@k (Ours)`: Represented by a solid, light blue (cyan) line with square markers.
4. `short-3@k (Ours)`: Represented by a solid, teal (darker cyan) line with diamond markers.
* **Grid:** A light gray grid is present, with vertical lines at the major x-axis ticks and horizontal lines at the major y-axis ticks.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **pass@k (Oracle) [Black Dotted Line, Triangle Markers]:**
* **Trend:** This line shows the steepest initial ascent and achieves the highest overall accuracy. It appears to be the upper-bound or ideal performance.
* **Data Points:**
* At ~10k tokens: Accuracy ≈ 0.78
* At ~25k tokens: Accuracy ≈ 0.835
* At ~50k tokens: Accuracy ≈ 0.855
* At ~75k tokens: Accuracy ≈ 0.875
* At ~100k tokens: Accuracy ≈ 0.885
* At ~125k tokens: Accuracy ≈ 0.895
* At ~150k tokens: Accuracy ≈ 0.902 (highest point on the chart)
2. **short-3@k (Ours) [Teal Line, Diamond Markers]:**
* **Trend:** This is the second-best performing method. It follows a similar curve to `pass@k (Oracle)` but consistently below it. The gap between this line and the oracle line narrows slightly as compute increases.
* **Data Points:**
* At ~10k tokens: Accuracy ≈ 0.78
* At ~25k tokens: Accuracy ≈ 0.818
* At ~50k tokens: Accuracy ≈ 0.848
* At ~75k tokens: Accuracy ≈ 0.870
* At ~100k tokens: Accuracy ≈ 0.878
* At ~125k tokens: Accuracy ≈ 0.885
* At ~150k tokens: Accuracy ≈ 0.890
3. **short-1@k (Ours) [Light Blue Line, Square Markers]:**
* **Trend:** This method improves rapidly at very low compute but then plateaus much earlier than the others. After approximately 75k tokens, its accuracy gains become negligible, and it is overtaken by `majority@k`.
* **Data Points:**
* At ~10k tokens: Accuracy ≈ 0.78
* At ~25k tokens: Accuracy ≈ 0.818 (similar to `short-3@k` at this point)
* At ~50k tokens: Accuracy ≈ 0.838
* At ~75k tokens: Accuracy ≈ 0.845
* At ~100k tokens: Accuracy ≈ 0.848
* At ~125k tokens: Accuracy ≈ 0.848 (plateau)
* At ~150k tokens: Accuracy ≈ 0.848 (plateau)
4. **majority@k [Dark Red Line, Circle Markers]:**
* **Trend:** This method starts with the lowest accuracy at low compute but shows steady, nearly linear improvement. It surpasses the plateaued `short-1@k` method at around 110k tokens and continues to climb.
* **Data Points:**
* At ~10k tokens: Accuracy ≈ 0.78
* At ~25k tokens: Accuracy ≈ 0.795
* At ~50k tokens: Accuracy ≈ 0.815
* At ~75k tokens: Accuracy ≈ 0.828
* At ~100k tokens: Accuracy ≈ 0.838
* At ~125k tokens: Accuracy ≈ 0.854
* At ~150k tokens: Accuracy ≈ 0.860
* At ~180k tokens (estimated): Accuracy ≈ 0.870
### Key Observations
1. **Performance Hierarchy:** A clear performance hierarchy is established: `pass@k (Oracle)` > `short-3@k (Ours)` > `majority@k` > `short-1@k (Ours)` at high compute levels (>110k tokens).
2. **Diminishing Returns:** All methods show diminishing returns (the slope of the curve decreases), but the point of severe plateauing varies. `short-1@k` plateaus earliest and most dramatically.
3. **Crossover Point:** A significant crossover occurs between `majority@k` and `short-1@k` at approximately 110k thinking tokens, where `majority@k` becomes the more accurate method despite starting lower.
4. **Oracle Gap:** The gap between the best proposed method (`short-3@k`) and the oracle (`pass@k`) remains relatively constant (≈0.01-0.015 accuracy points) across most of the compute range, suggesting a consistent performance ceiling.
5. **Low-Compute Similarity:** At the lowest compute point (~10k tokens), all four methods start at nearly the same accuracy (≈0.78), indicating that with minimal "thinking," the method choice is less impactful.
### Interpretation
This chart demonstrates the trade-off between computational cost ("Thinking Compute") and performance (Accuracy) for different reasoning or generation strategies in an AI system.
* **What the data suggests:** The `pass@k (Oracle)` line likely represents an idealized upper bound, perhaps achieved by having perfect knowledge of which of `k` generated samples is correct. The proposed methods, `short-1@k` and `short-3@k`, are practical attempts to approach this oracle performance. `short-3@k` is significantly more effective than `short-1@k`, suggesting that allowing for or considering more diverse or longer "short" reasoning paths (3 vs. 1) yields better results.
* **Relationship between elements:** The `majority@k` method, which likely selects the most common answer among `k` samples, serves as a strong baseline. Its steady climb shows that simple aggregation benefits consistently from more compute. The fact that `short-3@k` outperforms it indicates that the "short" methods are doing more than just aggregation; they are likely leveraging the compute to generate higher-quality individual samples.
* **Notable Anomalies/Insights:** The early plateau of `short-1@k` is critical. It implies that this method exhausts its ability to improve with more compute relatively quickly. In contrast, `short-3@k` and `majority@k` continue to scale, making them more suitable for scenarios where high compute budgets are available. The chart argues for the efficacy of the `short-3@k` approach, as it provides the best practical performance, closest to the oracle, across a wide range of compute budgets.
</details>
(d) R1-670B
Figure 3: Comparing different inference methods under controlled thinking compute. short-1@k is highly performant in low-compute regimes. short-3@k dominates the curve compared to majority@k.
Sample-size ($k$).
We start by examining the different methods across benchmarks for a fixed sample size $k$. Results aggregated across math benchmarks are presented in Figure 2, Figure 6 in Appendix A presents GPQA-D results, and detailed per-benchmark results can be found in Appendix B. We observe that, generally, all methods improve with larger sample sizes, indicating that more generations per question enhance performance. This trend is somewhat expected for the oracle (pass@k) and majority@k methods but surprising for our method, as it means that even when a large number of generations is used, the shorter-thinking ones are more likely to be correct. The only exception is QwQ-32B (Figure 2(c)), which shows a small decline at larger sample sizes with the short-1@k method.
When comparing short-1@k to majority@k, the former outperforms at smaller sample sizes but is overtaken by the latter in three out of four models as the sample size increases. Meanwhile, the short-3@k method demonstrates superior performance, dominating across nearly all models and sample sizes. Notably, for the R1-670B model, short-3@k performs nearly on par with the oracle across all sample sizes. We next analyze how this performance advantage translates into efficiency benefits.
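The selection rule behind these comparisons can be sketched in a few lines of Python. The sampled `chains` data and the earliest-finisher tie-break are illustrative assumptions, not the paper's exact implementation:

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Simulate short-m@k: from k sampled (thinking_tokens, answer) pairs,
    keep the m chains whose thinking finishes first (the shortest ones)
    and majority-vote over their answers."""
    first_m = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_m)
    top = max(votes.values())
    # Tie-break (an assumption): prefer the earliest-finishing majority answer.
    for _, answer in first_m:
        if votes[answer] == top:
            return answer

# k = 5 hypothetical parallel generations: (thinking length, final answer)
chains = [(9_800, "42"), (4_100, "17"), (6_300, "42"), (15_200, "8"), (5_700, "42")]
print(short_m_at_k(chains, m=1))  # short-1@5 -> "17" (the single shortest chain)
print(short_m_at_k(chains, m=3))  # short-3@5 -> "42" (majority of the 3 shortest)
```

With `m=1` the method degenerates to taking the first chain to finish; `m=3` trades a little extra waiting for a majority vote among the three shortest chains.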
Thinking-compute.
The aggregated performance results for the math benchmarks, evaluated with respect to thinking compute, are presented in Figure 3 (per-benchmark results are provided in Appendix B), while the corresponding GPQA-D results are presented in Figure 7 in Appendix A. We again observe that the short-1@k method outperforms majority@k at lower compute budgets. Notably, for LN-Super-49B (Figure 3(a)), short-1@k surpasses majority@k across all compute budgets. For instance, short-1@k achieves 57% accuracy with approximately 60% of the compute budget that majority@k needs to reach the same accuracy. For the R1-32B, QwQ-32B and R1-670B models, short-1@k exceeds majority@k up to compute budgets of 45k, 60k and 100k total thinking tokens, respectively, but falls below it at larger compute budgets.
The short-3@k method yields even greater performance improvements, incurring only a modest increase in thinking compute over short-1@k. When compared to majority@k, short-3@k consistently achieves higher performance with lower thinking compute across all models and compute budgets. For example, with the QwQ-32B model (Figure 3(c)) and an average compute budget of 80k thinking tokens per example, short-3@k improves accuracy by 2% over majority@k. For the R1-670B model (Figure 3(d)), short-3@k consistently outperforms majority voting, yielding an approximately 4% improvement at an average token budget of 100k.
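The thinking compute consumed by a parallel batch can be estimated with a small helper. The halting rule (all k chains stop once the m-th finishes) follows the method description; the example lengths are hypothetical:

```python
def thinking_compute(lengths, m=None):
    """Total thinking tokens spent by k parallel generations.
    majority@k (m=None): every chain runs to completion.
    short-m@k: all chains are halted once the m-th shortest finishes,
    so each chain contributes at most that many tokens."""
    if m is None:
        return sum(lengths)
    cutoff = sorted(lengths)[m - 1]          # length of the m-th chain to finish
    return sum(min(length, cutoff) for length in lengths)

lengths = [4_100, 5_700, 6_300, 9_800, 15_200]  # hypothetical chain lengths
print(thinking_compute(lengths))        # majority@5: 41100 tokens
print(thinking_compute(lengths, m=1))   # short-1@5: 5 * 4100 = 20500 tokens
print(thinking_compute(lengths, m=3))   # short-3@5: every chain capped at 6300
```

The long tail of the length distribution (here, the 15.2k-token chain) is exactly what the cutoff removes, which is why the savings grow with $k$.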
<details>
<summary>x10.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods and k-values
### Overview
This image is a scatter plot comparing the performance of three distinct methods (differentiated by shape and color) across varying values of a parameter `k`. The plot visualizes the trade-off between prediction accuracy and computational time (measured as "Time-to-Answer" in thousands of units). Each data point is annotated with its corresponding `k` value.
### Components/Axes
* **Y-Axis:** Labeled **"Accuracy"**. The scale ranges from 0.48 to 0.62, with major gridlines at intervals of 0.02.
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale ranges from 8 to 18, with major gridlines at intervals of 2.
* **Data Series (Implicit Legend):** Three distinct series are represented by different shapes and colors:
1. **Cyan Squares:** One method.
2. **Cyan Diamonds:** A second method.
3. **Brown Circles:** A third method.
* **Annotations:** Each data point is labeled with text indicating the `k` value used for that run (e.g., "k=9").
### Detailed Analysis
**Data Points (Approximate Coordinates: Time-to-Answer, Accuracy):**
* **Cyan Squares Series:**
* `k=9`: (8.0, 0.60)
* `k=5`: (9.0, 0.58)
* `k=3`: (10.0, 0.55)
* **Trend:** As `k` increases, Accuracy increases while Time-to-Answer decreases.
* **Cyan Diamonds Series:**
* `k=9`: (10.5, 0.61)
* `k=5`: (12.0, 0.59)
* `k=3`: (15.5, 0.56)
* **Trend:** As `k` increases, Accuracy increases while Time-to-Answer decreases.
* **Brown Circles Series:**
* `k=9`: (18.5, 0.60)
* `k=5`: (17.0, 0.56)
* `k=3`: (15.5, 0.51)
* `k=1`: (12.5, 0.47)
* **Trend:** As `k` increases, both Accuracy and Time-to-Answer increase.
### Key Observations
1. **Inverse vs. Direct Relationship:** The two cyan series (Squares and Diamonds) show an **inverse relationship** between `k` and Time-to-Answer. In contrast, the brown Circles series shows a **direct relationship** where higher `k` leads to both higher accuracy and longer computation time.
2. **Performance Frontier:** The Cyan Diamond at `k=9` (10.5, 0.61) and the Cyan Square at `k=9` (8.0, 0.60) represent the highest accuracy points for their respective series and are achieved with relatively low Time-to-Answer. The Brown Circle at `k=9` achieves similar accuracy (0.60) but requires the highest Time-to-Answer (18.5).
3. **Lowest Performance Point:** The Brown Circle at `k=1` (12.5, 0.47) has the lowest accuracy on the chart.
4. **Convergence at High k:** For `k=9`, all three methods achieve high accuracy (0.60-0.61), but with vastly different time costs (8.0 to 18.5).
5. **Overlap Point:** At approximately (15.5, 0.56), a Cyan Diamond (`k=3`) and a Brown Circle (`k=5`) have nearly identical performance coordinates despite different `k` values and methods.
### Interpretation
This chart likely compares different algorithms or model configurations (the three shapes) where `k` is a key hyperparameter (e.g., number of neighbors, ensemble size, or search depth).
* **The cyan methods (Squares and Diamonds) appear to be more efficient.** They achieve their peak accuracy at lower `k` values, which, counter-intuitively for these series, also corresponds to *faster* computation times. This suggests these methods might have an optimization where increasing `k` prunes the search space or improves efficiency, leading to both better and faster results. The Diamond method generally offers a slightly better accuracy-time trade-off than the Square method for the same `k`.
* **The brown method (Circles) follows a more traditional, brute-force-like scaling pattern.** Increasing `k` improves accuracy but at a direct and significant cost in computation time. It is the least efficient method for achieving high accuracy.
* **The key takeaway is the stark difference in scaling behavior.** The choice of method fundamentally changes how the `k` parameter affects performance. For applications where speed is critical, the cyan methods (particularly Diamonds) are superior. If maximum accuracy is the only goal and time is secondary, the brown method at high `k` is competitive but inefficient. The overlap point demonstrates that different method/`k` combinations can yield equivalent performance, highlighting the importance of this type of analysis for model selection.
</details>
(a) LN-Super-49B
<details>
<summary>x11.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
This is a scatter plot comparing the performance of different methods or configurations, labeled by a parameter "k". The plot visualizes the trade-off between "Accuracy" (y-axis) and "Time-to-Answer" (x-axis). Each data point is uniquely identified by a combination of shape, color, and an explicit "k=" label.
### Components/Axes
* **Y-Axis:** Labeled "Accuracy". The scale ranges from 0.54 to 0.65, with major gridlines at intervals of 0.02 (0.54, 0.56, 0.58, 0.60, 0.62, 0.64).
* **X-Axis:** Labeled "Time-to-Answer (longest thinking in thousands)". The scale ranges from approximately 7 to 19, with major gridlines at intervals of 2 (8, 10, 12, 14, 16, 18). The parenthetical note indicates the unit is "thousands" of some time measure (e.g., milliseconds, steps).
* **Data Series & Legend:** There is no separate legend box. The data series are differentiated by marker shape and color, with each point annotated with its "k" value.
* **Cyan Diamonds:** Represent one method/configuration.
* **Blue Squares:** Represent a second method/configuration.
* **Red Circles:** Represent a third method/configuration.
* **Data Point Annotations:** Each marker has a text label directly adjacent to it specifying the "k" value (k=1, k=3, k=5, k=9).
### Detailed Analysis
**Data Points (Approximate Coordinates & Labels):**
* **Cyan Diamond Series:**
* Point 1: (x ≈ 12.0, y ≈ 0.54), labeled **k=1**. This is the lowest accuracy point on the chart.
* Point 2: (x ≈ 15.5, y ≈ 0.617), labeled **k=3**.
* Point 3: (x ≈ 11.5, y ≈ 0.633), labeled **k=5**.
* Point 4: (x ≈ 9.5, y ≈ 0.645), labeled **k=9**. This is the highest accuracy point for the cyan diamond series.
* **Blue Square Series:**
* Point 1: (x ≈ 8.5, y ≈ 0.595), labeled **k=3**.
* Point 2: (x ≈ 7.8, y ≈ 0.605), labeled **k=5**.
* Point 3: (x ≈ 7.0, y ≈ 0.61), labeled **k=9**. This series occupies the leftmost region of the chart, indicating the fastest times.
* **Red Circle Series:**
* Point 1: (x ≈ 15.5, y ≈ 0.588), labeled **k=3**.
* Point 2: (x ≈ 17.0, y ≈ 0.625), labeled **k=5**.
* Point 3: (x ≈ 19.0, y ≈ 0.65), labeled **k=9**. This is the highest accuracy point on the entire chart and also has the longest time-to-answer.
**Visual Trends:**
* **Cyan Diamonds:** Shows a general **downward trend in accuracy as time increases** from k=9 to k=1. The point at k=1 is a significant outlier in both accuracy and trend direction.
* **Blue Squares:** Shows a slight **upward trend in accuracy as time increases** from k=3 to k=9, but all points are clustered in the fast-time region.
* **Red Circles:** Shows a clear **upward trend in accuracy as time increases** from k=3 to k=9.
### Key Observations
1. **Performance Clusters:** The three marker types form distinct clusters. Blue squares are fast but mid-accuracy. Cyan diamonds span a wide range of times and accuracies. Red circles are slower but achieve the highest accuracies.
2. **k-Value Impact:** For the Red Circle and Blue Square series, higher `k` values correlate with both higher accuracy and longer time. For the Cyan Diamond series, the relationship is inverse for `k=3,5,9`, with `k=1` being an extreme outlier.
3. **Outlier:** The Cyan Diamond at **k=1** (x≈12, y≈0.54) is a major outlier. It has the lowest accuracy on the chart and breaks the trend of its own series.
4. **Peak Performance:** The highest observed accuracy (≈0.65) is achieved by the Red Circle method with **k=9**, but it requires the longest time (≈19 thousand units).
5. **Speed-Accuracy Trade-off:** The plot clearly illustrates a trade-off. The fastest methods (Blue Squares) have moderate accuracy. The most accurate method (Red Circle, k=9) is the slowest.
### Interpretation
This chart likely compares different algorithms, model configurations, or reasoning strategies (differentiated by shape/color) where `k` represents a key parameter like the number of reasoning steps, beam search width, or ensemble size.
* **The data suggests** that increasing the `k` parameter generally improves accuracy for the Red and Blue methods, at the cost of increased computation time. The Cyan method behaves differently, suggesting it might be a fundamentally different approach where lower `k` is detrimental.
* **The elements relate** by showing how each method navigates the fundamental tension between solution quality (accuracy) and computational cost (time). The spatial separation of the clusters indicates different operational profiles for each method.
* **Notable anomalies** include the Cyan k=1 point, which performs poorly, and the fact that the Cyan k=9 point is both highly accurate and relatively fast, making it a potentially interesting "sweet spot" depending on the application's constraints. The chart invites a viewer to select a method and `k` value based on whether their priority is raw speed, maximum accuracy, or a balance between the two.
</details>
(b) R1-32B
<details>
<summary>x12.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing the performance of different models or configurations, parameterized by a variable 'k'. It plots "Accuracy" on the vertical axis against "Time-to-Answer (longest thinking in thousands)" on the horizontal axis. Three distinct series are represented by different marker shapes and colors, each labeled with its corresponding 'k' value directly on the plot.
### Components/Axes
* **X-Axis:** Labeled "Time-to-Answer (longest thinking in thousands)". The scale runs from approximately 11 to 21, with major tick marks at 12, 14, 16, 18, and 20.
* **Y-Axis:** Labeled "Accuracy". The scale runs from approximately 0.67 to 0.75, with major tick marks at 0.68, 0.70, 0.72, and 0.74.
* **Data Series & Legend (Implicit):** There is no separate legend box. The series are distinguished by marker shape and color, with labels placed adjacent to each data point.
* **Cyan Diamonds:** Represent one series.
* **Cyan Squares:** Represent a second series.
* **Red (Maroon) Circles:** Represent a third series.
* **Data Point Labels:** Each marker is accompanied by a text label indicating its 'k' value (e.g., "k=9", "k=5").
### Detailed Analysis
**Data Points (Approximate Coordinates):**
* **Cyan Diamond Series:**
* `k=9`: Position ~ (13.5, 0.750). This is the highest accuracy point on the chart.
* `k=5`: Position ~ (15.5, 0.740).
* `k=3`: Position ~ (18.5, 0.722).
* `k=1`: Position ~ (15.8, 0.670). This is the lowest accuracy point on the chart.
* **Cyan Square Series:**
* `k=9`: Position ~ (11.2, 0.715).
* `k=5`: Position ~ (11.8, 0.717).
* `k=3`: Position ~ (12.8, 0.710).
* **Red Circle Series:**
* `k=9`: Position ~ (21.0, 0.744). This is the point with the highest Time-to-Answer.
* `k=5`: Position ~ (19.8, 0.724).
* `k=3`: Position ~ (18.8, 0.701).
**Trend Verification:**
* **Cyan Diamonds:** The trend is non-monotonic. Accuracy is very high for `k=9` and `k=5`, drops for `k=3`, and plummets for `k=1`. Time-to-Answer increases from `k=9` to `k=3` but is moderate for the outlier `k=1`.
* **Cyan Squares:** This series shows a relatively flat trend in accuracy (all points clustered between ~0.710 and 0.717) with a slight increase in Time-to-Answer as 'k' decreases from 9 to 3.
* **Red Circles:** This series shows a clear positive correlation. As 'k' increases from 3 to 9, both Accuracy and Time-to-Answer increase.
### Key Observations
1. **Performance Clusters:** The three marker types form distinct clusters. Cyan squares are grouped at low Time-to-Answer (~11-13) and moderate accuracy (~0.71). Cyan diamonds are spread across the middle of the Time-to-Answer range but achieve the highest accuracies. Red circles are grouped at the high end of Time-to-Answer (~19-21) with moderate to high accuracy.
2. **The `k=1` Outlier:** The cyan diamond labeled `k=1` is a significant outlier. It has the lowest accuracy by a large margin (~0.67) but a moderate Time-to-Answer (~15.8), breaking the pattern of its series.
3. **Accuracy Ceiling:** The maximum accuracy achieved is approximately 0.75 (by the cyan diamond, `k=9`). No configuration exceeds this value.
4. **Time-to-Accuracy Trade-off:** The red circle series demonstrates a clear trade-off: higher 'k' yields higher accuracy but requires significantly more time. The cyan diamond series achieves similar or better accuracy than the red circles at a lower time cost for equivalent 'k' values (e.g., compare cyan diamond `k=5` at ~15.5 time vs. red circle `k=5` at ~19.8 time).
### Interpretation
This chart likely compares different algorithms, model architectures, or prompting strategies (represented by the three marker types) across a parameter 'k' (which could be the number of reasoning steps, retrieved documents, or ensemble members).
* **The cyan diamond strategy** appears to be the most efficient for achieving peak accuracy, offering the best accuracy-to-time ratio for `k=5` and `k=9`. However, its performance collapses at `k=1`, suggesting it requires a minimum level of complexity ('k') to function effectively.
* **The cyan square strategy** is the fastest (lowest Time-to-Answer) but hits an accuracy ceiling around 0.717. It is insensitive to changes in 'k' within the tested range, indicating it may be a simpler, less scalable method.
* **The red circle strategy** scales predictably: increasing 'k' reliably improves accuracy but at a high computational cost (time). It is the most time-intensive approach for any given 'k' value.
**Underlying Message:** The data suggests there is no single "best" configuration. The optimal choice depends on the priority: for maximum speed with acceptable accuracy, the cyan square method is best. For the highest possible accuracy with a moderate time budget, the cyan diamond method with `k=9` is optimal. The red circle method may be preferable if predictable, linear scaling with 'k' is required, despite its higher time cost. The dramatic failure of the cyan diamond at `k=1` is a critical finding, indicating a potential instability or threshold effect in that method.
</details>
(c) QwQ-32B
<details>
<summary>x13.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods
### Overview
The image is a scatter plot comparing the performance of three different methods (`majority@k`, `short-1@k (Ours)`, and `short-3@k (Ours)`) across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). Each data point is labeled with its corresponding `k` value (k=1, 3, 5, 9). The plot suggests a trade-off between speed and accuracy, with the proposed methods (`short-1@k` and `short-3@k`) generally achieving higher accuracy at lower time costs compared to the baseline (`majority@k`).
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from 14 to 22, with major gridlines at intervals of 2 (14, 16, 18, 20, 22). The unit is implied to be thousands of some time measure (e.g., milliseconds, steps).
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from 0.78 to 0.88, with major gridlines at intervals of 0.02 (0.78, 0.80, 0.82, 0.84, 0.86, 0.88).
* **Legend:** Located in the bottom-right quadrant of the chart area. It defines three data series:
* **Red Circle:** `majority@k`
* **Blue Square:** `short-1@k (Ours)`
* **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Labels:** Each marker is annotated with text indicating its `k` value (e.g., "k=9").
### Detailed Analysis
**Data Series & Approximate Coordinates:**
1. **`majority@k` (Red Circles):**
* **Trend:** Both Time-to-Answer and Accuracy increase as `k` increases. The series forms a roughly linear upward slope from bottom-left to top-right.
* **Points:**
* `k=3`: Time ≈ 21.5, Accuracy ≈ 0.815
* `k=5`: Time ≈ 22.5, Accuracy ≈ 0.838
* `k=9`: Time ≈ 23.5 (estimated, beyond axis limit), Accuracy ≈ 0.865
2. **`short-1@k (Ours)` (Blue Squares):**
* **Trend:** Time-to-Answer *decreases* as `k` increases, while Accuracy *increases*. This creates a downward slope from left to right.
* **Points:**
* `k=3`: Time ≈ 16.5, Accuracy ≈ 0.830
* `k=5`: Time ≈ 15.5, Accuracy ≈ 0.845
* `k=9`: Time ≈ 14.5, Accuracy ≈ 0.850
3. **`short-3@k (Ours)` (Cyan Diamonds):**
* **Trend:** Shows a more complex pattern. Time increases from k=1 to k=3, then decreases for higher k. Accuracy peaks at k=9.
* **Points:**
* `k=1`: Time ≈ 19.0, Accuracy ≈ 0.780 (lowest accuracy on chart)
* `k=3`: Time ≈ 21.5, Accuracy ≈ 0.848
* `k=5`: Time ≈ 19.5, Accuracy ≈ 0.870
* `k=9`: Time ≈ 17.5, Accuracy ≈ 0.885 (highest accuracy on chart)
### Key Observations
1. **Performance Frontier:** The `short-3@k` method at `k=9` (cyan diamond, top-center) defines the Pareto frontier, offering the highest accuracy (~0.885) at a moderate time cost (~17.5).
2. **Efficiency of Proposed Methods:** Both `short-1@k` and `short-3@k` consistently achieve higher accuracy than `majority@k` for the same `k` value, and do so with significantly lower Time-to-Answer. For example, at `k=9`, `short-3@k` is ~0.02 more accurate and ~6 units faster than `majority@k`.
3. **Inverse Relationship for `short-1@k`:** This method uniquely shows that increasing `k` leads to both better accuracy *and* faster answers, suggesting an efficiency gain from the method's design.
4. **Outlier:** The `short-3@k` at `k=1` is a clear outlier, having the lowest accuracy by a significant margin (~0.78), indicating the method may require a minimum `k` to be effective.
### Interpretation
The data demonstrates the superiority of the proposed methods (`short-1@k` and `short-3@k`) over the `majority@k` baseline in the accuracy-speed trade-off. The core finding is that these methods can "think" more efficiently: they achieve better results (higher accuracy) while spending less computational time (lower Time-to-Answer).
* **`short-1@k`** appears optimized for speed, showing a remarkable property where scaling up `k` improves accuracy without a time penalty.
* **`short-3@k`** is optimized for peak accuracy, with its `k=9` configuration being the most accurate overall. Its non-linear time behavior suggests a more complex internal process where intermediate `k` values (like k=3) may involve more deliberation than both lower and higher `k` settings.
The chart argues that the choice of method and the `k` parameter allows for tuning a system along a spectrum from fast-and-accurate (`short-1@k`) to slower-but-most-accurate (`short-3@k` at high `k`), with both outperforming the standard majority voting approach. The "longest thinking in thousands" unit implies this is likely from a machine learning or AI reasoning context, where `k` could represent the number of reasoning steps, samples, or candidates considered.
</details>
(d) R1-670B
Figure 4: Comparing time-to-answer for different inference methods. Our methods substantially reduce time cost with no major loss in performance. Unlike majority@k, which becomes slower as $k$ grows, our methods run faster with larger $k$, as the probability of finding a short chain increases with $k$.
Time-to-answer.
Finally, the aggregated math time-to-answer results are shown in Figure 4, with GPQA-D results shown in Figure 8 and per-benchmark results in Appendix B. For readability, Figure 4 omits the oracle, and methods are compared across a subset of sample sizes. As the sample size increases, majority@k exhibits longer time-to-answer, driven by a higher probability of sampling generations with extended thinking chains and the requirement that all trajectories complete. Conversely, the short-1@k method shows reduced time-to-answer with larger sample sizes, as the probability of encountering a short answer increases. The same trend holds for short-3@k, which stops once three reasoning processes complete.
This phenomenon makes the short-1@k and short-3@k methods substantially more usable than basic majority@k. For example, with the LN-Super-49B model (Figure 4(a)) and a sample size of 5, short-1@k reduces time consumption by almost 50% while also improving performance by about 1.5% compared to majority@k. With a larger sample size of 9, the performance values are almost identical, but short-1@k is more than 55% faster.
Finally, we observe that for most models and sample sizes, short-3@k boosts performance, while for larger sample sizes it also reduces time-to-answer significantly. For example, on R1-32B (Figure 4(b)) with $k=5$, short-3@k is 33% faster than majority@k while reaching superior performance. A similar boost in both time-to-answer and performance is observed with QwQ-32B and R1-670B at sample size 9 (Figures 4(c) and 4(d)).
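The opposite scaling of time-to-answer with $k$ can be illustrated with a toy Monte-Carlo sketch. Wall time is taken as the length of the last chain that must complete, and the lognormal length distribution is a hypothetical assumption, not fit to the paper's data:

```python
import random

def expected_time(k, m=None, trials=10_000, seed=0):
    """Monte-Carlo estimate of time-to-answer for k parallel chains.
    majority@k (m=None) waits for the longest chain; short-m@k waits
    only for the m-th shortest. Lengths are drawn from a hypothetical
    lognormal distribution of thinking-token counts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lengths = sorted(rng.lognormvariate(9.0, 0.5) for _ in range(k))
        total += lengths[-1] if m is None else lengths[m - 1]
    return total / trials

for k in (3, 5, 9):
    print(f"k={k}: majority@k ~{expected_time(k):,.0f} tokens, "
          f"short-1@k ~{expected_time(k, m=1):,.0f} tokens")
```

Under this model, the expected maximum grows with $k$ while the expected minimum shrinks, reproducing the qualitative trend in Figure 4.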
## 5 Analysis
To gain deeper insight into why shorter thinking trajectories are preferable, we conduct additional analysis. We first investigate the relation between preferring shorter thinking within individual questions and the need for longer trajectories to solve more complex problems (Section 5.1). Subsequently, we analyze backtracks in thinking trajectories to better understand the characteristics of shorter trajectories (Section 5.2). Lastly, we analyze the performance of short-m@k in a sequential setting (Section 5.3). All experiments in this section use trajectories produced by our models as described in Section 3.1. For Sections 5.1 and 5.2 we exclude generations that were not completed within the generation length limit.
### 5.1 Hard questions (still) require more thinking
We split the questions into three equal-size groups according to the model's success rate. Then, we calculate the average thinking length for each of the splits, and report the average lengths for the correct and incorrect attempts per split.
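As an illustration, the split and the per-split averages can be sketched as follows. The dictionary-based layout (per-question success rates, per-attempt records) is a hypothetical representation, not the paper's actual pipeline; when the question count is not divisible by three, the remainder is folded into the hard group here.

```python
def split_by_difficulty(success_rates):
    """Partition questions into three equal-size groups (easy / medium /
    hard) by the model's per-question success rate; a higher success
    rate means an easier question."""
    ranked = sorted(success_rates, key=success_rates.get, reverse=True)
    n = len(ranked) // 3
    return {"easy": ranked[:n], "medium": ranked[n:2 * n], "hard": ranked[2 * n:]}

def avg_thinking_tokens(attempts, qids):
    """Average thinking tokens over attempts at questions in `qids`,
    reported for correct (C), incorrect (IC), and all (A) attempts.
    Each attempt is a dict with 'qid', 'tokens', and 'correct' keys."""
    sel = [a for a in attempts if a["qid"] in set(qids)]
    def mean(xs):
        return sum(xs) / len(xs) if xs else None
    return {
        "C": mean([a["tokens"] for a in sel if a["correct"]]),
        "IC": mean([a["tokens"] for a in sel if not a["correct"]]),
        "A": mean([a["tokens"] for a in sel]),
    }
```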
Table 2: Average thinking tokens for correct (C), incorrect (IC) and all (A) answers, per split by difficulty for the math benchmarks. The numbers are in thousands of tokens.
| Model | Easy (C/IC/A) | Medium (C/IC/A) | Hard (C/IC/A) |
| --- | --- | --- | --- |
| LN-Super-49B | 5.3/11.1/5.7 | 11.4/17.1/14.6 | 12.4/16.8/16.6 |
| R1-32B | 4.9/13.7/5.3 | 10.9/17.3/13.3 | 14.4/15.8/15.7 |
| QwQ-32B | 8.4/–/8.4 | 14.8/21.6/15.6 | 19.1/22.8/22.3 |
| R1-670B | 13.0/–/13.0 | 15.3/20.9/15.5 | 23.0/31.7/28.4 |

The QwQ-32B and R1-670B models correctly answered all of their easier questions in all attempts, so no incorrect-answer average is available ("–").
Tables 2 and 5 show the average thinking tokens per split for the math benchmarks and GPQA-D, respectively. We first note that, as observed in Section 3.2, within each question subset correct answers are typically shorter than incorrect ones; this holds for easier questions as well as harder ones.
Nevertheless, we also observe that models use more tokens for more challenging questions, up to a factor of $2.9$. This finding is consistent with recent studies (Anthropic, 2025; OpenAI, 2024; Muennighoff et al., 2025) indicating that longer thinking is needed to solve harder questions. To summarize, harder questions require a longer thinking process compared to easier ones, but within a single question (both easy and hard), shorter thinking is preferable.
### 5.2 Backtrack analysis
One may hypothesize that longer thinking reflects a more extensive and less efficient search path, characterized by a higher degree of backtracking and "rethinking", whereas shorter trajectories indicate a more direct and efficient path, which often leads to a more accurate answer.
To test this, we track several keywords identified as indicators of re-thinking and backtracking within different trajectories. The keywords we use are: "but", "wait", "however", "alternatively", "not sure", "going back", "backtrack", "trace back", "hmm", "hmmm". We then categorize the trajectories into correct and incorrect sets, and measure the number of backtracks and their average length (total thinking length divided by the number of keyword occurrences) for each set. We present the results for the math benchmarks and GPQA-D in Tables 3 and 6, respectively.
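A minimal sketch of this keyword-based count, assuming case-insensitive whole-word matching and whitespace tokenization for the length measure (the paper does not specify the exact matching rule):

```python
import re

# Keyword markers of re-thinking / backtracking from Section 5.2.
BACKTRACK_KEYWORDS = ["but", "wait", "however", "alternatively", "not sure",
                      "going back", "backtrack", "trace back", "hmm", "hmmm"]

def backtrack_stats(thinking, keywords=BACKTRACK_KEYWORDS):
    """Return (number of backtracks, average backtrack length) for one
    thinking trajectory, where the average length is the total thinking
    length divided by the number of keyword occurrences."""
    pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
    n_backtracks = len(re.findall(pattern, thinking.lower()))
    total_len = len(thinking.split())  # length in whitespace tokens
    avg_len = total_len / n_backtracks if n_backtracks else float("inf")
    return n_backtracks, avg_len
```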
Table 3: Average number of backtracks, and their average length for correct (C), incorrect (IC) and all (A) answers in math benchmarks.
| Model | # Backtracks (C/IC/A) | Backtrack Len. (C/IC/A) |
| --- | --- | --- |
| LN-Super-49B | 106/269/193 | 88/70/76 |
| R1-32B | 95/352/213 | 117/63/80 |
| QwQ-32B | 182/269/193 | 70/60/64 |
| R1-670B | 188/323/217 | 92/102/99 |
As our results indicate, for all models and benchmarks, correct trajectories consistently exhibit fewer backtracks than incorrect ones. Moreover, in almost all cases, the backtracks of correct answers are longer. This may suggest that correct solutions involve less backtracking, with each backtrack being longer and potentially more in-depth, leading to improved reasoning, whereas incorrect solutions explore more reasoning paths that are abandoned earlier (and hence tend to be shorter).
Lastly, we analyze the backtrack behavior in a length-controlled manner. Specifically, we divide trajectories into bins based on their length. Within each bin, we compare the number of backtracks between correct and incorrect trajectories. Our hypothesis is that even for trajectories of comparable length, correct trajectories would exhibit fewer backtracks, indicating a more direct path to the answer. The results over the math benchmarks and GPQA-D are presented in Appendix F. As can be seen, in almost all cases, even among trajectories of comparable length, correct ones show a lower number of backtracks. The only exception is the R1-670B model over the math benchmarks. This finding further suggests that correct trajectories are superior because they spend less time searching for the correct answer and instead dive deeply into a smaller set of paths.
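The length-controlled comparison can be sketched as follows, using equal-count (quantile) bins; the dict-based trajectory records and the binning scheme are our assumptions for illustration, as the paper does not specify them.

```python
def backtracks_by_length_bin(trajectories, n_bins=5):
    """Bin trajectories by thinking length and report, per
    (bin, correctness) pair, the mean number of backtracks. Each
    trajectory is a dict with 'length', 'backtracks', 'correct' keys."""
    lengths = sorted(t["length"] for t in trajectories)
    # Quantile bin edges, so every bin holds roughly the same count.
    edges = [lengths[i * len(lengths) // n_bins] for i in range(1, n_bins)]
    def bin_of(length):
        return sum(length >= e for e in edges)
    groups = {}
    for t in trajectories:
        key = (bin_of(t["length"]), t["correct"])
        groups.setdefault(key, []).append(t["backtracks"])
    return {key: sum(v) / len(v) for key, v in groups.items()}
```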
### 5.3 short-m@k with sequential compute
Our results so far assume sufficient resources for generating the outputs in parallel. We now study the potential of our proposed method without this constraint by comparing short-m@k to the baselines in a sequential (non-batched) setting and measuring the number of thinking tokens used by each method. For short-m@k, each generation is terminated once its length exceeds the maximum length observed among the $m$ shortest previously completed generations.
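The sequential variant and its token accounting can be sketched as below. Generations are again summarized as `(thinking_tokens, answer)` pairs, a hypothetical representation; charging exactly `cap` tokens for a truncated chain is an approximation of the stopping rule.

```python
from collections import Counter

def sequential_short_m_at_k(generations, m):
    """Sequential (non-batched) short-m@k: draw generations one at a
    time; once m chains have completed, cut off any later generation as
    soon as its length exceeds the longest of the m shortest completed
    chains. Returns (majority answer over the m shortest completed
    chains, total thinking tokens spent)."""
    completed = []      # (length, answer) pairs of chains that finished
    tokens_spent = 0
    for length, answer in generations:
        if len(completed) >= m:
            cap = sorted(l for l, _ in completed)[m - 1]
            if length > cap:
                tokens_spent += cap   # truncated early; answer discarded
                continue
        tokens_spent += length
        completed.append((length, answer))
    shortest_m = sorted(completed)[:m]
    votes = Counter(ans for _, ans in shortest_m)
    return votes.most_common(1)[0][0], tokens_spent
```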
The results for the math benchmarks are presented in Figure 5, while the GPQA-D results are in Appendix E. While short-m@k uses total thinking compute less efficiently than in the fully batched decoding setup (Section 4.3), its superiority over standard majority voting remains. Specifically, in low-compute regimes, both short-1@k and short-3@k demonstrate higher efficiency and improved performance compared to majority voting. In higher-compute regimes, short-3@k outperforms the majority voting baseline.
(a) LN-Super-49B
(b) R1-32B
(c) QwQ-32B
(d) R1-670B
Figure 5: Comparing different methods for the math benchmarks under sequential (non-parallel) decoding.
## 6 Finetuning using shorter trajectories
Based on our findings, we investigate whether fine-tuning on shorter reasoning chains improves LLM reasoning accuracy. While one might intuitively expect this to be the case given the insights from Sections 3 and 5, this outcome is not trivial. A potential counterargument is that training on shorter trajectories could discourage the model from performing necessary backtracks (Section 5.2), thereby hindering its ability to find a correct solution. Furthermore, the benefit of using shorter trajectories for bootstrapping reasoning remains an open question.
To investigate this, we follow the S1 paradigm, which fine-tunes an LLM to perform reasoning using only $1{,}000$ trajectories (Muennighoff et al., 2025). We create three versions of the S1 dataset, built from the examples with the shortest, longest, and randomly selected reasoning chains among several generations.
Data creation and finetuning setup.
To construct the three variants of S1, we generate multiple responses for each S1 question-answer pair. Specifically, for each example, we produce $10$ distinct answers using the QwQ-32B model, which we select for its superior performance relative to its size (Section 3). From these $10$ responses per example, we derive three dataset variants (S1-short, S1-long, and S1-random) by selecting the shortest, longest, or a random response, respectively. This results in three datasets, each containing the same $1{,}000$ queries but with distinct reasoning trajectories and answers. We then finetune the Qwen-2.5-7B-Instruct and Qwen-2.5-32B-Instruct models on the three S1 variants. We provide further details on the generation process, the finetuning setup, and the evaluation setup in Appendix G.
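The variant construction can be sketched as follows; the mapping from each question to its list of `(thinking_tokens, response)` pairs is a hypothetical data layout for illustration.

```python
import random

def build_s1_variants(responses_per_question, seed=0):
    """For each question, keep the shortest, the longest, or a uniformly
    random one of its sampled responses, yielding the S1-short, S1-long,
    and S1-random training sets."""
    rng = random.Random(seed)  # fixed seed: reproducible random variant
    s1_short, s1_long, s1_random = {}, {}, {}
    for question, responses in responses_per_question.items():
        s1_short[question] = min(responses, key=lambda r: r[0])
        s1_long[question] = max(responses, key=lambda r: r[0])
        s1_random[question] = rng.choice(responses)
    return s1_short, s1_long, s1_random
```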
Finetuning results.
Results for the 32B model are presented in Table 4 (7B model results are in Table 10). For the GPQA-D, AIME 2025, and HMMT benchmarks, the S1-short variant achieves superior performance while using fewer thinking tokens. While performance on AIME 2024 is similar across models, S1-short still yields the shortest thinking. Aggregated results across the math benchmarks reveal that S1-short improves relative performance by $2.8\%$ compared to the S1-random baseline, with a $5.8\%$ reduction in thinking tokens. Conversely, the S1-long model consumes more tokens than S1-random but obtains similar performance.
These results suggest that training on shorter reasoning sequences can lead to better reasoning models that exhibit reduced computational overhead. This observation aligns with our findings in Section 3, which show that answers with shorter thinking trajectories tend to be more accurate. We believe that developing models that reason more effectively with less computation holds significant potential.
Table 4: Results for our models finetuned on the S1 variants (S1-short/long/random) using Qwen-2.5-32B-Instruct. The S1-short model improves performance over the other two models while using fewer thinking tokens.
| Model | GPQA-D Tokens $\downarrow$ | GPQA-D Acc. $\uparrow$ | AIME 2024 Tokens $\downarrow$ | AIME 2024 Acc. $\uparrow$ | AIME 2025 Tokens $\downarrow$ | AIME 2025 Acc. $\uparrow$ | HMMT Tokens $\downarrow$ | HMMT Acc. $\uparrow$ | Math Avg. Tokens $\downarrow$ | Math Avg. Acc. $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S1-random | 11566 | 62.5 | 16145 | 68.8 | 17798 | 59.3 | 19243 | 40.8 | 17729 | 56.3 |
| S1-long | 12279 $(+6.1\%)$ | 63.7 | 16912 | 67.3 | 17973 | 58.5 | 19397 | 42.1 | 18094 $(+2.1\%)$ | 56.0 |
| S1-short | 10845 $(-6.2\%)$ | 64.8 | 15364 | 68.3 | 17195 | 60.2 | 17557 | 45.2 | 16706 $(-5.8\%)$ | 57.9 |
## 7 Conclusion
In this work, we challenged the common assumption that increased test-time computation leads to better performance in reasoning LLMs. Through empirical analysis on four complex mathematical and reasoning benchmarks, we showed that shorter reasoning chains consistently outperform longer ones, both in accuracy and in computational efficiency. Building on this insight, we introduced short-m@k, an inference method that prioritizes early-terminating generations. short-1@k, our most efficient variant, is preferable to traditional majority voting in low-compute settings. short-3@k, while slightly less efficient, outperforms majority voting across all compute budgets. We further investigated thinking trajectories and found that shorter thinking usually involves fewer backtracks and a more direct path to the solution. To further validate our findings, we fine-tuned an LLM on short reasoning trajectories and observed improved accuracy and faster runtime, whereas training on longer chains yielded diminishing returns. These findings highlight a promising direction for developing faster and more effective reasoning LLMs by embracing brevity over extended computation.
#### Acknowledgments
We thank Miri Varshavsky Hassid for the great feedback and moral support.
## References
- M. Abdin, S. Agarwal, A. Awadallah, V. Balachandran, H. Behl, L. Chen, G. de Rosa, S. Gunasekar, M. Javaheripi, N. Joshi, et al. (2025) Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318. Cited by: Β§1, Β§2, Β§4.2.
- A. Agarwal, A. Sengupta, and T. Chakraborty (2025) First finish search: efficient test-time scaling in large language models. arXiv preprint arXiv:2505.18149. Cited by: Β§2.
- Anthropic (2025) Claudeβs extended thinking. External Links: Link Cited by: Β§1, Β§1, Β§2, Β§2, Β§3, Β§5.1.
- D. Arora and A. Zanette (2025) Training language models to reason efficiently. arXiv preprint arXiv:2502.04463. Cited by: Β§2.
- M. BalunoviΔ, J. Dekoninck, I. Petrov, N. JovanoviΔ, and M. Vechev (2025) MathArena: evaluating llms on uncontaminated math competitions. SRI Lab, ETH Zurich. External Links: Link Cited by: Β§3.1.
- A. Bercovich et al. (2025) Llama-nemotron: efficient reasoning models. arXiv preprint arXiv:2505.00949. Cited by: Appendix D, Β§1, Β§2, Β§3.1.
- M. Chen et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: Β§4.2.
- G. DeepMind (2025) Gemini 2.5: our most intelligent ai model. External Links: Link Cited by: Β§2, Β§3.1.
- M. Fatemi, B. Rafiee, M. Tang, and K. Talamadupula (2025) Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185. Cited by: Β§2.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: Β§3.1.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: Appendix D, Β§1, Β§2, Β§2, Β§3.1, Β§4.2.
- M. Hassid, T. Remez, J. Gehring, R. Schwartz, and Y. Adi (2024) The larger the better? improved llm code-generation via budget reallocation. arXiv preprint arXiv:2404.00725. Cited by: Β§4.2.
- Y. Kang, X. Sun, L. Chen, and W. Zou (2025) C3ot: generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24312β24320. Cited by: Β§2.
- Z. Ke, F. Jiao, Y. Ming, X. Nguyen, A. Xu, D. X. Long, M. Li, C. Qin, P. Wang, S. Savarese, et al. (2025) A survey of frontiers in llm reasoning: inference scaling, learning to reason, and agentic systems. arXiv preprint arXiv:2504.09037. Cited by: Β§2.
- S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. S. Liang (2019) SPoC: search-based pseudocode to code. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchΓ©-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: Link Cited by: Β§4.2.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611β626. Cited by: Β§3.1.
- X. Lu, S. Han, D. Acuna, H. Kim, J. Jung, S. Prabhumoye, N. Muennighoff, M. Patwary, M. Shoeybi, B. Catanzaro, et al. (2025) Retro-search: exploring untaken paths for deeper and efficient reasoning. arXiv preprint arXiv:2504.04383. Cited by: Β§2.
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025) S1: simple test-time scaling. arXiv preprint arXiv:2501.19393. Cited by: Appendix G, §1, §1, §1, §2, §3, §5.1, §6.
- S. Nayab, G. Rossolini, M. Simoni, A. Saracino, G. Buttazzo, N. Manes, and F. Giacomelli (2024) Concise thoughts: impact of output length on llm reasoning and cost. arXiv preprint arXiv:2407.19825. Cited by: Β§2.
- Mathematical Association of America (2024) AIME 2024. External Links: Link Cited by: Β§3.1.
- Mathematical Association of America (2025) AIME 2025. External Links: Link Cited by: Β§3.1.
- OpenAI (2024) Learning to reason with llms. External Links: Link Cited by: Β§1, Β§1, Β§2, Β§3, Β§5.1.
- OpenAI (2025) OpenAI o3-mini. Note: Accessed: 2025-02-24 External Links: Link Cited by: Β§1, Β§2.
- X. Pu, M. Saxon, W. Hua, and W. Y. Wang (2025) THOUGHTTERMINATOR: benchmarking, calibrating, and mitigating overthinking in reasoning models. arXiv preprint arXiv:2504.13367. Cited by: Β§2.
- P. Qi, Z. Liu, T. Pang, C. Du, W. S. Lee, and M. Lin (2025) Optimizing anytime reasoning via budget relative policy optimization. arXiv preprint arXiv:2505.13438. Cited by: Β§2.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: Β§3.1.
- Skywork (2025) Skywork open reasoner series. Notion Blog. Note: https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680 Cited by: Β§2.
- Q. Team (2025a) Qwen3. External Links: Link Cited by: Β§2.
- Q. Team (2025b) QwQ-32b: embracing the power of reinforcement learning. External Links: Link Cited by: Β§1, Β§2, Β§2, Β§3.1.
- C. Wang, Y. Feng, D. Chen, Z. Chu, R. Krishna, and T. Zhou (2025a) Wait, we donβt need to "wait"! Removing thinking tokens improves reasoning efficiency. arXiv preprint arXiv:2506.08343. Cited by: Β§2.
- J. Wang, S. Zhu, J. Saad-Falcon, B. Athiwaratkun, Q. Wu, J. Wang, S. L. Song, C. Zhang, B. Dhingra, and J. Zou (2025b) Think deep, think fast: investigating efficiency of verifier-free inference-time-scaling methods. arXiv preprint arXiv:2504.14047. Cited by: Β§2, Β§4.2.
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022) Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. Cited by: Β§1, Β§4.2.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824β24837. Cited by: Β§2.
- Y. Wu, Y. Wang, Z. Ye, T. Du, S. Jegelka, and Y. Wang (2025) When more is less: understanding chain-of-thought length in llms. arXiv preprint arXiv:2502.07266. Cited by: Β§2.
- A. Yang et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: Β§1, Β§3.1.
- C. Yang, Q. Si, Y. Duan, Z. Zhu, C. Zhu, Z. Lin, L. Cao, and W. Wang (2025) Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895. Cited by: Β§2.
- Y. Ye, Z. Huang, Y. Xiao, E. Chern, S. Xia, and P. Liu (2025) LIMO: less is more for reasoning. arXiv preprint arXiv:2502.03387. Cited by: Β§2.
- Z. Yu, Y. Wu, Y. Zhao, A. Cohan, and X. Zhang (2025) Z1: efficient test-time scaling with code. arXiv preprint arXiv:2504.00810. Cited by: Β§2.
## Appendix A GPQA diamond results
We present below results for the GPQA-D benchmark. Figures 6, 7, and 8 show the sample-size, compute, and time-to-answer results for GPQA-D, respectively. Table 5 corresponds to Table 2 in Section 5.1. Tables 6 and 9 correspond to Tables 3 and 8 in Section 5.2, respectively.
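As a rough guide to how accuracy-vs-$k$ curves of this kind can be produced from a pool of sampled chains, the sketch below averages the correctness of each selection rule over random size-$k$ subsets per question. The subsampling estimator, data layout, and tie-breaking here are our assumptions for illustration, not the paper's evaluation code.

```python
import random
from collections import Counter

def vote(answers):
    # plain majority vote; ties broken by first occurrence
    counts = Counter(answers)
    best = max(counts.values())
    return next(a for a in answers if counts[a] == best)

def majority_rule(subset):
    return vote([ans for _, ans in subset])

def short_m_rule(m):
    # vote only among the m chains with the shortest thinking
    return lambda subset: vote([ans for _, ans in sorted(subset)[:m]])

def accuracy_at_k(data, k, rule, trials=500, seed=0):
    """Average correctness of `rule` over random size-k subsets.
    `data`: list of (gold_answer, chains), chains = [(thinking_tokens, answer)]."""
    rng = random.Random(seed)
    hits = total = 0
    for gold, chains in data:
        for _ in range(trials):
            subset = rng.sample(chains, k)
            hits += rule(subset) == gold
            total += 1
    return hits / total

# toy question where the two shortest chains are right but the majority is wrong
data = [("x", [(100, "x"), (150, "x"), (900, "y"), (950, "y"), (1000, "y")])]
print(accuracy_at_k(data, 5, majority_rule))    # 0.0: majority vote picks "y"
print(accuracy_at_k(data, 5, short_m_rule(1)))  # 1.0: shortest chain answers "x"
```

Sweeping `k` over 1..10 with each rule reproduces the shape of the curves in the figures below: majority voting improves steadily with more samples, while short-m@k trades some asymptotic accuracy for far fewer generated thinking tokens.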
<details>
<summary>x18.png Details</summary>

[Line chart: accuracy vs. sample size ($k$). All methods start at ~0.651 at $k=1$; short-3@k (cyan, diamonds) reaches the highest plateau (~0.691) by $k=7$, with majority@k (red, circles) at ~0.684 and short-1@k (light blue, squares) at ~0.683 at $k=10$.]
</details>
(a) LN-Super-49B
<details>
<summary>x19.png Details</summary>

[Line chart: accuracy vs. sample size ($k$). All methods start at ~0.620 at $k=1$; at $k=10$, short-3@k (cyan, diamonds) reaches ~0.651, majority@k (red, circles) ~0.648, and short-1@k (blue, squares) plateaus after $k=3$ at ~0.638.]
</details>
(b) R1-32B
<details>
<summary>x20.png Details</summary>

[Line chart: accuracy vs. sample size ($k$). All methods start at ~0.636 at $k=1$; at $k=10$, short-3@k (cyan, diamonds) reaches ~0.666, majority@k (red, circles) ~0.658, and short-1@k (blue, squares) peaks near $k=7$ at ~0.654 before dipping slightly.]
</details>
(c) QwQ-32B
<details>
<summary>x21.png Details</summary>

[Line chart: accuracy vs. sample size ($k$). Legend: majority@k (dark red, circles), short-1@k (light blue, squares), short-3@k (cyan, diamonds). All start at ~0.740 at $k=1$; short-3@k leads up to about $k=6$, after which majority@k overtakes it (~0.808 vs. ~0.799 at $k=10$); short-1@k plateaus at ~0.773 after $k=4$.]
</details>
(d) R1-670B
Figure 6: GPQA-D - sample size ($k$) comparison.
<details>
<summary>x22.png Details</summary>

[Line chart: accuracy vs. thinking compute (thinking tokens in thousands). All methods start at ~0.651 at ~5k tokens; short-3@k (cyan, diamonds) plateaus at ~0.691 from ~30k tokens, short-1@k (light blue, squares) reaches ~0.683 at ~35k, and majority@k (red, circles) needs ~45-50k tokens to reach ~0.684.]
</details>
(a) LN-Super-49B
<details>
<summary>x23.png Details</summary>

[Line chart: accuracy vs. thinking compute (thinking tokens in thousands). All methods start at ~0.620 at ~5k tokens; short-3@k (cyan, diamonds) plateaus near ~0.651 by ~40k tokens, majority@k (red, circles) climbs steadily to ~0.648 by ~60k, and short-1@k (light blue, squares) plateaus at ~0.637 after ~15k tokens.]
### Key Observations
1. **Common Starting Point:** All three methods begin at nearly the same accuracy (~0.620) when thinking compute is very low (~5k tokens).
2. **Divergent Scaling:** The primary difference lies in how efficiently each method converts additional compute into accuracy gains. The cyan method is the most efficient, followed by the red, with the light blue method being the least scalable.
3. **Plateau Levels:** The light blue line plateaus at a significantly lower accuracy (~0.637) compared to the cyan line (~0.651). The red line is still climbing at 60k tokens and may eventually reach or surpass the cyan line's plateau with even more compute.
4. **Crossover Point:** Around 15k-20k tokens, the cyan line definitively overtakes the light blue line, which has already begun to plateau.
### Interpretation
This chart illustrates the concept of **scaling efficiency** in AI model reasoning or "thinking" processes. The data suggests that:
* **More compute generally improves accuracy,** but with **diminishing returns.** The shape of the curves indicates that the initial tokens of "thinking" are the most valuable.
* The three lines likely represent **different algorithms, model architectures, or prompting strategies** for utilizing thinking tokens. The cyan method represents a highly optimized approach that effectively leverages additional compute for substantial gains.
* The light blue method's early plateau indicates a **fundamental limitation** in its designβit cannot effectively utilize extra thinking tokens beyond a certain point, making it inefficient for high-compute scenarios.
* The red method shows **strong, sustained scaling,** suggesting it may be a robust approach that continues to benefit from very large compute budgets, potentially making it the best choice if computational resources are abundant.
**In essence, the chart is a performance comparison showing that not all methods for using "thinking compute" are equal. The choice of method dramatically impacts the maximum achievable accuracy and the cost (in tokens) to get there.**
</details>
(b) R1-32B
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
This is a line chart plotting model accuracy against computational effort, measured in "thinking tokens." It compares the performance of three distinct models or methods, represented by three colored lines with different markers. The chart demonstrates how accuracy scales with increased compute for each approach.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from approximately 10 to 85, with major gridlines and numerical markers at 20, 40, 60, and 80.
* **Y-Axis (Vertical):** Labeled "Accuracy". The scale runs from 0.635 to 0.665, with major gridlines and numerical markers at 0.635, 0.640, 0.645, 0.650, 0.655, 0.660, and 0.665.
* **Data Series (Legend inferred from visual markers and colors):**
1. **Cyan line with diamond markers:** Positioned as the top-performing series.
2. **Cyan line with square markers:** Positioned as the middle-performing series initially, which later plateaus.
3. **Brown line with circle markers:** Positioned as the lowest-performing series initially, which shows steady growth.
### Detailed Analysis
**Series 1: Cyan Line with Diamond Markers**
* **Trend:** Shows the steepest and most consistent upward slope, indicating the strongest positive correlation between compute and accuracy. It achieves the highest overall accuracy.
* **Approximate Data Points:**
* (10, 0.636)
* (15, 0.647)
* (20, 0.650)
* (25, 0.653)
* (30, 0.658)
* (35, 0.661)
* (40, 0.663)
* (45, 0.664)
* (50, 0.665)
* (55, 0.666)
* (60, 0.667)
**Series 2: Cyan Line with Square Markers**
* **Trend:** Increases rapidly at low compute levels (10-25k tokens), then the rate of improvement slows significantly, forming a plateau between 30k and 55k tokens.
* **Approximate Data Points:**
* (10, 0.636)
* (15, 0.647)
* (20, 0.650)
* (25, 0.652)
* (30, 0.653)
* (35, 0.6535)
* (40, 0.6535)
* (45, 0.6535)
* (50, 0.653)
* (55, 0.6525)
**Series 3: Brown Line with Circle Markers**
* **Trend:** Shows a steady, approximately linear upward slope. It starts at the same point as the others but improves at a slower, more constant rate, never surpassing the cyan diamond line.
* **Approximate Data Points:**
* (10, 0.636)
* (25, 0.6475)
* (35, 0.650)
* (45, 0.6535)
* (50, 0.6545)
* (55, 0.656)
* (60, 0.656)
* (70, 0.6575)
* (80, 0.6575)
### Key Observations
1. **Common Origin:** All three models begin at the same accuracy point (~0.636) at the lowest compute level (~10k tokens).
2. **Diverging Paths:** Performance diverges immediately after the starting point. The cyan diamond model pulls ahead decisively.
3. **Plateau Effect:** The cyan square model exhibits a clear performance plateau, suggesting diminishing returns or a capacity limit beyond ~30k thinking tokens.
4. **Steady Growth:** The brown circle model demonstrates reliable, linear scaling without signs of plateauing within the observed range, but at a lower rate of return than the top model.
5. **Peak Performance:** The highest accuracy on the chart (~0.667) is achieved by the cyan diamond model at ~60k thinking tokens.
### Interpretation
This chart likely illustrates a benchmark comparing different AI reasoning or "thinking" architectures. The data suggests that the method represented by the **cyan diamond line** is the most efficient and effective, converting additional computational resources (thinking tokens) into accuracy gains more successfully than the alternatives.
The **cyan square line** may represent a model that hits a fundamental bottleneck or reaches its optimal performance threshold relatively early. Investing more compute beyond ~30k tokens yields negligible benefit for this approach.
The **brown circle line** represents a stable, predictable model that scales reliably but is less compute-efficient than the top performer. It might be a simpler or more constrained architecture.
The key takeaway is that not all "thinking" compute is equal. The architecture or method (indicated by the line/color) has a profound impact on how effectively additional computational resources can be leveraged to improve task accuracy. The chart argues for the superiority of the "cyan diamond" approach for maximizing performance when scaling compute.
</details>
(c) QwQ-32B
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Reasoning Methods
### Overview
The image is a line chart comparing the performance of three different computational reasoning methods. It plots model accuracy against the amount of "Thinking Compute" allocated, measured in thousands of thinking tokens. The chart demonstrates how accuracy scales with increased computational resources for each method.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from 0 to 120, with major tick marks at intervals of 20 (0, 20, 40, 60, 80, 100, 120).
* **Y-Axis (Vertical):** Labeled "Accuracy". The scale runs from 0.74 to 0.81, with major tick marks at intervals of 0.01 (0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.81).
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains three entries:
1. `majority@k` - Represented by a solid red line with circular markers.
2. `short-1@k (Ours)` - Represented by a solid blue line with square markers.
3. `short-3@k (Ours)` - Represented by a solid cyan (light blue) line with diamond markers.
* **Grid:** A light gray grid is present, aligning with the major tick marks on both axes.
### Detailed Analysis
**1. `majority@k` (Red Line, Circle Markers):**
* **Trend:** Shows a steady, near-linear upward trend across the entire range of compute. It has the steepest sustained slope.
* **Data Points (Approximate):**
* Starts at ~0.74 accuracy at 10k tokens.
* Crosses 0.77 accuracy at ~40k tokens.
* Crosses 0.79 accuracy at ~65k tokens.
* Crosses 0.80 accuracy at ~95k tokens.
* Ends at the highest point on the chart, ~0.808 accuracy at 120k tokens.
**2. `short-1@k (Ours)` (Blue Line, Square Markers):**
* **Trend:** Increases rapidly at low compute, then plateaus. The curve flattens significantly after ~50k tokens, showing diminishing returns.
* **Data Points (Approximate):**
* Starts at ~0.74 accuracy at 10k tokens.
* Rises sharply to ~0.762 at 20k tokens.
* Reaches ~0.771 at 40k tokens.
* Peaks and plateaus around 0.774 between 50k and 80k tokens.
* Shows a slight decline to ~0.773 at 90k tokens (the last data point for this series).
**3. `short-3@k (Ours)` (Cyan Line, Diamond Markers):**
* **Trend:** Exhibits the most rapid initial gain in accuracy, then also plateaus, but at a higher level than `short-1@k`. Its growth rate slows considerably after ~50k tokens.
* **Data Points (Approximate):**
* Starts at ~0.74 accuracy at 10k tokens.
* Rises very steeply to ~0.763 at 20k tokens.
* Reaches ~0.78 at 35k tokens.
* Crosses 0.79 accuracy at ~45k tokens.
* Plateaus near 0.799-0.80 between 80k and 100k tokens (the last data point for this series).
### Key Observations
1. **Initial Efficiency:** Both "Ours" methods (`short-1@k` and `short-3@k`) show a steeper initial slope than `majority@k`, indicating they achieve higher accuracy with very low compute budgets (below ~30k tokens).
2. **Crossover Point:** The `majority@k` line intersects and surpasses the `short-1@k` line at approximately 40k tokens. It intersects the `short-3@k` line at approximately 80k tokens.
3. **Plateau vs. Continuous Growth:** The two "Ours" methods plateau, suggesting a limit to the accuracy gains achievable by their specific approach with more compute. In contrast, `majority@k` continues to improve steadily, indicating its scaling behavior is different and potentially more robust at high compute levels.
4. **Performance Hierarchy:** At low compute (<40k tokens), the order is `short-3@k` > `short-1@k` β `majority@k`. At high compute (>80k tokens), the order becomes `majority@k` > `short-3@k` > `short-1@k`.
### Interpretation
This chart illustrates a classic trade-off in machine learning between **sample efficiency** and **scalability**.
* The methods labeled "(Ours)" (`short-1@k` and `short-3@k`) are highly **sample/compute-efficient**. They extract maximum accuracy from a limited thinking budget, making them ideal for applications where computational cost or latency is a primary constraint. `short-3@k` is clearly the more effective of the two efficient methods.
* The `majority@k` method represents a **scalable** approach. While less efficient initially, its performance continues to improve predictably with more resources. This suggests it may be a more reliable or powerful technique when computational constraints are relaxed, and maximum accuracy is the goal, regardless of cost.
* The plateauing of the "Ours" methods could indicate a fundamental limitation in their architecture or strategyβthey may be "thinking" in a way that quickly hits a ceiling of effectiveness. The continuous rise of `majority@k` suggests its "thinking" process (likely involving majority voting over multiple reasoning paths) benefits more consistently from additional computation.
* **Practical Implication:** The choice between these methods depends on the deployment context. For a real-time chatbot, an efficient method like `short-3@k` is preferable. For an offline, high-stakes analysis where accuracy is paramount, `majority@k` would be the better choice given sufficient compute resources.
</details>
(d) R1-670B
Figure 7: GPQA-D - thinking compute comparison.
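The selection rules compared throughout these figures can be sketched compactly. The following is a minimal illustration, not the authors' implementation: it assumes all `k` chains decode in parallel at the same rate, so a chain's thinking-token count proxies for its finish time, and `short-m@k` simply votes over the `m` shortest chains while `majority@k` waits for and votes over all `k`. The names `Chain`, `short_m_at_k`, and `time_to_answer` are illustrative.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Chain:
    answer: str           # final answer extracted from the generation
    thinking_tokens: int  # length of the "thinking" segment

def majority_vote(answers):
    # Most common answer; ties broken by first-seen order in Counter.
    return Counter(answers).most_common(1)[0][0]

def short_m_at_k(chains, m):
    """short-m@k: halt once the first m of k parallel chains finish
    thinking (the m shortest, under the equal-decode-rate assumption)
    and majority-vote among just those m."""
    first_finished = sorted(chains, key=lambda c: c.thinking_tokens)[:m]
    return majority_vote(c.answer for c in first_finished)

def majority_at_k(chains):
    # Baseline majority@k: wait for all k chains, then vote over all of them.
    return majority_vote(c.answer for c in chains)

def time_to_answer(chains, m=None):
    """Wall-time proxy used on the x-axes above: the longest thinking
    among the chains we wait for. short-m@k stops at the m-th shortest
    chain; majority@k must wait for the longest of all k."""
    lengths = sorted(c.thinking_tokens for c in chains)
    return lengths[(m or len(chains)) - 1]
```

Under this sketch, `short-m@k`'s time-to-answer is the `m`-th shortest chain rather than the longest of all `k`, which is the source of the wall-time savings the scatter plots below quantify.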
<details>
<summary>x26.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different k-Values
### Overview
The image is a scatter plot comparing model **Accuracy** (y-axis) against **Time-to-Answer** (x-axis) for different experimental conditions, labeled by a parameter `k`. The data is grouped into three distinct series differentiated by marker shape and color, with an additional single outlier point. There is no separate legend; labels are placed directly next to each data point.
### Components/Axes
* **Y-Axis:** Labeled **"Accuracy"**. Scale ranges from **0.650** to **0.690**, with major grid lines at intervals of 0.005.
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. Scale ranges from **3** to **8**, with major grid lines at integer intervals.
* **Data Series & Labels:** Data points are identified by shape, color, and an adjacent text label indicating the `k` value.
* **Blue Squares:** Located in the left region of the chart (Time-to-Answer ~3-4).
* **Cyan Diamonds:** Located in the central region (Time-to-Answer ~5-7).
* **Red Circles:** Located in the right region (Time-to-Answer ~7-8).
* **Cyan Star:** A single point located at the bottom-center.
### Detailed Analysis
**Data Point Extraction (Approximate Coordinates):**
| Marker Shape | Color | Label | Approx. Time-to-Answer (x) | Approx. Accuracy (y) |
| :--- | :--- | :--- | :--- | :--- |
| Square | Blue | k=9 | 3.2 | 0.682 |
| Square | Blue | k=5 | 3.6 | 0.677 |
| Square | Blue | k=3 | 4.0 | 0.673 |
| Diamond | Cyan | k=9 | 5.3 | 0.691 |
| Diamond | Cyan | k=5 | 5.5 | 0.688 |
| Star | Cyan | k=1 | 5.4 | 0.651 |
| Diamond | Cyan | k=3 | 6.8 | 0.675 |
| Circle | Red | k=3 | 6.8 | 0.671 |
| Circle | Red | k=5 | 7.3 | 0.678 |
| Circle | Red | k=9 | 7.9 | 0.683 |
**Trend Verification by Series:**
* **Blue Squares (Left Group):** The line formed by these points slopes **downward**. As Time-to-Answer increases from ~3.2 to 4.0, Accuracy decreases from ~0.682 to 0.673.
* **Cyan Diamonds (Middle Group):** The relationship is non-linear. The highest Accuracy points (k=9, k=5) are clustered at a moderate Time-to-Answer (~5.3-5.5). The k=3 point has a longer Time-to-Answer (~6.8) and lower Accuracy (~0.675).
* **Red Circles (Right Group):** The line formed by these points slopes **upward**. As Time-to-Answer increases from ~6.8 to 7.9, Accuracy increases from ~0.671 to 0.683.
* **Cyan Star (k=1):** This is a standalone point with the lowest Accuracy (~0.651) at a moderate Time-to-Answer (~5.4).
### Key Observations
1. **Performance Clusters:** The data forms three distinct clusters based on Time-to-Answer, each associated with a different marker/color.
2. **k-Value Impact:** Within each cluster (except the single k=1 point), a higher `k` value generally correlates with higher Accuracy. However, the relationship with Time-to-Answer varies by cluster.
3. **Optimal Region:** The **Cyan Diamond** series achieves the highest overall Accuracy values (peaking at ~0.691 for k=9) within a middle range of Time-to-Answer (5-6).
4. **Outlier:** The **k=1** point (Cyan Star) is a significant outlier, showing drastically lower Accuracy than all other points, despite having a Time-to-Answer similar to the high-performing Cyan Diamond k=5 and k=9 points.
5. **Trade-off Variability:** The trade-off between speed (Time-to-Answer) and performance (Accuracy) is not consistent. The Blue series shows a negative trade-off (slower = less accurate), the Red series shows a positive trade-off (slower = more accurate), and the Cyan series shows a complex, non-monotonic relationship.
### Interpretation
This chart likely visualizes the performance of different algorithms or model configurations (represented by the three color/shape groups) across a spectrum of computational budgets (Time-to-Answer). The parameter `k` is a key hyperparameter within each configuration.
The data suggests that:
* The **Cyan Diamond** configuration is the most effective for achieving high accuracy, but it requires a moderate, non-minimal thinking time. Increasing `k` within this configuration steadily improves accuracy, from ~0.675 at `k=3` to the chart-wide peak of ~0.691 at `k=9`, while slightly reducing Time-to-Answer.
* The **Blue Square** configuration is the fastest but peaks below the Cyan Diamonds in accuracy. Within the series, higher `k` yields both higher accuracy and shorter Time-to-Answer, so increasing `k` is strictly beneficial in this regime.
* The **Red Circle** configuration is the slowest but shows a clear benefit from increasing `k`, indicating it may be a more complex model that scales better with additional computation.
* The **k=1** point serves as a critical baseline, demonstrating that a minimal `k` value is insufficient for achieving competitive accuracy in the Cyan configuration, regardless of thinking time.
In essence, the chart reveals that the optimal choice of model configuration and `k` value depends heavily on the available computational budget (Time-to-Answer) and the desired accuracy threshold. There is no single "best" point; the selection involves a strategic trade-off.
</details>
(a) LN-Super-49B
<details>
<summary>x27.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing model accuracy against time-to-answer (measured in thousands of thinking steps) for different values of a parameter 'k'. The plot displays three distinct series of data points, differentiated by marker shape and color, each representing a different method or condition. The data suggests a trade-off between computational time and accuracy, with performance varying significantly across the different 'k' settings and series.
### Components/Axes
* **X-Axis:** Labeled "Time-to-Answer (longest thinking in thousands)". The scale runs from approximately 3.5 to 9.5, with major grid lines at 4, 6, and 8.
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0.620 to 0.650, with major grid lines at intervals of 0.005 (0.620, 0.625, 0.630, 0.635, 0.640, 0.645, 0.650).
* **Data Series & Legend (Inferred from markers):**
* **Cyan Diamonds:** A series where accuracy increases with 'k'.
* **Cyan Squares:** A series clustered at lower time and moderate accuracy.
* **Red Circles:** A series located at higher time and accuracy values.
* **Cyan Star:** A single data point for k=1.
* **Data Point Labels:** Each marker is annotated with its corresponding 'k' value (e.g., "k=9", "k=5", "k=3", "k=1").
### Detailed Analysis
**Data Points (Approximate Coordinates):**
* **Cyan Diamond Series:**
* k=9: (x β 5.0, y β 0.650) - Highest accuracy point on the plot.
* k=5: (x β 6.0, y β 0.647)
* k=3: (x β 7.5, y β 0.640)
* *Trend:* This series shows a clear negative correlation between Time-to-Answer and Accuracy. As 'k' increases, time decreases and accuracy increases.
* **Cyan Square Series:**
* k=9: (x β 3.8, y β 0.637)
* k=5: (x β 4.2, y β 0.636)
* k=3: (x β 4.5, y β 0.636)
* *Trend:* This series is tightly clustered. Accuracy remains relatively flat (~0.636-0.637) while Time-to-Answer increases slightly with decreasing 'k'.
* **Red Circle Series:**
* k=9: (x β 9.2, y β 0.647)
* k=5: (x β 8.5, y β 0.643)
* k=3: (x β 7.8, y β 0.636)
* *Trend:* This series shows a positive correlation. As 'k' increases, both Time-to-Answer and Accuracy increase.
* **Single Point:**
* Cyan Star, k=1: (x β 6.0, y β 0.620) - The lowest accuracy point on the plot.
### Key Observations
1. **Performance Clusters:** The three series occupy distinct regions of the plot. Cyan squares are left (fastest), cyan diamonds are central, and red circles are right (slowest).
2. **k=1 Baseline:** The k=1 point (cyan star) serves as a low-accuracy baseline, positioned at a moderate time cost.
3. **Accuracy Ceiling:** The highest achieved accuracy is approximately 0.650 (cyan diamond, k=9).
4. **Time-Accuracy Trade-off:** The relationship between time and accuracy is not uniform. For the cyan diamond series, higher accuracy comes with *lower* time cost as 'k' increases. For the red circle series, higher accuracy requires *higher* time cost as 'k' increases.
5. **Convergence at k=3:** The k=3 data points for the cyan square and red circle series have nearly identical accuracy (~0.636), but the red circle point requires about 73% more time (xβ7.8 vs xβ4.5).
### Interpretation
This chart likely compares different strategies (represented by marker shapes/colors) for a reasoning or search task where 'k' is a key hyperparameter (e.g., number of candidates, beam width, or reasoning steps).
* The **Cyan Diamond** strategy appears highly efficient: increasing 'k' improves accuracy *and* reduces computation time, suggesting it better focuses its effort or prunes ineffective paths early.
* The **Cyan Square** strategy is fast but hits an accuracy plateau quickly. Varying 'k' has minimal impact, indicating it may be a simpler or more constrained method.
* The **Red Circle** strategy is computationally expensive. Its positive correlation suggests a brute-force or expansive approach where investing more time (higher 'k') directly yields better results, but with diminishing returns in efficiency.
* The **k=1** point represents a minimal-effort baseline, confirming that some level of 'thinking' (k>1) is crucial for acceptable performance.
The data demonstrates that algorithmic choice (series) is more impactful than simply tuning 'k'. The optimal strategy (cyan diamonds) achieves top accuracy at moderate time cost, while other methods either sacrifice accuracy for speed (squares) or incur high time costs for similar gains (circles). The investigation would benefit from knowing what the different series represent to explain these divergent behaviors.
</details>
(b) R1-32B
<details>
<summary>x28.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing the performance of different configurations, labeled by a parameter 'k', across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). The plot displays ten distinct data points, each represented by a specific marker shape and color, and labeled with its corresponding 'k' value. The data suggests a comparison of different algorithms, models, or settings where 'k' is a key variable.
### Components/Axes
* **X-Axis:**
* **Label:** "Time-to-Answer (longest thinking in thousands)"
* **Scale:** Linear, ranging from approximately 6 to 12. Major tick marks are at 6, 8, 10, and 12.
* **Interpretation:** Represents the computational time or latency, measured in thousands of units (e.g., tokens, steps, or milliseconds). Higher values indicate slower performance.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, ranging from approximately 0.635 to 0.665. Major tick marks are at 0.635, 0.640, 0.645, 0.650, 0.655, 0.660, and 0.665.
* **Interpretation:** Represents a performance metric, likely a proportion or score (e.g., 66.5% accuracy). Higher values indicate better performance.
* **Data Series & Legend (Inferred from Markers):**
* There is no explicit legend box. The marker shapes and colors differentiate data series, and each point is directly labeled with its 'k' value.
* **Cyan Squares:** Three points labeled k=9, k=5, k=3.
* **Cyan Diamonds:** Three points labeled k=9, k=5, k=3.
* **Cyan Star:** One point labeled k=1.
* **Red Circles:** Three points labeled k=3, k=5, k=9.
* **Spatial Grounding:** The cyan markers are clustered on the left side of the plot (Time-to-Answer ~6-8.5). The red circles are clustered on the right side (Time-to-Answer ~10.5-12). The single cyan star is an outlier at the bottom-center.
### Detailed Analysis
**Data Point Extraction (Approximate Coordinates):**
The following table lists each data point by its inferred series (marker), label, and approximate (x, y) coordinates.
| Marker & Color | Label | Approx. Time-to-Answer (x) | Approx. Accuracy (y) | Position in Plot |
| :--- | :--- | :--- | :--- | :--- |
| Cyan Square | k=9 | 6.0 | 0.653 | Leftmost, mid-height |
| Cyan Square | k=5 | 6.5 | 0.652 | Left, slightly lower |
| Cyan Square | k=3 | 7.0 | 0.650 | Left, lower |
| Cyan Diamond | k=9 | 7.5 | 0.665 | **Highest accuracy point** |
| Cyan Diamond | k=5 | 8.5 | 0.661 | Upper-center |
| Cyan Star | k=1 | 8.5 | 0.636 | **Lowest accuracy point**, bottom-center |
| Cyan Diamond | k=3 | 10.5 | 0.653 | Center-right, mid-height |
| Red Circle | k=3 | 10.5 | 0.647 | Center-right, lower |
| Red Circle | k=5 | 11.0 | 0.654 | Right, mid-height |
| Red Circle | k=9 | 12.0 | 0.657 | **Rightmost point** |
**Trend Verification:**
* **Cyan Squares (k=9,5,3):** This series shows a **downward trend**. As Time-to-Answer increases from ~6 to 7, Accuracy decreases from ~0.653 to ~0.650.
* **Cyan Diamonds (k=9,5,3):** This series also shows a **downward trend**. As Time-to-Answer increases from ~7.5 to 10.5, Accuracy decreases from ~0.665 to ~0.653.
* **Red Circles (k=3,5,9):** This series shows a clear **upward trend**. As Time-to-Answer increases from ~10.5 to 12, Accuracy increases from ~0.647 to ~0.657.
* **Cyan Star (k=1):** This is a single, isolated point with the lowest accuracy and a moderate Time-to-Answer.
### Key Observations
1. **Performance Clusters:** The data forms two distinct clusters based on Time-to-Answer. The "Cyan" group (squares, diamonds, star) operates in a faster regime (6-10.5), while the "Red Circle" group operates in a slower regime (10.5-12).
2. **Accuracy vs. Speed Trade-off:** Within the Cyan group, higher 'k' values (e.g., k=9) tend to achieve higher accuracy but at a slightly increased time cost compared to lower 'k' values in the same series. However, the overall trend for the Cyan group is that increased time correlates with decreased accuracy.
3. **Opposing Trends:** The most striking observation is the **opposite correlation** between the two main clusters. For the Cyan group, more time is associated with *lower* accuracy. For the Red Circle group, more time is associated with *higher* accuracy.
4. **Outlier:** The k=1 point (Cyan Star) is a significant outlier, having the worst accuracy despite a mid-range time cost, suggesting this configuration is highly inefficient.
5. **Peak Performance:** The highest accuracy (~0.665) is achieved by the Cyan Diamond with k=9 at a Time-to-Answer of ~7.5. The fastest time (~6) is achieved by the Cyan Square with k=9, with respectable accuracy (~0.653).
### Interpretation
This chart likely visualizes the performance of different **inference strategies or model configurations** (denoted by 'k') for an AI system, plotting their effectiveness (Accuracy) against their computational cost (Time-to-Answer).
* **What the data suggests:** The opposing trends between the Cyan and Red groups imply they represent fundamentally different approaches. The **Cyan group** might be a family of **efficient, approximate methods** (e.g., different levels of pruning, quantization, or a retrieval-based system where 'k' is the number of retrieved documents). Here, spending more time (higher 'k') within this efficient paradigm yields diminishing or negative returns on accuracy. The **Red Circle group** could represent a **more exhaustive, precise method** (e.g., a larger model, a different algorithm, or a chain-of-thought process). For this method, investing more time (which correlates with higher 'k', perhaps meaning more reasoning steps) directly translates to better accuracy.
* **How elements relate:** The parameter 'k' is not a universal dial. Its effect on the accuracy-time trade-off is entirely dependent on which method (Cyan vs. Red) it is applied to. This highlights that optimization is context-specific.
* **Notable Implications:** The chart presents a clear decision matrix. If the priority is **maximum accuracy** regardless of speed, the Red Circle method with k=9 is a strong candidate. If the priority is **high accuracy under strict time constraints**, the Cyan Diamond with k=9 is optimal. The k=1 configuration should be avoided. The data argues against a one-size-fits-all approach to tuning 'k' and underscores the importance of selecting the right underlying strategy for the given operational constraints (speed vs. precision).
</details>
(c) QwQ-32B
<details>
<summary>x29.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Voting Methods
### Overview
The image is a scatter plot comparing the performance of three different methods or models across two metrics: Accuracy (y-axis) and Time-to-Answer (x-axis). The plot visualizes a trade-off between computational cost (time) and performance (accuracy) for different values of a parameter `k`. The data points are grouped into three distinct series, each represented by a unique marker shape and color.
### Components/Axes
* **Chart Type:** Scatter Plot
* **Y-Axis:**
* **Label:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.74 to 0.81.
* **Major Ticks:** 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80.
* **X-Axis:**
* **Label:** `Time-to-Answer (longest thinking in thousands)`
* **Scale:** Linear, ranging from approximately 9 to 17.
* **Major Ticks:** 10, 12, 14, 16.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Red Circle:** `majority@k`
* **Blue Square:** `short-1@k (Ours)`
* **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Annotations:** Each data point is labeled with its corresponding `k` value (e.g., `k=9`, `k=5`, `k=3`, `k=1`).
### Detailed Analysis
The plot contains ten distinct data points: three each for the `majority@k` and `short-1@k` series, and four for `short-3@k`. The analysis is segmented by series for clarity.
**1. Series: `majority@k` (Red Circles)**
* **Trend:** This series shows a clear positive correlation. As the Time-to-Answer increases, the Accuracy also increases. The points form a roughly linear upward trend from bottom-left to top-right.
* **Data Points (Approximate):**
* `k=3`: Accuracy β 0.771, Time-to-Answer β 14.8
* `k=5`: Accuracy β 0.790, Time-to-Answer β 15.8
* `k=9`: Accuracy β 0.805, Time-to-Answer β 16.5
**2. Series: `short-1@k (Ours)` (Blue Squares)**
* **Trend:** This series shows a slight negative or flat trend. Accuracy decreases marginally as Time-to-Answer increases. The points are clustered in the lower-left region of the plot, indicating lower time cost but also lower accuracy compared to other series.
* **Data Points (Approximate):**
* `k=3`: Accuracy β 0.769, Time-to-Answer β 10.2
* `k=5`: Accuracy β 0.773, Time-to-Answer β 9.8
* `k=9`: Accuracy β 0.774, Time-to-Answer β 9.5
**3. Series: `short-3@k (Ours)` (Cyan Diamonds)**
* **Trend:** This series shows a mixed trend. Accuracy increases steadily from `k=1` to `k=9`, while Time-to-Answer peaks at `k=3` and then decreases. The point for `k=1` is an outlier, sitting well below the rest of the series in accuracy.
* **Data Points (Approximate):**
* `k=1`: Accuracy β 0.741, Time-to-Answer β 12.5
* `k=3`: Accuracy β 0.781, Time-to-Answer β 14.8
* `k=5`: Accuracy β 0.793, Time-to-Answer β 12.3
* `k=9`: Accuracy β 0.798, Time-to-Answer β 10.8
### Key Observations
1. **Performance Hierarchy:** For a given `k` value (e.g., `k=9`), the `majority@k` method achieves the highest accuracy but requires the most time. The `short-3@k` method offers a middle ground, and the `short-1@k` method is the fastest but least accurate.
2. **Efficiency Frontier:** The `short-3@k` series, particularly at `k=9` and `k=5`, appears to form an efficiency frontier. These points offer a better accuracy-to-time ratio than the `majority@k` series, achieving near-top accuracy with significantly less time.
3. **Impact of Parameter `k`:** For the `majority@k` and `short-3@k` methods, increasing `k` generally improves accuracy. For the `short-1@k` method, increasing `k` has a negligible positive effect on accuracy while slightly reducing time.
4. **Outlier:** The `short-3@k` point for `k=1` is a clear outlier, sitting far below the other points in accuracy, suggesting that a very low `k` value is detrimental for this method.
### Interpretation
This chart demonstrates a classic trade-off in computational systems: **accuracy versus latency**. The `majority@k` method represents a "brute-force" or high-reliability approach, where investing more computational time (higher Time-to-Answer) yields better results. The proposed methods, `short-1@k` and `short-3@k`, are optimizations designed to reduce this time cost.
The data suggests that the `short-3@k` method is particularly effective. It manages to achieve accuracy levels close to the `majority@k` method (e.g., `short-3@k=9` at ~0.798 vs. `majority@k=5` at ~0.790) while using roughly two-thirds of the time (~10.8 vs. ~15.8). This indicates a more efficient inference strategy.
The relationship between `k` and performance is not uniform across methods. For the "short" methods, the benefit of increasing `k` diminishes or behaves non-linearly, implying they may be leveraging a different underlying mechanism than simple majority voting. The chart effectively argues that the authors' methods (`Ours`) provide a superior balance, enabling high-accuracy results in a more time-constrained setting.
</details>
(d) R1-670B
Figure 8: GPQA-D - time-to-answer comparison.
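One way to read the time-to-answer plots above is through their Pareto (efficiency) frontier: a configuration is efficient if no other configuration is both faster and at least as accurate. A minimal sketch of that computation, assuming hypothetical `(time, accuracy)` pairs in the ranges shown in the plots (the function name and values are illustrative, not from the paper):

```python
def pareto_frontier(points):
    """Return the (time, accuracy) points not dominated by any other point.

    A point is dominated if another point has lower-or-equal time and
    strictly higher accuracy, or strictly lower time and equal-or-higher
    accuracy.
    """
    frontier = []
    for t, a in points:
        dominated = any(
            (t2 <= t and a2 > a) or (t2 < t and a2 >= a)
            for t2, a2 in points
        )
        if not dominated:
            frontier.append((t, a))
    return sorted(frontier)

# Illustrative values in the range of the chart:
# three configurations each for majority@k, short-1@k, short-3@k.
points = [
    (16.5, 0.805), (15.8, 0.790), (14.8, 0.771),  # majority@k
    (9.5, 0.774), (9.8, 0.773), (10.2, 0.769),    # short-1@k
    (10.8, 0.798), (12.3, 0.793), (14.8, 0.781),  # short-3@k
]
print(pareto_frontier(points))
```

Under these illustrative inputs, the frontier contains one point from each method family, mirroring the qualitative reading above: the fastest configuration, a balanced one, and the most accurate one.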
Table 5: Average thinking tokens for correct (C), incorrect (IC) and all (A) answers, per split by difficulty for GPQA-D. The numbers are in thousands of tokens.
| Model | Easy (C/IC/A) | Medium (C/IC/A) | Hard (C/IC/A) |
| --- | --- | --- | --- |
| LN-Super-49B | $2.5/\text{--}/2.5$ | $\phantom{0}6.2/\phantom{0}7.8/\phantom{0}6.6$ | $\phantom{0}7.1/\phantom{0}6.9/\phantom{0}7.0$ |
| R1-32B | $3.4/\text{--}/3.4$ | $\phantom{0}6.4/\phantom{0}7.9/\phantom{0}6.8$ | $\phantom{0}8.3/\phantom{0}7.8/\phantom{0}7.9$ |
| QwQ-32B | $5.3/\text{--}/5.3$ | $\phantom{0}8.9/13.0/\phantom{0}9.7$ | $11.1/10.6/10.6$ |
| R1-670B | $8.1/\text{--}/8.1$ | $10.9/16.0/11.4$ | $17.9/17.9/17.9$ |
Table 6: Average number of backtracks, and their average length for correct (C), incorrect (IC) and all (A) answers in GPQA-D.
| Model | # Backtracks (C/IC/A) | Backtrack Len. (C/IC/A) |
| --- | --- | --- |
| LN-Super-49B | $\phantom{0}89/107/\phantom{0}94$ | $72/56/63$ |
| R1-32B | $\phantom{0}92/173/120$ | $78/48/60$ |
| QwQ-32B | $152/241/178$ | $52/41/46$ |
| R1-670B | $122/237/156$ | $83/60/69$ |
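The short-m@k procedure evaluated throughout these figures and tables can be sketched in a few lines. The following is a minimal illustrative simulation, not the paper's implementation: the `short_m_at_k` name, token counts, and answers are all hypothetical, and in practice the $k$ chains run in parallel with generation halted once the first $m$ thinking processes finish.

```python
from collections import Counter

def short_m_at_k(generations, m):
    """short-m@k: keep the m generations whose thinking finished first
    (fewest thinking tokens) and majority-vote over their answers.

    generations: list of (thinking_tokens, answer) pairs, one per
    parallel sample (k = len(generations)).
    """
    finished_first = sorted(generations, key=lambda g: g[0])[:m]
    votes = Counter(ans for _, ans in finished_first)
    # Ties break in favor of the shortest chain: Counter.most_common
    # preserves first-insertion order among equal counts.
    return votes.most_common(1)[0][0]

# Hypothetical (thinking_tokens, answer) samples for one question.
gens = [(5200, "A"), (3100, "B"), (8900, "A"), (2700, "B"), (7400, "C")]
print(short_m_at_k(gens, m=1))  # shortest chain alone
print(short_m_at_k(gens, m=3))  # majority among the 3 shortest
```

With `m=1` the method reduces to answering with the first finished chain; with `m=3` it majority-votes among the three shortest, corresponding to the short-1@k and short-3@k variants compared above.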
## Appendix B Per benchmark results
We present the per-benchmark results for each of the criteria presented in Section 4.2. The sample-size ($k$) results are presented in Figures 9, 10 and 11. The thinking-compute comparison results are presented in Figures 12, 13 and 14. The time-to-answer results per benchmark are presented in Figures 15, 16 and 17.
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size (k)
### Overview
The image displays a line chart plotting "Accuracy" on the y-axis against "Sample Size (k)" on the x-axis. It compares the performance of four distinct methods or models, represented by four lines with different colors and markers. All lines show a positive trend, with accuracy increasing as the sample size grows, but at different rates and reaching different final values.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Sample Size (k)"
* **Scale:** Linear, from 1 to 10.
* **Markers:** Integers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from approximately 0.57 to 0.84.
* **Major Gridlines:** At 0.60, 0.65, 0.70, 0.75, 0.80.
* **Legend:** Located in the top-left quadrant of the plot area, overlapping the gridlines. It contains four entries, each with a colored line segment and a marker symbol.
* **Data Series (from top to bottom at k=10):**
1. **Black, dotted line with upward-pointing triangle markers (β²).**
2. **Cyan (light blue) solid line with diamond markers (β).**
3. **Blue solid line with square markers (β ).**
4. **Red solid line with circle markers (β).**
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
* **Black Dotted Line (β²):**
* **Trend:** Starts tied with the other lines at k=1, increases very steeply until k=4, then continues to increase at a decreasing rate, remaining the highest line from k=2 onward.
* **Data Points:**
* k=1: ~0.57
* k=2: ~0.67
* k=3: ~0.71
* k=4: ~0.74
* k=5: ~0.76
* k=6: ~0.78
* k=7: ~0.80
* k=8: ~0.81
* k=9: ~0.82
* k=10: ~0.83
* **Cyan Line (β):**
* **Trend:** Starts tied for lowest at k=1, increases steadily, maintaining the second-highest position from k=3 onward.
* **Data Points:**
* k=1: ~0.57
* k=2: ~0.64
* k=3: ~0.68
* k=4: ~0.71
* k=5: ~0.73
* k=6: ~0.74
* k=7: ~0.75
* k=8: ~0.76
* k=9: ~0.77
* k=10: ~0.775
* **Blue Line (β ):**
* **Trend:** Starts tied for lowest at k=1, increases steadily but slightly below the cyan line, maintaining the third-highest position from k=3 onward.
* **Data Points:**
* k=1: ~0.57
* k=2: ~0.64
* k=3: ~0.675
* k=4: ~0.70
* k=5: ~0.715
* k=6: ~0.725
* k=7: ~0.735
* k=8: ~0.74
* k=9: ~0.745
* k=10: ~0.75
* **Red Line (β):**
* **Trend:** Starts tied for lowest at k=1, increases at the slowest rate, remaining the lowest line throughout.
* **Data Points:**
* k=1: ~0.57
* k=2: ~0.60
* k=3: ~0.62
* k=4: ~0.645
* k=5: ~0.66
* k=6: ~0.675
* k=7: ~0.69
* k=8: ~0.70
* k=9: ~0.705
* k=10: ~0.71
### Key Observations
1. **Universal Improvement:** All four methods show improved accuracy with increased sample size.
2. **Performance Hierarchy:** A clear and consistent performance ranking is established by k=3 and holds through k=10: Black (β²) > Cyan (β) > Blue (β ) > Red (β).
3. **Convergence at Start:** All methods begin at approximately the same accuracy (~0.57) when the sample size is minimal (k=1).
4. **Diminishing Returns:** The rate of accuracy improvement slows for all lines as k increases, but the effect is most pronounced for the top-performing (Black) line.
5. **Significant Spread:** By k=10, there is a substantial performance gap of approximately 0.12 accuracy units between the best (Black, ~0.83) and worst (Red, ~0.71) methods.
### Interpretation
The chart demonstrates the relationship between data availability (sample size) and model performance (accuracy) for four different approaches. The key insight is that while more data benefits all methods, the **Black (β²) method scales significantly better** than the others. It transforms from the worst performer at k=1 to the best by a wide margin at k=10, suggesting it is a more sophisticated or data-efficient algorithm that can better leverage additional samples.
The **Cyan (β) and Blue (β ) methods** show similar, moderate scaling behavior, with Cyan maintaining a slight but consistent edge. The **Red (β) method** exhibits the poorest scaling, indicating it may be a simpler baseline or a method that hits a performance ceiling more quickly.
The convergence at k=1 suggests that with extremely limited data, the choice of method may be less critical. However, as data becomes more available (k > 2), the selection of the appropriate method (Black in this case) becomes crucial for maximizing performance. The chart argues for the superiority of the Black method in data-rich scenarios within the tested range.
</details>
(a) LN-Super-49B
<details>
<summary>x31.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Multiple Methods
### Overview
The image is a line chart comparing the performance (Accuracy) of four different methods as a function of increasing Sample Size (k). The chart demonstrates how accuracy improves for each method as more samples are used, with one method consistently outperforming the others.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:** Labeled "Sample Size (k)". It has discrete integer markers from 1 to 10.
* **Y-Axis:** Labeled "Accuracy". It has numerical markers from 0.72 to 0.86, in increments of 0.02.
* **Legend:** Located in the top-left corner of the plot area. It contains four entries:
1. **Proposed Method:** Black, dotted line with upward-pointing triangle markers.
2. **Baseline A:** Cyan (light blue), solid line with diamond markers.
3. **Baseline B:** Red, solid line with circle markers.
4. **Baseline C:** Blue, solid line with square markers.
* **Grid:** A light gray grid is present, aligned with the major ticks on both axes.
### Detailed Analysis
The following data points are approximate visual estimates from the chart.
**Trend Verification & Data Points:**
1. **Proposed Method (Black, Dotted, Triangles):**
* **Trend:** Shows a strong, steady logarithmic-like increase. It starts lowest at k=1 but quickly becomes the highest-performing method and maintains the largest lead.
* **Approximate Values:**
* k=1: ~0.72
* k=2: ~0.80
* k=3: ~0.825
* k=4: ~0.838
* k=5: ~0.845
* k=6: ~0.850
* k=7: ~0.855
* k=8: ~0.858
* k=9: ~0.862
* k=10: ~0.865
2. **Baseline A (Cyan, Solid, Diamonds):**
* **Trend:** Shows a rapid initial increase that begins to plateau after k=4. It is the second-best performing method for all k > 1.
* **Approximate Values:**
* k=1: ~0.71
* k=2: ~0.76
* k=3: ~0.80
* k=4: ~0.812
* k=5: ~0.820
* k=6: ~0.826
* k=7: ~0.830
* k=8: ~0.835
* k=9: ~0.838
* k=10: ~0.840
3. **Baseline B (Red, Solid, Circles):**
* **Trend:** Shows a steady, near-linear increase. It starts tied for last place, draws level with Baseline C at k=3, and surpasses it from k=4 onward.
* **Approximate Values:**
* k=1: ~0.71
* k=2: ~0.74
* k=3: ~0.775
* k=4: ~0.795
* k=5: ~0.808
* k=6: ~0.816
* k=7: ~0.822
* k=8: ~0.825
* k=9: ~0.828
* k=10: ~0.830
4. **Baseline C (Blue, Solid, Squares):**
* **Trend:** Shows an initial increase that plateaus most significantly after k=3. It is the lowest-performing method for k > 2.
* **Approximate Values:**
* k=1: ~0.71
* k=2: ~0.76
* k=3: ~0.775
* k=4: ~0.784
* k=5: ~0.788
* k=6: ~0.792
* k=7: ~0.794
* k=8: ~0.796
* k=9: ~0.798
* k=10: ~0.800
### Key Observations
* **Performance Hierarchy:** A clear and consistent performance hierarchy is established by k=3 and maintained thereafter: Proposed Method > Baseline A > Baseline B > Baseline C.
* **Convergence at Start:** All four methods begin at nearly the same accuracy point (~0.71-0.72) when Sample Size (k) is 1.
* **Diminishing Returns:** All methods show diminishing returns (the slope of the line decreases) as k increases, but the degree varies. Baseline C shows the most severe plateauing.
* **Sustained Gap:** The performance gap between the Proposed Method and the best baseline (Baseline A) opens by k=2 and holds at roughly 0.025 accuracy units through k=10.
### Interpretation
This chart presents a classic comparative analysis in machine learning or statistical modeling, evaluating how different algorithms or techniques scale with increased data (sample size).
* **What the data suggests:** The "Proposed Method" is demonstrably superior, not only in final accuracy but also in its learning efficiency. It extracts more performance gain per additional sample, especially in the early stages (k=1 to k=3). The baselines, particularly Baseline C, appear to hit a performance ceiling much earlier.
* **Relationship between elements:** The x-axis (Sample Size) is the independent variable, representing resource investment (data collection). The y-axis (Accuracy) is the dependent variable, representing the return on that investment. The lines model the "learning curve" for each method.
* **Notable Anomalies/Patterns:** The most striking pattern is the immediate and sustained divergence of the Proposed Method from the pack after just one additional sample (k=2). This suggests a fundamental architectural or algorithmic advantage. The near-identical starting point for all methods is also notable, indicating they share a common baseline performance with minimal data, but their capacity to leverage additional data differs radically.
* **Practical Implication:** For applications where data is scarce (low k), the choice of method may be less critical. However, as more data becomes available, selecting the Proposed Method becomes increasingly important to maximize performance. The chart argues strongly for the adoption of the Proposed Method in data-rich environments.
</details>
(b) R1-32B
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size (k)
### Overview
The image is a line chart comparing the performance (Accuracy) of four different methods or models as a function of increasing Sample Size (k). All four series begin at the same accuracy point for k=1 and show improvement as k increases, but they diverge significantly in their rate of improvement and final accuracy.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Sample Size (k)". It has discrete integer markers from 1 to 10.
* **Y-Axis (Vertical):** Labeled "Accuracy". It has numerical markers from 0.80 to 0.90, in increments of 0.02.
* **Legend:** No explicit legend is present within the chart area. The four data series are distinguished solely by line color, line style, and marker shape.
* **Grid:** A light gray grid is present, with vertical lines at each integer k value and horizontal lines at each 0.02 accuracy increment.
### Detailed Analysis
The chart contains four distinct data series. Their visual trends and approximate data points are as follows:
1. **Black Dotted Line with Upward-Pointing Triangle Markers:**
* **Trend:** This line shows the steepest and most consistent upward slope, indicating the highest rate of improvement with sample size. It is the top-performing series for all k > 1.
* **Data Points (Approximate):**
* k=1: 0.80
* k=2: ~0.845
* k=3: ~0.860
* k=4: ~0.872
* k=5: ~0.880
* k=6: ~0.886
* k=7: ~0.892
* k=8: ~0.898
* k=9: ~0.903
* k=10: ~0.908
2. **Cyan Solid Line with Diamond Markers:**
* **Trend:** This line shows a strong, steady upward slope, second only to the black line. It maintains a clear gap above the blue and red lines.
* **Data Points (Approximate):**
* k=1: 0.80
* k=2: ~0.829
* k=3: ~0.846
* k=4: ~0.854
* k=5: ~0.860
* k=6: ~0.865
* k=7: ~0.869
* k=8: ~0.873
* k=9: ~0.876
* k=10: ~0.879
3. **Blue Solid Line with Square Markers:**
* **Trend:** This line shows a moderate upward slope. Its rate of improvement is slower than the cyan line but faster than the red line.
* **Data Points (Approximate):**
* k=1: 0.80
* k=2: ~0.829 (appears to overlap with cyan at this point)
* k=3: ~0.839
* k=4: ~0.846
* k=5: ~0.850
* k=6: ~0.854
* k=7: ~0.857
* k=8: ~0.859
* k=9: ~0.861
* k=10: ~0.862
4. **Red Solid Line with Circle Markers:**
* **Trend:** This line shows the shallowest upward slope, indicating the slowest rate of improvement with sample size. It is the lowest-performing series for all k > 1.
* **Data Points (Approximate):**
* k=1: 0.80
* k=2: ~0.812
* k=3: ~0.825
* k=4: ~0.836
* k=5: ~0.840
* k=6: ~0.844
* k=7: ~0.846
* k=8: ~0.849
* k=9: ~0.851
* k=10: ~0.852
### Key Observations
* **Common Origin:** All four methods start at an identical accuracy of approximately 0.80 when the sample size (k) is 1.
* **Divergence:** The performance gap between the methods widens consistently as the sample size increases. The hierarchy (Black > Cyan > Blue > Red) is established by k=3 and maintained thereafter.
* **Diminishing Returns:** All curves show signs of diminishing returns; the slope (rate of accuracy gain) decreases as k increases, particularly for the blue and red lines.
* **Performance Gap:** At k=10, the difference between the best (Black, ~0.908) and worst (Red, ~0.852) performing methods is approximately 0.056 accuracy points.
### Interpretation
This chart demonstrates a classic learning curve analysis, comparing how different algorithms or models benefit from increased data (sample size). The key takeaway is that while all methods improve with more data, their **data efficiency** and **learning capacity** differ markedly.
* The method represented by the **black dotted line** is not only the most accurate but also the most data-efficient, showing the largest gains per additional sample. It suggests a model with high capacity that effectively leverages more information.
* The **red line** represents a method with lower capacity or a less effective learning algorithm for this task, as it plateaus more quickly and achieves the lowest final accuracy.
* The **cyan and blue lines** represent intermediate performers. The cyan method consistently outperforms the blue one, indicating a superior approach, though both are clearly outpaced by the black method.
**Notable Anomaly/Consideration:** The critical missing information is the **identity of the four methods**. Without a legend, the chart shows a clear performance ranking but cannot attribute it to specific algorithms (e.g., "Model A vs. Model B"). The analysis is therefore purely comparative based on visual trends. The overlapping point for the cyan and blue lines at k=2 is also a precise detail worth noting.
</details>
(c) QwQ-32B
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Different Methods
### Overview
This image is a line chart comparing the performance (accuracy) of four different methods as the sample size (k) increases. The chart demonstrates how each method's accuracy changes with more samples, showing distinct trends and convergence patterns.
### Components/Axes
- **Chart Type:** Line chart with markers.
- **X-Axis:** Labeled "Sample Size (k)". It has linear scale with integer markers from 1 to 10.
- **Y-Axis:** Labeled "Accuracy". It has a linear scale ranging from 0.84 to approximately 0.93, with major gridlines at intervals of 0.02 (0.84, 0.86, 0.88, 0.90, 0.92).
- **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries:
1. `pass@k (Oracle)`: Black dotted line with upward-pointing triangle markers.
2. `majority@k`: Dark red solid line with circle markers.
3. `short-1@k (Ours)`: Blue solid line with square markers.
4. `short-3@k (Ours)`: Cyan solid line with diamond markers.
- **Grid:** A light gray grid is present for both x and y axes.
### Detailed Analysis
**Trend Verification & Data Points (Approximate Values):**
1. **pass@k (Oracle) [Black dotted line, triangles]:**
* **Trend:** Shows a strong, steady upward logarithmic-like curve. It is the top-performing method for all k > 1.
* **Data Points:**
* k=1: ~0.840
* k=2: ~0.880
* k=3: ~0.898
* k=4: ~0.910
* k=5: ~0.918
* k=6: ~0.923
* k=7: ~0.927
* k=8: ~0.930
* k=9: ~0.932
* k=10: ~0.933
2. **majority@k [Dark red solid line, circles]:**
* **Trend:** Shows a steady, nearly linear upward trend. It starts at the same point as others but improves at a slower, constant rate.
* **Data Points:**
* k=1: ~0.840
* k=2: ~0.864
* k=3: ~0.875
* k=4: ~0.885
* k=5: ~0.895
* k=6: ~0.905
* k=7: ~0.913
* k=8: ~0.919
* k=9: ~0.922
* k=10: ~0.924
3. **short-1@k (Ours) [Blue solid line, squares]:**
* **Trend:** Increases initially, peaks around k=5-6, and then shows a clear downward trend for k > 6. This is the only method that degrades with larger sample sizes.
* **Data Points:**
* k=1: ~0.840
* k=2: ~0.864
* k=3: ~0.874
* k=4: ~0.879
* k=5: ~0.881
* k=6: ~0.881
* k=7: ~0.880
* k=8: ~0.877
* k=9: ~0.874
* k=10: ~0.870
4. **short-3@k (Ours) [Cyan solid line, diamonds]:**
* **Trend:** Shows a rapid initial increase, then plateaus, closely following but remaining slightly below the `pass@k (Oracle)` line. It converges with the oracle method at higher k.
* **Data Points:**
* k=1: ~0.840
* k=2: ~0.864
* k=3: ~0.894
* k=4: ~0.906
* k=5: ~0.913
* k=6: ~0.917
* k=7: ~0.920
* k=8: ~0.922
* k=9: ~0.923
* k=10: ~0.923
### Key Observations
1. **Common Starting Point:** All four methods begin at the same accuracy (~0.840) when the sample size k=1.
2. **Performance Hierarchy:** For k > 1, the order from highest to lowest accuracy is consistently: `pass@k (Oracle)` > `short-3@k (Ours)` > `majority@k`, with `short-1@k (Ours)` dropping below `majority@k` from roughly k=4 onward.
3. **Diverging Trends:** The `short-1@k` method is an outlier, as its performance peaks and then declines, while all other methods show continuous improvement.
4. **Convergence:** The `short-3@k (Ours)` method nearly matches the performance of the `pass@k (Oracle)` baseline at higher sample sizes (k >= 8), with the gap becoming very small (~0.01 difference at k=10).
5. **Linear vs. Curved Growth:** `majority@k` exhibits linear growth, while `pass@k` and `short-3@k` show curved, diminishing-returns growth.
### Interpretation
This chart likely evaluates methods for improving the accuracy of a system (e.g., a code generation or question-answering model) by using multiple samples (k). The `pass@k (Oracle)` represents an ideal upper-bound performance.
The key insight is that the proposed method `short-3@k (Ours)` is highly effective, achieving near-oracle performance with a sample size of 10, significantly outperforming the standard `majority@k` voting approach. This suggests that the "short-3" strategy is a robust way to leverage multiple samples.
The anomalous behavior of `short-1@k (Ours)` is critical. Its performance degradation after k=6 indicates that this particular strategy may introduce noise or overfit to a subset of samples when given too many options, making it unsuitable for large k. The contrast between `short-1@k` and `short-3@k` highlights that the specific design of the sampling or selection strategy ("short-1" vs. "short-3") is crucial for success.
In summary, the data demonstrates that with the right strategy (`short-3@k`), one can approach oracle-level accuracy using a moderate number of samples, offering a practical improvement over simple majority voting. The failure mode of `short-1@k` serves as an important cautionary result.
</details>
(d) R1-670B
Figure 9: AIME 2024 - sample size ($k$) comparison.
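For reference, the two baselines appearing in the legends of these plots, the pass@k oracle and majority@k, are typically computed per question from $k$ sampled answers as follows. This is a minimal illustrative sketch; the function names and data are hypothetical, not the paper's evaluation code:

```python
from collections import Counter

def pass_at_k(answers, gold):
    """Oracle pass@k: correct if any of the k sampled answers is right."""
    return any(a == gold for a in answers)

def majority_at_k(answers, gold):
    """majority@k: correct if the most frequent sampled answer is right."""
    return Counter(answers).most_common(1)[0][0] == gold

# Hypothetical k=5 sampled answers for one question with gold answer "42".
samples = ["42", "41", "42", "7", "42"]
print(pass_at_k(samples, "42"))      # True
print(majority_at_k(samples, "42"))  # True
```

pass@k is an upper bound because it assumes an oracle that always picks a correct sample when one exists; majority@k is what is achievable without knowing the gold answer.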
<details>
<summary>x34.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size (k)
### Overview
The image is a line chart comparing the performance (Accuracy) of four different methods or models as a function of increasing sample size (k). The chart demonstrates how accuracy improves for each method as more samples are used, with one method (black dotted line) consistently outperforming the others.
### Components/Axes
* **X-Axis:** Labeled "Sample Size (k)". It is a linear scale with major tick marks and labels at integer values from 1 to 10.
* **Y-Axis:** Labeled "Accuracy". It is a linear scale with major tick marks and labels at 0.05 intervals, ranging from 0.50 to 0.75. Grid lines extend horizontally from these major ticks.
* **Legend:** Positioned in the top-left corner of the chart area. It contains four entries, each associating a line style/color with a label:
1. A black dotted line with upward-pointing triangle markers (β²).
2. A solid red line with circle markers (β).
3. A solid cyan line with diamond markers (β).
4. A solid cyan line with square markers (β ).
* **Grid:** A light gray grid is present, with both horizontal and vertical lines aligning with the major axis ticks.
### Detailed Analysis
The chart plots four data series. Below is an analysis of each, including approximate data points extracted by visual inspection. Values are approximate (Β±0.005).
**1. Black Dotted Line (β²)**
* **Trend:** Shows the steepest initial increase and maintains the highest accuracy throughout. The slope is very steep from k=1 to k=4, then becomes more gradual but continues to rise steadily.
* **Approximate Data Points:**
* k=1: 0.51
* k=2: 0.63
* k=3: 0.69
* k=4: 0.72
* k=5: 0.74
* k=6: 0.755
* k=7: 0.765
* k=8: 0.77
* k=9: 0.775
* k=10: 0.78
**2. Red Solid Line (β)**
* **Trend:** Starts as the lowest-performing method at k=1. It shows a steady, nearly linear increase, eventually crossing above both cyan lines between k=5 and k=6.
* **Approximate Data Points:**
* k=1: 0.51
* k=2: 0.54
* k=3: 0.56
* k=4: 0.60
* k=5: 0.625
* k=6: 0.64
* k=7: 0.65
* k=8: 0.655
* k=9: 0.66
* k=10: 0.665
**3. Cyan Solid Line (β)**
* **Trend:** Starts tied for lowest at k=1 but rises quickly, outperforming the red line until k=5. After k=6, its growth rate slows significantly, showing diminishing returns.
* **Approximate Data Points:**
* k=1: 0.51
* k=2: 0.57
* k=3: 0.60
* k=4: 0.625
* k=5: 0.635
* k=6: 0.645
* k=7: 0.65
* k=8: 0.652
* k=9: 0.653
* k=10: 0.655
**4. Cyan Solid Line (β )**
* **Trend:** Follows a very similar trajectory to the cyan diamond (β) line but consistently performs slightly worse after k=2. It also exhibits strong diminishing returns after k=5.
* **Approximate Data Points:**
* k=1: 0.51
* k=2: 0.57
* k=3: 0.595
* k=4: 0.61
* k=5: 0.62
* k=6: 0.625
* k=7: 0.628
* k=8: 0.63
* k=9: 0.63
* k=10: 0.63
### Key Observations
1. **Dominant Performance:** The method represented by the black dotted line (β²) is clearly superior, achieving ~0.78 accuracy at k=10, which is over 0.11 points higher than the next best method.
2. **Crossover Event:** The red line (β) starts poorly but demonstrates the most consistent growth rate, eventually surpassing both cyan lines. This suggests it may benefit more from additional data in the long run.
3. **Diminishing Returns:** Both cyan lines (β and β ) show a pronounced "knee" in their curves around k=4 or k=5, after which adding more samples yields very small improvements in accuracy.
4. **Convergence at Start:** All four methods begin at approximately the same accuracy (~0.51) when the sample size is minimal (k=1).
### Interpretation
This chart likely compares different machine learning models, algorithms, or training strategies. The data suggests:
* **Sample Efficiency:** The black-dotted method is highly sample-efficient, extracting significant performance gains from the first few samples. This could indicate a more complex model, a better inductive bias for the task, or the use of superior features.
* **Model Behavior:** The red line's steady climb suggests a model whose performance scales more predictably with data, possibly a simpler or more robust algorithm that hasn't yet reached its capacity.
* **Performance Plateau:** The cyan lines represent methods that quickly reach a performance ceiling. This plateau could be due to model simplicity, high bias, or an inherent limitation in the approach that more data cannot overcome.
* **Practical Implication:** If data collection is expensive (low k), the black method is the clear choice. If massive datasets are available (k >> 10), the red method's trajectory suggests it might continue to close the gap, though the black method's lead is substantial. The cyan methods appear best suited for scenarios with limited data (k < 5) where they are competitive, but they are not optimal for leveraging larger datasets.
The chart effectively communicates that not all methods benefit equally from more data, and the choice of algorithm should consider both the expected dataset size and the desired performance ceiling.
</details>
(a) LN-Super-49B
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size (k)
### Overview
The image is a line chart plotting "Accuracy" on the y-axis against "Sample Size (k)" on the x-axis. It displays the performance of four distinct methods or models, each represented by a unique line style and color, as the amount of training data (sample size) increases from 1 to 10.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Sample Size (k)". It has major tick marks and labels for integer values from 1 to 10.
* **Y-Axis (Vertical):** Labeled "Accuracy". It has major tick marks and labels at intervals of 0.05, ranging from 0.55 to 0.75. The axis extends slightly beyond 0.75 at the top.
* **Legend:** There is no explicit legend box within the chart area. The four data series are distinguished solely by their line color, line style, and marker shape.
* **Grid:** A light gray grid is present, with vertical lines at each integer sample size and horizontal lines at each 0.05 accuracy increment.
### Detailed Analysis
The chart contains four data series. Their visual trends and approximate data points are as follows:
1. **Black Dotted Line with Upward-Pointing Triangle Markers:**
* **Trend:** Rises steeply from ~0.545 at k=1 to ~0.77 at k=10; the highest-accuracy series throughout.
2. **Red Solid Line with Circle Markers:** climbs near-linearly from ~0.545 to ~0.655, overtaking the cyan line around k=6.
3. **Cyan Solid Line with Diamond Markers:** gains quickly up to k≈3, then plateaus near 0.64 from k≈5 onward.
4. **Blue Solid Line with Square Markers:** improves only marginally, ending lowest at ~0.625.

All four series start at ~0.545 at k=1, and the black line's lead widens with k. The markers match the legend of the legended panels in this figure, suggesting the series are pass@k (Oracle), majority@k, short-3@k, and short-1@k, respectively.
</details>
(b) R1-32B
<details>
<summary>x36.png Details</summary>

### Visual Description
Line chart of Accuracy (0.72-0.82) vs. Sample Size (k), k = 1-10; no legend is visible, but the markers match the legended panels of this figure (black dotted triangles = pass@k (Oracle), red circles = majority@k, cyan diamonds = short-3@k, blue squares = short-1@k). All series start near 0.723-0.725 at k=1. The black line rises steeply to ~0.818 at k=10; red climbs near-linearly to ~0.800; cyan peaks at ~0.790 around k=5 and eases to ~0.782; blue peaks at ~0.756 at k=3 and declines steadily to ~0.708.
</details>
(c) QwQ-32B
<details>
<summary>x37.png Details</summary>

### Visual Description
Line chart of Accuracy (~0.825-0.895) vs. Sample Size (k), k = 1-10, with legend: pass@k (Oracle; black dotted triangles), majority@k (dark red circles), short-1@k (Ours; blue squares), short-3@k (Ours; cyan diamonds). All series start at ~0.825 at k=1. pass@k climbs steadily to ~0.892 at k=10; short-3@k rises quickly to ~0.877 by k=6 and plateaus near 0.879, staying above majority@k, which increases near-linearly to ~0.875; short-1@k peaks at ~0.846 around k=4-5 and drifts slightly down to ~0.842.
</details>
(d) R1-670B
Figure 10: AIME 2025 - sample size ( $k$ ) comparison.
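The aggregation rules compared in these panels can be sketched in a few lines. This is a minimal illustration based on the method description in the abstract, not the authors' implementation: the `(answer, thinking_tokens)` pairs, the use of thinking-token counts as a proxy for which chains finish first, and the tie-breaking behavior are all assumptions for the sketch.

```python
from collections import Counter

def short_m_at_k(samples, m):
    """short-m@k: keep the m chains whose 'thinking' finishes first
    (approximated here by the shortest thinking-token count) and
    majority-vote their answers.
    `samples` is a list of (answer, thinking_tokens) pairs."""
    fastest = sorted(samples, key=lambda s: s[1])[:m]
    votes = Counter(ans for ans, _ in fastest)
    # Counter.most_common breaks ties by first-seen order among the
    # selected chains; the paper's exact tie-breaking rule may differ.
    return votes.most_common(1)[0][0]

def majority_at_k(samples):
    """majority@k: plain majority voting over all k answers."""
    return Counter(ans for ans, _ in samples).most_common(1)[0][0]

# Toy example: 5 sampled chains as (answer, thinking-token count).
chains = [("42", 900), ("41", 4100), ("42", 1200), ("41", 3800), ("41", 3500)]
print(short_m_at_k(chains, m=3))  # votes among the 3 shortest chains: 42, 42, 41 -> "42"
print(majority_at_k(chains))      # votes among all 5 chains: 41 wins 3-2 -> "41"
```

The toy example shows how the two rules can disagree: when longer chains are more often wrong, restricting the vote to the chains that finish first flips the outcome.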
<details>
<summary>x38.png Details</summary>

### Visual Description
Line chart of Accuracy (0.35-0.65) vs. Sample Size (k), k = 1-10, with the legend in the top-left. All series start near ~0.325 at k=1. The black dotted triangle line (pass@k (Oracle) by the shared marker convention) grows logarithmically to ~0.645 at k=10, roughly 0.20 above the next-best series; the red circle line (majority@k) rises near-linearly to ~0.445, passing the two cyan lines around k=6; the square- and diamond-marked lines (short-1@k and short-3@k) track each other closely and plateau near 0.415-0.425 after k≈5.
</details>
(a) LN-Super-49B
<details>
<summary>x39.png Details</summary>

### Visual Description
Line chart of Accuracy (0.40-0.60) vs. Sample Size (k), k = 1-10. The markers match the shared convention of the legended panels, suggesting: black dotted triangles = pass@k (Oracle), dark red circles = majority@k, cyan diamonds = short-3@k, blue squares = short-1@k. All series start at ~0.37 at k=1. The black line climbs steadily to ~0.605 at k=10; red rises moderately to ~0.475, overtaking the cyan line (which plateaus at ~0.455 after k≈6) by k=6; blue flattens at ~0.415 and dips slightly at large k.
</details>
(b) R1-32B
<details>
<summary>x40.png Details</summary>

### Visual Description
Line chart of Accuracy (~0.48-0.75) vs. Sample Size (k), k = 1-10; the legend labels are cropped in the image, but the markers follow the shared convention of the legended panels. All series start at ~0.48 at k=1. The black dotted triangle line (pass@k (Oracle)) rises steeply to ~0.75 at k=10; the cyan diamond line (short-3@k) reaches ~0.60; the blue square line (short-1@k) climbs to ~0.578 but is overtaken between k=6 and k=7 by the red circle line (majority@k), which continues to ~0.59.
</details>
(c) QwQ-32B
<details>
<summary>x41.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Different Sampling Methods
### Overview
The image is a line chart comparing the performance of four different methods or metrics as a function of sample size (`k`). The chart plots "Accuracy" on the y-axis against "Sample Size (k)" on the x-axis. All four lines show an increasing trend, indicating that accuracy improves with a larger sample size. The chart includes a legend to identify each line.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** `Sample Size (k)`
* **Scale:** Linear, with major tick marks and labels at integer values from 1 to 10.
* **Y-Axis (Vertical):**
* **Label:** `Accuracy`
* **Scale:** Linear, with major tick marks and labels at intervals of 0.025, ranging from 0.675 to 0.875.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries, each with a distinct line style, marker, and color.
1. `pass@k (Oracle)`: Black, dotted line with upward-pointing triangle markers.
2. `majority@k`: Dark red/brown, solid line with circle markers.
3. `short-1@k (Ours)`: Light blue, solid line with square markers.
4. `short-3@k (Ours)`: Cyan/teal, solid line with diamond markers.
* **Grid:** A light gray grid is present, aligned with the major ticks on both axes.
### Detailed Analysis
**Data Series and Approximate Values:**
The following table lists the approximate accuracy values for each method at each sample size `k`, derived from visual inspection of the chart. Values are approximate (`~`).
| k | pass@k (Oracle) | majority@k | short-1@k (Ours) | short-3@k (Ours) |
| :--- | :--- | :--- | :--- | :--- |
| 1 | ~0.680 | ~0.680 | ~0.680 | ~0.680 |
| 2 | ~0.760 | ~0.720 | ~0.745 | ~0.745 |
| 3 | ~0.795 | ~0.725 | ~0.775 | ~0.780 |
| 4 | ~0.815 | ~0.750 | ~0.790 | ~0.805 |
| 5 | ~0.830 | ~0.765 | ~0.805 | ~0.825 |
| 6 | ~0.845 | ~0.775 | ~0.815 | ~0.835 |
| 7 | ~0.855 | ~0.785 | ~0.820 | ~0.845 |
| 8 | ~0.865 | ~0.795 | ~0.825 | ~0.855 |
| 9 | ~0.875 | ~0.805 | ~0.830 | ~0.865 |
| 10 | ~0.885 | ~0.810 | ~0.830 | ~0.870 |
**Trend Verification:**
* **pass@k (Oracle):** This black dotted line with triangle markers has the steepest initial slope and maintains the highest position throughout. It shows a strong, consistent upward trend, suggesting an upper-bound or ideal performance.
* **majority@k:** This dark red line with circle markers has the shallowest slope and remains the lowest-performing method for all `k > 1`. It shows a steady but slower rate of improvement.
* **short-1@k (Ours):** This light blue line with square markers starts at the same point as others at `k=1`. It rises quickly initially but its slope decreases more noticeably after `k=5`, appearing to plateau slightly towards `k=10`.
* **short-3@k (Ours):** This cyan line with diamond markers closely follows the `short-1@k` line for `k=1` and `k=2`, then begins to diverge upward. It maintains a position between `short-1@k` and the `pass@k (Oracle)` line, showing a strong and sustained upward trend.
### Key Observations
1. **Universal Improvement:** All four methods show improved accuracy as the sample size `k` increases from 1 to 10.
2. **Performance Hierarchy:** A clear and consistent performance hierarchy is established for `k >= 2`: `pass@k (Oracle)` > `short-3@k (Ours)` > `short-1@k (Ours)` > `majority@k`.
3. **Convergence at k=1:** All four lines originate from approximately the same accuracy point (~0.680) at `k=1`.
4. **Divergence with k:** The performance gap between the methods widens as `k` increases. The gap between the best (`pass@k`) and worst (`majority@k`) is largest at `k=10`.
5. **Proposed Methods' Efficacy:** The two methods labeled "(Ours)" both significantly outperform the `majority@k` baseline. `short-3@k` demonstrates a clear advantage over `short-1@k` for `k >= 3`.
### Interpretation
This chart likely comes from a research paper in machine learning or statistics, evaluating methods for improving prediction or generation accuracy by using multiple samples (`k`). The key takeaways are:
* **The "Oracle" is an Upper Bound:** The `pass@k (Oracle)` line represents an idealized, best-possible scenario (perhaps where the correct answer is always selected from the `k` samples). It serves as a benchmark for the maximum achievable accuracy.
* **"Ours" Methods are Effective:** The proposed methods (`short-1@k` and `short-3@k`) provide a substantial improvement over the simple `majority@k` voting baseline. This suggests the authors' technique for selecting or aggregating from multiple samples is successful.
* **More Samples Help, but with Diminishing Returns:** While accuracy increases with `k` for all methods, the rate of improvement slows down, especially for `short-1@k`. This indicates a trade-off between computational cost (more samples) and performance gain.
* **`short-3` Outperforms `short-1`:** The `short-3@k` method's superior performance over `short-1@k` suggests that the specific design choice it represents (perhaps considering a different subset or using a different aggregation strategy) is more effective. The gap between them and the Oracle line shows there is still room for improvement in the selection/aggregation algorithms.
In summary, the data demonstrates that the authors' proposed sampling strategies are effective, with `short-3@k` being particularly promising, as they approach the theoretical upper bound of performance more closely than the baseline method as more samples are used.
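The short-m@k rule these curves evaluate can be sketched in a few lines: run k generations in parallel, keep the m whose thinking finishes first (equivalently, the m shortest chains), and majority-vote over their answers. A minimal illustration; the `(thinking_tokens, answer)` record format is a hypothetical stand-in, not the paper's code:

```python
from collections import Counter

def short_m_at_k(generations, m):
    """short-m@k selection: of k parallel generations, keep the m whose
    thinking finishes first and majority-vote over their final answers.

    `generations` is a hypothetical list of (thinking_tokens, answer)
    pairs; under parallel decoding, shorter thinking chains finish first.
    """
    # The m shortest thinking chains are the first to terminate.
    first_done = sorted(generations, key=lambda g: g[0])[:m]
    # Majority vote among those m early finishers.
    votes = Counter(answer for _, answer in first_done)
    return votes.most_common(1)[0][0]
```

With `m=1` this reduces to taking the single shortest chain (short-1@k); short-3@k corresponds to `m=3`.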
</details>
(d) R1-670B
Figure 11: HMMT Feb 2025 - sample size ($k$) comparison.
<details>
<summary>x42.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Reasoning Methods
### Overview
This is a line chart plotting model accuracy against computational cost, measured in "thinking tokens." It compares the performance scaling of four distinct reasoning or prompting methods as more computational resources (thinking tokens) are allocated. The chart demonstrates that all methods improve with more compute, but at significantly different rates and with different efficiency profiles.
### Components/Axes
* **X-Axis:** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from 0 to 120, with major tick marks at 20, 40, 60, 80, 100, and 120. The unit is thousands of tokens.
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0.60 to 0.80, with major tick marks at 0.60, 0.65, 0.70, 0.75, and 0.80.
* **Legend:** Positioned in the top-left corner of the chart area. It contains four entries, each with a distinct line style and marker:
1. **Black dotted line with upward-pointing triangles (▲):** "pass@k (Oracle)"
2. **Cyan solid line with diamonds (◆):** "short-3@k (Ours)"
3. **Cyan solid line with squares (■):** "short-1@k (Ours)"
4. **Red solid line with circles (●):** "majority@k"
* **Grid:** A light gray grid is present, aligning with the major ticks on both axes.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate):**
1. **pass@k (Oracle) - Black dotted line with triangles:**
* **Trend:** Shows the steepest, near-linear upward slope. It demonstrates the highest accuracy gain per unit of additional compute.
* **Data Points (Compute k, Accuracy):** Starts at (~10k, ~0.57). Passes through (~20k, ~0.67), (~30k, ~0.72), (~40k, ~0.76), (~50k, ~0.79), (~60k, ~0.81), and ends near (~70k, ~0.83).
2. **short-3@k (Ours) - Cyan solid line with diamonds:**
* **Trend:** Shows a strong, slightly curving upward slope. It is the second most efficient method, closely following pass@k initially but with a slightly less steep trajectory at higher compute.
* **Data Points (Compute k, Accuracy):** Starts at (~10k, ~0.57). Passes through (~20k, ~0.64), (~30k, ~0.68), (~40k, ~0.71), (~50k, ~0.73), (~60k, ~0.75), (~70k, ~0.77), and ends near (~80k, ~0.78).
3. **short-1@k (Ours) - Cyan solid line with squares:**
* **Trend:** Shows a moderate upward curve that begins to plateau earlier than short-3@k. Its slope is less steep than both pass@k and short-3@k.
* **Data Points (Compute k, Accuracy):** Starts at (~10k, ~0.57). Passes through (~20k, ~0.65), (~30k, ~0.70), (~40k, ~0.72), (~50k, ~0.74), (~60k, ~0.75), and ends near (~70k, ~0.75). The line appears to terminate or merge with the short-3@k line around 70k.
4. **majority@k - Red solid line with circles:**
* **Trend:** Shows the most gradual, shallow upward slope. It requires significantly more compute to achieve smaller gains in accuracy compared to the other methods.
* **Data Points (Compute k, Accuracy):** Starts at (~10k, ~0.57). Passes through (~20k, ~0.59), (~30k, ~0.62), (~40k, ~0.64), (~50k, ~0.66), (~60k, ~0.68), (~70k, ~0.69), (~80k, ~0.70), (~90k, ~0.71), (~100k, ~0.71), and ends near (~120k, ~0.72).
**Spatial Grounding & Cross-Reference:**
* All four lines originate from approximately the same point at the lowest compute value (~10k tokens, ~0.57 accuracy).
* The legend in the top-left correctly maps to the lines: The black dotted line with triangles is the highest line (pass@k). The two cyan lines are in the middle, with the diamond-marked line (short-3@k) ultimately above the square-marked line (short-1@k). The red line with circles (majority@k) is consistently the lowest line.
### Key Observations
1. **Universal Starting Point:** All methods begin at near-identical performance for minimal compute, suggesting a common baseline capability.
2. **Diverging Efficiency:** A dramatic divergence in scaling efficiency occurs immediately after the starting point. pass@k scales most efficiently, followed by short-3@k, then short-1@k, with majority@k being the least compute-efficient.
3. **Crossover Point:** The short-3@k (diamonds) and short-1@k (squares) lines cross between 40k and 50k tokens. Below this point, short-1@k has a slight edge; above it, short-3@k becomes more accurate for the same compute.
4. **Plateauing:** short-1@k shows the earliest signs of performance plateauing, with its curve flattening noticeably after ~60k tokens. majority@k also shows a very gradual slope, indicating diminishing returns.
5. **No Upper Bound Shown:** The chart does not show a clear performance ceiling for any method within the plotted range, though the slopes suggest pass@k and short-3@k would continue to rise if the x-axis were extended.
### Interpretation
This chart provides a clear empirical comparison of the "return on investment" for different AI reasoning strategies. It answers the critical question: "If I spend more computational budget on making a model 'think harder,' which method will give me the most accuracy boost?"
* **What the data suggests:** **pass@k (Oracle)** converts additional thinking compute into accuracy most efficiently, but it is an idealized upper bound (it assumes the correct answer can always be selected among the k samples). Among the practical methods, **short-3@k (Ours)** offers the best balance of accuracy and compute, **short-1@k (Ours)** provides intermediate gains but hits diminishing returns sooner, and **majority@k** is the least efficient, requiring substantially more tokens for smaller improvements.
* **Relationship between elements:** The chart illustrates a fundamental trade-off in test-time compute: the choice of aggregation strategy dramatically impacts computational cost-effectiveness. The crossing of the short-3@k and short-1@k lines indicates there is no universally "best" variant; the optimal choice can depend on the available compute budget.
* **Notable implications:** For applications where compute cost or latency is a primary constraint, short-1@k is attractive at low budgets and short-3@k at larger ones. majority@k's shallow slope suggests that simply sampling more and voting is a poor use of thinking tokens here. The chart strongly argues that simply "adding more compute" is not a uniform strategy; the method used to structure that compute is paramount.
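Compute-vs-accuracy curves like this one can be reconstructed from raw samples by sweeping the budget k and recording, for each k, the total thinking tokens consumed and whether the aggregated answer is correct. A minimal sketch for a majority-vote series; the per-sample `(thinking_tokens, answer)` record format is an assumption, and real curves average over many questions:

```python
from collections import Counter

def majority_curve(samples, correct_answer, max_k):
    """Trace (thinking_tokens, accuracy) points for majority voting over
    the first k samples, k = 1..max_k, on a single question.

    `samples` is a hypothetical list of (thinking_tokens, answer) pairs
    in sampling order; published curves average this over a benchmark.
    """
    points = []
    for k in range(1, max_k + 1):
        subset = samples[:k]
        tokens = sum(t for t, _ in subset)  # x-axis: total thinking tokens
        vote = Counter(a for _, a in subset).most_common(1)[0][0]
        points.append((tokens, float(vote == correct_answer)))  # y-axis: 0/1
    return points
```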
</details>
(a) LN-Super-49B
<details>
<summary>x43.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Models
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in "thinking tokens." It displays four distinct data series, each representing a different model or method, showing how their performance scales with increased compute. The chart demonstrates a clear relationship where accuracy generally increases with more thinking compute, but at different rates and to different saturation points for each series.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from 0 to 100, with major tick marks at 20, 40, 60, 80, and 100. The unit is thousands of tokens.
* **Y-Axis (Vertical):** Labeled "Accuracy". The scale runs from 0.72 to 0.86, with major tick marks at 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84, and 0.86.
* **Data Series (Legend Inferred from Visuals):** There is no explicit legend box. The four series are distinguished by color, line style, and marker shape.
1. **Black, dotted line with upward-pointing triangle markers.**
2. **Cyan (bright blue), solid line with diamond markers.**
3. **Red (dark red/maroon), solid line with circle markers.**
4. **Light blue, solid line with square markers.**
* **Grid:** A light gray grid is present, aligning with the major ticks on both axes.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **Black Dotted Line (Triangles):**
* **Trend:** Shows the steepest initial ascent, indicating the highest efficiency in converting compute to accuracy. It begins to plateau at the highest accuracy level among all series.
* **Data Points (Compute k, Accuracy):** (~10, 0.71), (~15, 0.80), (~20, 0.825), (~25, 0.838), (~30, 0.845), (~35, 0.85), (~40, 0.855), (~45, 0.858), (~50, 0.86), (~55, 0.865).
2. **Cyan Solid Line (Diamonds):**
* **Trend:** Shows a strong, steady increase in accuracy that is less steep than the black line initially but continues to climb robustly across the measured range.
* **Data Points (Compute k, Accuracy):** (~10, 0.71), (~15, 0.76), (~20, 0.775), (~25, 0.785), (~30, 0.80), (~35, 0.812), (~40, 0.82), (~45, 0.826), (~50, 0.831), (~55, 0.835), (~60, 0.839), (~65, 0.841).
3. **Red Solid Line (Circles):**
* **Trend:** Exhibits a more gradual, concave-downward curve. It starts lower than the cyan line but surpasses the light blue line and continues to improve steadily, though at a slower rate than the cyan line.
* **Data Points (Compute k, Accuracy):** (~10, 0.71), (~20, 0.74), (~30, 0.772), (~40, 0.797), (~50, 0.809), (~60, 0.817), (~70, 0.822), (~80, 0.826), (~90, 0.828), (~100, 0.83).
4. **Light Blue Solid Line (Squares):**
* **Trend:** Rises quickly at very low compute but then flattens out dramatically, showing strong early saturation. It has the lowest final accuracy of the four series.
* **Data Points (Compute k, Accuracy):** (~10, 0.71), (~15, 0.76), (~20, 0.783), (~25, 0.788), (~30, 0.792), (~35, 0.794), (~40, 0.795), (~45, 0.797), (~50, 0.799).
### Key Observations
* **Performance Hierarchy:** At any given compute level above ~15k tokens, the models maintain a consistent performance order from highest to lowest accuracy: Black > Cyan > Red > Light Blue.
* **Efficiency vs. Saturation:** The black line is the most compute-efficient, reaching ~0.86 accuracy with only ~55k tokens. The light blue line is the least efficient for high accuracy, saturating below 0.80.
* **Convergence at Origin:** All four lines appear to originate from the same point at approximately (10k tokens, 0.71 accuracy), suggesting a common baseline performance with minimal compute.
* **Divergence:** The lines diverge immediately and significantly, highlighting fundamental differences in how each model utilizes additional compute.
### Interpretation
This chart illustrates a core trade-off in AI model design: the relationship between computational investment ("thinking compute") and task performance ("accuracy"). The data suggests:
1. **Diminishing Returns are Model-Dependent:** All models show diminishing returns (the curves flatten), but the point at which returns diminish sharply varies. The light blue model hits this point very early, while the black and cyan models sustain useful gains over a much wider compute range.
2. **Architectural or Methodological Differences:** The starkly different curves imply the models are not just scaled versions of each other. The black-dotted series likely represents a fundamentally more efficient architecture or training paradigm for this specific task, achieving superior accuracy with less computational effort.
3. **Practical Implications:** For applications where compute is cheap or latency is not critical, the cyan or red models might be acceptable. However, for scenarios demanding maximum accuracy per unit of compute (e.g., real-time systems, large-scale deployment), the model represented by the black line is clearly superior. The light blue model appears unsuitable for high-accuracy requirements regardless of compute budget.
4. **The "Thinking" Paradigm:** The axis label "thinking tokens" frames compute as an active reasoning process. The chart validates that for these models, "more thinking" generally leads to better answers, but the quality of the "thinking mechanism" (the model itself) is the primary determinant of final performance.
</details>
(b) R1-32B
<details>
<summary>x44.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in "thinking tokens." It compares the performance of four distinct models or methods, each represented by a unique line style and color. The chart demonstrates how accuracy improves as more computational resources (thinking tokens) are allocated, with all models showing diminishing returns.
### Components/Axes
* **Y-Axis (Vertical):** Labeled **"Accuracy"**. The scale ranges from **0.80 to 0.90**, with major gridlines at intervals of 0.02 (0.80, 0.82, 0.84, 0.86, 0.88, 0.90).
* **X-Axis (Horizontal):** Labeled **"Thinking Compute (thinking tokens in thousands)"**. The scale ranges from **20 to 140**, with major gridlines at intervals of 20 (20, 40, 60, 80, 100, 120, 140).
* **Legend:** Positioned in the **top-left corner** of the chart area. It contains four entries:
1. **Black dotted line with upward-pointing triangle markers (▲).**
2. **Cyan (bright blue), solid line with diamond markers (◆).**
3. **Red (dark red/maroon), solid line with circle markers (●).**
4. **Light blue, solid line with square markers (■).**
* **Grid:** A light gray grid is present, aiding in value estimation.
### Detailed Analysis
The chart displays four data series, each showing a positive, concave-down trend (increasing at a decreasing rate).
1. **Black Dotted Line (▲):**
* **Trend:** This line exhibits the steepest initial slope and achieves the highest overall accuracy. It shows the most significant gains from increased compute.
* **Approximate Data Points:**
* At ~15k tokens: Accuracy ≈ 0.797
* At ~25k tokens: Accuracy ≈ 0.844
* At ~40k tokens: Accuracy ≈ 0.862
* At ~60k tokens: Accuracy ≈ 0.880
* At ~80k tokens: Accuracy ≈ 0.898
* At ~95k tokens: Accuracy ≈ 0.907 (highest point on the chart)
2. **Cyan Line (◆):**
* **Trend:** This line has the second-steepest slope, consistently performing below the black line but above the others.
* **Approximate Data Points:**
* At ~15k tokens: Accuracy ≈ 0.797 (similar starting point to others)
* At ~25k tokens: Accuracy ≈ 0.829
* At ~40k tokens: Accuracy ≈ 0.846
* At ~60k tokens: Accuracy ≈ 0.860
* At ~80k tokens: Accuracy ≈ 0.869
* At ~105k tokens: Accuracy ≈ 0.879
3. **Light Blue Line (■):**
* **Trend:** This line follows a path very close to the cyan line initially but begins to plateau earlier and at a lower accuracy level.
* **Approximate Data Points:**
* At ~15k tokens: Accuracy ≈ 0.797
* At ~25k tokens: Accuracy ≈ 0.829
* At ~40k tokens: Accuracy ≈ 0.846
* At ~60k tokens: Accuracy ≈ 0.854
* At ~80k tokens: Accuracy ≈ 0.859
* At ~95k tokens: Accuracy ≈ 0.862
4. **Red Line (●):**
* **Trend:** This line has the shallowest slope, indicating the least accuracy gain per unit of additional compute. It plateaus at the lowest accuracy level.
* **Approximate Data Points:**
* At ~15k tokens: Accuracy ≈ 0.797
* At ~40k tokens: Accuracy ≈ 0.825
* At ~60k tokens: Accuracy ≈ 0.836
* At ~80k tokens: Accuracy ≈ 0.844
* At ~100k tokens: Accuracy ≈ 0.846
* At ~135k tokens: Accuracy ≈ 0.852
### Key Observations
1. **Performance Hierarchy:** A clear and consistent performance hierarchy is established across nearly the entire compute range: Black (▲) > Cyan (◆) > Light Blue (■) > Red (●).
2. **Diminishing Returns:** All four models exhibit diminishing returns; the accuracy gain from each additional thousand tokens decreases as total compute increases.
3. **Convergence at Low Compute:** At the lowest compute level shown (~15k tokens), all four models start at approximately the same accuracy (~0.797).
4. **Divergence with Scale:** As compute increases, the models diverge significantly. The gap between the best (Black) and worst (Red) performing models widens from near-zero at 15k tokens to over 0.05 accuracy points at 100k tokens.
5. **Plateau Points:** The Light Blue (■) and Red (●) lines show clearer signs of plateauing (flattening) within the displayed range compared to the Black (▲) and Cyan (◆) lines, which are still rising more noticeably at their rightmost data points.
### Interpretation
This chart illustrates a fundamental trade-off in machine learning and AI: the relationship between computational cost ("thinking compute") and model performance ("accuracy").
* **Efficiency Comparison:** The black-dotted method is the most "compute-efficient," achieving superior accuracy at every comparable level of compute beyond the starting point. The red method is the least efficient.
* **Strategic Implications:** The data suggests that for applications where high accuracy is critical, investing in the method represented by the black line yields the best returns, despite potentially higher inherent costs. For resource-constrained environments, one might choose the cyan or light blue methods as a balance, accepting lower peak accuracy for potentially lower operational costs.
* **Underlying Phenomenon:** The concave-down shape of all curves is characteristic of many scaling laws in AI, where performance improves predictably with scale but eventually saturates. The different curves likely represent different model architectures, training techniques, or algorithms, with the black-dotted line embodying a more advanced or optimized approach.
* **Investigative Insight:** The fact that all models start at the same accuracy suggests they may share a common base or were evaluated on the same initial, low-compute task. Their divergence reveals how their underlying designs respond differently to the allocation of greater computational resources for "thinking." The chart doesn't show the absolute maximum possible accuracy (the ceiling), only the trajectory of these four specific approaches.
</details>
(c) QwQ-32B
<details>
<summary>x45.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Methods
### Overview
The image is a line chart comparing the performance of four different methods or models. The chart plots "Accuracy" on the vertical axis against "Thinking Compute" (measured in thousands of thinking tokens) on the horizontal axis. The primary purpose is to show how the accuracy of each method scales with increased computational resources (thinking tokens). The chart contains four distinct data series, each represented by a unique line style, color, and marker.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale ranging from approximately 0.84 to 0.93.
* **Major Ticks:** 0.84, 0.86, 0.88, 0.90, 0.92.
* **X-Axis (Horizontal):**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear scale ranging from approximately 20 to 175.
* **Major Ticks:** 25, 50, 75, 100, 125, 150, 175.
* **Legend:**
* **Position:** Bottom-right quadrant of the chart area.
* **Entries (from top to bottom as listed):**
1. `pass@k (Oracle)`: Represented by a black, dotted line with upward-pointing triangle markers (▲).
2. `majority@k`: Represented by a solid, dark red (maroon) line with circle markers (●).
3. `short-1@k (Ours)`: Represented by a solid, light blue (cyan) line with square markers (■).
4. `short-3@k (Ours)`: Represented by a solid, teal (blue-green) line with diamond markers (◆).
### Detailed Analysis
**1. `pass@k (Oracle)` (Black Dotted Line, ▲):**
* **Trend:** Shows a steep, concave-down increase in accuracy with compute, exhibiting strong diminishing returns. It is the highest-performing series across the entire range.
* **Approximate Data Points:**
* At ~20k tokens: Accuracy ≈ 0.84
* At 50k tokens: Accuracy ≈ 0.90
* At 75k tokens: Accuracy ≈ 0.92
* At 100k tokens: Accuracy ≈ 0.925
* At 125k tokens: Accuracy ≈ 0.928 (appears to plateau near this value).
**2. `majority@k` (Dark Red Solid Line, ●):**
* **Trend:** Shows a steady, nearly linear increase in accuracy with compute. It starts as the lowest-performing method but eventually surpasses `short-1@k`.
* **Approximate Data Points:**
* At ~20k tokens: Accuracy ≈ 0.84
* At 50k tokens: Accuracy ≈ 0.863
* At 75k tokens: Accuracy ≈ 0.885
* At 100k tokens: Accuracy ≈ 0.896
* At 125k tokens: Accuracy ≈ 0.905
* At 150k tokens: Accuracy ≈ 0.913
* At ~170k tokens: Accuracy ≈ 0.924
**3. `short-1@k (Ours)` (Light Blue Solid Line, ■):**
* **Trend:** Shows an initial increase, peaks, and then begins to decline. This suggests a potential overfitting or efficiency loss at higher compute levels for this specific method.
* **Approximate Data Points:**
* At ~20k tokens: Accuracy ≈ 0.84
* At 35k tokens: Accuracy ≈ 0.874
* At 50k tokens: Accuracy ≈ 0.879
* At 65k tokens: Accuracy ≈ 0.881 (peak)
* At 80k tokens: Accuracy ≈ 0.880
* At 100k tokens: Accuracy ≈ 0.877
* At 120k tokens: Accuracy ≈ 0.870
**4. `short-3@k (Ours)` (Teal Solid Line, ◆):**
* **Trend:** Shows a strong, concave-down increase similar to the Oracle but at a lower absolute accuracy. It consistently outperforms `short-1@k` and `majority@k` for most of the range, plateauing at higher compute.
* **Approximate Data Points:**
* At ~20k tokens: Accuracy ≈ 0.84
* At 35k tokens: Accuracy ≈ 0.864
* At 50k tokens: Accuracy ≈ 0.894
* At 65k tokens: Accuracy ≈ 0.906
* At 80k tokens: Accuracy ≈ 0.913
* At 100k tokens: Accuracy ≈ 0.920
* At 125k tokens: Accuracy ≈ 0.922
* At 140k tokens: Accuracy ≈ 0.922 (plateau).
### Key Observations
1. **Performance Hierarchy:** The Oracle (`pass@k`) sets the upper bound. Among the non-oracle methods, `short-3@k (Ours)` is the top performer for compute budgets above ~40k tokens. `majority@k` shows the most consistent scaling without degradation.
2. **Divergent Scaling:** The two "Ours" methods (`short-1@k` and `short-3@k`) exhibit fundamentally different scaling behaviors. `short-3@k` scales well, while `short-1@k` peaks and regresses, indicating that the "short-3" variant is more robust to increased compute.
3. **Crossover Point:** The `majority@k` line crosses above the `short-1@k` line at approximately 80k thinking tokens. Before this point, `short-1@k` is more accurate; after, `majority@k` is superior.
4. **Convergence at Low Compute:** All four methods start at nearly the same accuracy point (~0.84) when thinking compute is very low (~20k tokens), suggesting a common baseline performance.
### Interpretation
This chart likely comes from research on scaling inference-time compute ("thinking tokens") for language models or reasoning systems. The data suggests several key insights:
* **Value of Increased Compute:** For most methods, allocating more thinking tokens leads to higher accuracy, validating the core hypothesis that "thinking more" can improve performance.
* **Method Efficiency Matters:** The stark difference between `short-1@k` and `short-3@k` demonstrates that not all methods benefit equally from extra compute. The "short-3" approach is architecturally or algorithmically better at converting additional tokens into accuracy gains. The decline of `short-1@k` could indicate it starts generating redundant or counterproductive reasoning steps at high token counts.
* **Oracle as a Benchmark:** The `pass@k (Oracle)` line represents an idealized upper bound (perhaps using ground-truth selection). The gap between it and `short-3@k` shows the remaining potential for improvement in the proposed method.
* **Practical Trade-offs:** The choice of method depends on the available compute budget. For very low budgets (<40k tokens), the methods are similar. For medium budgets (40k-80k), `short-3@k` is best. For very high budgets where `short-3@k` plateaus, `majority@k` continues to improve slowly and might eventually catch up, though it requires significantly more tokens to reach the same accuracy level that `short-3@k` achieves earlier.
The chart effectively argues for the superiority of the `short-3@k (Ours)` method in the mid-to-high compute regime, while honestly showing the limitations of its `short-1@k` counterpart.
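For reference, pass@k oracle curves like the one above are commonly computed with the standard unbiased estimator over n sampled generations of which c are correct; whether the paper uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n, c, k):
    """Standard unbiased pass@k estimator: the probability that at least
    one of k samples drawn without replacement from n generations is
    correct, given that c of the n generations are correct.
    """
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, `pass_at_k(2, 1, 1)` gives 0.5.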
</details>
(d) R1-670B
Figure 12: AIME 2024 - thinking compute comparison.
<details>
<summary>x46.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in thinking tokens. It displays four distinct data series, each represented by a different colored line with unique markers, showing how accuracy scales with increased "thinking compute." The chart demonstrates a clear positive correlation between compute and accuracy for all series, but with varying rates of improvement and performance ceilings.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear scale from 0 to 120, with major tick marks and grid lines at intervals of 20 (0, 20, 40, 60, 80, 100, 120).
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale from 0.50 to approximately 0.78, with major tick marks and grid lines at intervals of 0.05 (0.50, 0.55, 0.60, 0.65, 0.70, 0.75).
* **Data Series (Legend inferred from visual markers):**
1. **Black, dotted line with upward-pointing triangle markers (▲).** Positioned as the top-most line.
2. **Cyan (light blue) solid line with diamond markers (◆).** Positioned in the middle-upper range.
3. **Cyan (light blue) solid line with square markers (■).** Positioned just below the diamond-marked cyan line.
4. **Dark red solid line with circle markers (●).** Starts as the lowest line but crosses above the square-marked cyan line at higher compute values.
### Detailed Analysis
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **Black Dotted Line (▲):**
* **Trend:** Shows the steepest initial ascent and continues to rise throughout the charted range, maintaining the highest accuracy at every compute level after the initial point.
* **Key Points:**
* At ~10k tokens: Accuracy ≈ 0.51
* At ~20k tokens: Accuracy ≈ 0.63
* At ~40k tokens: Accuracy ≈ 0.72
* At ~60k tokens: Accuracy ≈ 0.76
* At ~80k tokens: Accuracy ≈ 0.78 (highest point on the chart)
2. **Cyan Line with Diamonds (◆):**
* **Trend:** Rises steadily, with a slightly decreasing slope, and appears to plateau near the end of its plotted range.
* **Key Points:**
* At ~10k tokens: Accuracy ≈ 0.51
* At ~30k tokens: Accuracy ≈ 0.60
* At ~50k tokens: Accuracy ≈ 0.64
* At ~70k tokens: Accuracy ≈ 0.65
* At ~90k tokens: Accuracy ≈ 0.655 (final point)
3. **Cyan Line with Squares (■):**
* **Trend:** Follows a similar trajectory to the diamond-marked cyan line but consistently achieves slightly lower accuracy. It also shows signs of plateauing.
* **Key Points:**
* At ~10k tokens: Accuracy ≈ 0.51
* At ~25k tokens: Accuracy ≈ 0.59
* At ~45k tokens: Accuracy ≈ 0.62
* At ~65k tokens: Accuracy ≈ 0.625
* At ~80k tokens: Accuracy ≈ 0.63 (final point)
4. **Dark Red Line with Circles (●):**
* **Trend:** Starts with the shallowest slope, lagging behind the cyan lines initially. However, it maintains a more consistent upward trajectory, eventually crossing above the square-marked cyan line around 65k tokens and continuing to rise.
* **Key Points:**
* At ~10k tokens: Accuracy ≈ 0.51
* At ~40k tokens: Accuracy ≈ 0.56
* At ~60k tokens: Accuracy ≈ 0.62
* At ~80k tokens: Accuracy ≈ 0.645
* At ~100k tokens: Accuracy ≈ 0.655
* At ~120k tokens: Accuracy ≈ 0.665 (final point, highest among non-black lines)
### Key Observations
1. **Universal Scaling Law:** All four methods show improved accuracy with increased thinking compute, suggesting a fundamental relationship between computational resources allocated to "thinking" and task performance.
2. **Performance Hierarchy:** A clear and consistent performance gap exists. The method represented by the black dotted line (▲) is significantly more compute-efficient, achieving much higher accuracy at every token budget.
3. **Diminishing Returns:** All curves exhibit diminishing returns; the slope (accuracy gain per additional thousand tokens) decreases as compute increases. This is most pronounced in the two cyan lines, which appear to approach an asymptote.
4. **Crossover Event:** The dark red line (●) demonstrates a different scaling profile. While less efficient at low compute, its more sustained improvement allows it to surpass the square-marked cyan line (■) at approximately 65k tokens, indicating it may have a higher performance ceiling with sufficient compute.
5. **Convergence at Origin:** All lines originate from approximately the same point (~10k tokens, ~0.51 accuracy), suggesting a common baseline performance with minimal thinking compute.
### Interpretation
This chart visualizes the **trade-off between computational cost and model performance** for different reasoning or "thinking" methodologies. The data suggests:
* **Methodological Superiority:** The approach behind the black dotted line is fundamentally more effective at converting thinking tokens into accuracy gains. This could indicate a more efficient architecture, better training, or a superior reasoning algorithm.
* **The Cost of Accuracy:** Achieving higher accuracy requires exponentially more compute. Moving from 0.65 to 0.78 accuracy (for the black line) requires roughly doubling the token budget from ~40k to ~80k.
* **Strategic Choice:** The optimal method depends on the available compute budget. For constrained environments (<60k tokens), the cyan methods (diamond or square) are competitive with each other and better than the red method. For unconstrained environments where maximum accuracy is the goal, the black method is unequivocally best. The red method presents an interesting case where investing in very high compute (>80k tokens) yields continued, albeit slow, gains that other plateauing methods cannot match.
* **Underlying Principle:** The chart provides empirical support for the hypothesis that allocating more tokens for internal "thinking" or chain-of-thought processing improves outcomes, a key concept in scaling language model capabilities. The variance between lines highlights that *how* those tokens are used is as important as *how many* are used.
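Crossover points like the one noted above (the dark red line overtaking the square-marked line near 65k tokens) can be located programmatically by interpolating between sampled (tokens, accuracy) points. A hypothetical helper, not code from the paper; it assumes both curves are sampled at the same token budgets:

```python
def crossover_point(curve_a, curve_b):
    """Estimate the token budget where curve_b first overtakes curve_a,
    using linear interpolation between sampled points. Each curve is a
    list of (tokens, accuracy) points at shared budgets; returns None if
    no crossover occurs within the sampled range.
    """
    for i in range(len(curve_a) - 1):
        x0, a0 = curve_a[i]
        x1, a1 = curve_a[i + 1]
        b0, b1 = curve_b[i][1], curve_b[i + 1][1]
        d0, d1 = a0 - b0, a1 - b1  # accuracy gap (a minus b) at each budget
        if d0 >= 0 and d1 < 0:     # gap changes sign: b overtakes a here
            return x0 + (x1 - x0) * d0 / (d0 - d1)
    return None
```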
</details>
(a) LN-Super-49B
<details>
<summary>x47.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, ~10-125k; accuracy ~0.55-0.78). Four series share the figure legend: pass@k (Oracle; black dotted, triangles), majority@k (red, circles), short-1@k (cyan, squares), and short-3@k (cyan, diamonds). All start near 0.54 accuracy at ~10k tokens; the oracle rises steeply to ~0.78 by ~80k, the two short-m@k curves plateau around 0.62-0.64 after ~40k, and majority@k climbs slowly but steadily to ~0.655 at ~120k.
</details>
(b) R1-32B
<details>
<summary>x48.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, ~15-160k; accuracy ~0.72-0.82). All four series start near 0.723 at ~15k tokens. By marker style they match the figure legend: the black dotted line (triangles; pass@k Oracle) rises sharply and plateaus near 0.817; the red line (circles; majority@k) climbs almost linearly to ~0.80; the diamond-marker line (short-3@k) peaks near 0.790 at ~70k tokens and then drifts down; the square-marker line (short-1@k) peaks near 0.756 at ~35k tokens and declines to ~0.707 with additional compute.
</details>
(c) QwQ-32B
<details>
<summary>x49.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, ~10-170k; accuracy ~0.82-0.89). Legend: pass@k (Oracle; black dotted, triangles), majority@k (red, circles), short-1@k (cyan, squares), short-3@k (cyan, diamonds). All series start near 0.825 at ~10k tokens. The oracle climbs to ~0.89 at ~170k; short-3@k rises quickly and plateaus near 0.878, matching majority@k's final accuracy (~0.874) with noticeably less compute; short-1@k peaks near 0.846 at ~70k tokens and then slowly declines.
</details>
(d) R1-670B
Figure 13: AIME 2025 - thinking compute comparison.
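The short-m@k and majority@k curves in these plots can be reproduced from sampled chains with a simple selection rule. Below is a minimal sketch, not the paper's implementation; the `chains` records and field names are illustrative. Each chain carries its thinking-token count and final answer; short-m@k keeps the m chains that finish first and majority-votes their answers, and the charged compute counts the tokens every parallel chain has generated by the time the m-th chain halts.

```python
from collections import Counter

def majority_vote(answers):
    """Most common answer; ties broken by earliest occurrence."""
    counts = Counter(answers)
    best = max(counts.values())
    return next(a for a in answers if counts[a] == best)

def short_m_at_k(chains, m):
    """short-m@k: run k chains in parallel, halt when the first m finish,
    and majority-vote over those m answers.  The cost is the number of
    thinking tokens all k chains have produced at the halt point."""
    ranked = sorted(chains, key=lambda c: c["tokens"])
    chosen = ranked[:m]
    cutoff = chosen[-1]["tokens"]  # the m-th chain finishes here
    # Every still-running chain has produced `cutoff` tokens when we stop.
    cost = sum(min(c["tokens"], cutoff) for c in chains)
    return majority_vote([c["answer"] for c in chosen]), cost

def majority_at_k(chains):
    """majority@k baseline: wait for all k chains, vote over every answer."""
    answers = [c["answer"] for c in chains]
    return majority_vote(answers), sum(c["tokens"] for c in chains)

# Illustrative sample: four parallel chains for one question.
chains = [
    {"tokens": 1200, "answer": "42"},
    {"tokens": 3500, "answer": "41"},
    {"tokens": 800,  "answer": "42"},
    {"tokens": 5000, "answer": "40"},
]
```

With m=3 on this sample, decoding halts once the third-shortest chain (3,500 tokens) finishes, so the longest chain is truncated at 3,500 tokens; the vote is taken over {42, 42, 41}, spending 9,000 thinking tokens rather than the 10,500 that majority@k over all four chains would require.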
<details>
<summary>x50.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, up to ~140k; accuracy ~0.33-0.64). Three series are visible, all starting near 0.33 at ~10k tokens. By marker style they match the figure legend: the black dotted line (triangles; pass@k Oracle) climbs steeply to ~0.64 at ~100k; the cyan line (squares; short-1@k) plateaus around 0.42 after ~60k; the red line (circles; majority@k) rises slowly but steadily, overtaking the cyan line near ~85k tokens and reaching ~0.44.
</details>
(a) LN-Super-49B
<details>
<summary>x51.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, ~10-130k; accuracy ~0.37-0.60). All four series start near 0.37 at ~10k tokens. By marker style they match the figure legend: the black dotted line (triangles; pass@k Oracle) rises steeply to ~0.60 at ~80k; the diamond-marker line (short-3@k) plateaus near 0.455 after ~70k; the square-marker line (short-1@k) plateaus near 0.415 and dips slightly at high compute; the red line (circles; majority@k) increases almost linearly, overtaking both short-m@k curves and reaching ~0.475 at ~130k.
</details>
(b) R1-32B
<details>
<summary>x52.png Details</summary>

Line chart of accuracy vs. thinking compute (thinking tokens in thousands, ~10-150k; accuracy ~0.48-0.75). All four series start near 0.48 at ~10k tokens. By marker style they match the figure legend: the black dotted line (triangles; pass@k Oracle) climbs steadily to ~0.75 at ~150k; the diamond-marker line (short-3@k) reaches ~0.60 at ~125k with diminishing returns; the square-marker line (short-1@k) tracks slightly below it (~0.575); the red line (circles; majority@k) rises most gradually, reaching roughly 0.59 at the right edge of the chart.
</details>
(c) QwQ-32B
<details>
<summary>x53.png Details</summary>

### Visual Description
Line graph of accuracy vs. thinking compute (thinking tokens in thousands, ~25-225k; accuracy ~0.675-0.875). Legend (bottom right): pass@k (Oracle; black dotted, triangles), majority@k (dark red, circles), short-1@k (cyan, squares), short-3@k (cyan, diamonds). A light gray grid aligns with the major tick marks on both axes.
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
1. **pass@k (Oracle) [Black dotted line, triangles]:**
* **Trend:** Steepest upward slope, showing the highest accuracy at all compute levels. The curve is concave, indicating diminishing returns as compute increases.
* **Data Points:**
* ~25k tokens: 0.680
* ~50k tokens: 0.765
* ~75k tokens: 0.805
* ~100k tokens: 0.845
* ~125k tokens: 0.860
* ~150k tokens: 0.875
* ~175k tokens: 0.885 (estimated, point is above the 0.875 grid line)
2. **short-3@k (Ours) [Cyan line, diamonds]:**
* **Trend:** Second steepest slope, closely following but consistently below the Oracle line. Also shows a concave curve.
* **Data Points:**
* ~25k tokens: 0.680
* ~50k tokens: 0.745
* ~75k tokens: 0.790
* ~100k tokens: 0.825
* ~125k tokens: 0.840
* ~150k tokens: 0.855
* ~175k tokens: 0.865
* ~200k tokens: 0.870
3. **short-1@k (Ours) [Cyan line, squares]:**
* **Trend:** Slope is less steep than short-3@k. It begins to plateau noticeably after ~100k tokens.
* **Data Points:**
* ~25k tokens: 0.680
* ~50k tokens: 0.745 (appears identical to short-3@k at this point)
* ~75k tokens: 0.780
* ~100k tokens: 0.805
* ~125k tokens: 0.815
* ~150k tokens: 0.825
* ~160k tokens: 0.830 (final data point for this series)
4. **majority@k [Red line, circles]:**
* **Trend:** Shallowest slope, showing the lowest accuracy at all compute levels. The curve is nearly linear, suggesting less pronounced diminishing returns within this range.
* **Data Points:**
* ~25k tokens: 0.680
* ~50k tokens: 0.725
* ~75k tokens: 0.750
* ~100k tokens: 0.765
* ~125k tokens: 0.775
* ~150k tokens: 0.785
* ~175k tokens: 0.795
* ~200k tokens: 0.805
* ~225k tokens: 0.815
### Key Observations
1. **Common Starting Point:** All four methods begin at approximately the same accuracy (~0.680) at the lowest compute level (~25k tokens).
2. **Performance Hierarchy:** A clear and consistent performance hierarchy is established very early (by ~50k tokens) and maintained throughout: `pass@k (Oracle)` > `short-3@k (Ours)` > `short-1@k (Ours)` > `majority@k`.
3. **Diminishing Returns:** All curves show diminishing returns (concave shape), but the degree varies. The `pass@k` and `short-3@k` curves continue to rise significantly even at high compute, while `short-1@k` plateaus more sharply.
4. **"Ours" Methods:** The two methods labeled "(Ours)" occupy the middle ground. `short-3@k` demonstrates a clear advantage over `short-1@k`, especially at higher compute levels (>100k tokens), suggesting that the "3" variant scales better.
5. **Oracle as Upper Bound:** The `pass@k (Oracle)` line acts as a performance ceiling or upper bound for the other methods.
### Interpretation
This chart likely comes from a research paper in machine learning or natural language processing, comparing different strategies for improving model accuracy by allocating more "thinking" or reasoning compute (measured in tokens).
* **What the data suggests:** The data demonstrates that investing more computational resources (thinking tokens) improves accuracy for all tested methods. However, the efficiency of this investment varies dramatically. The `pass@k (Oracle)` method represents an idealized or best-case scenario (possibly using ground-truth information or an exhaustive search), setting a theoretical maximum. The proposed methods (`short-1@k` and `short-3@k`) are being evaluated against this ceiling and a baseline (`majority@k`).
* **Relationship between elements:** The graph's core message is about **scaling efficiency**. It answers: "For a given budget of thinking tokens, which method yields the highest accuracy?" The authors' method `short-3@k` is shown to be more efficient than `short-1@k` and the `majority@k` baseline, as it achieves higher accuracy for the same compute, or the same accuracy for less compute.
* **Notable implications:** The plateau of `short-1@k` suggests it may hit a fundamental limit in its ability to leverage additional compute. In contrast, `short-3@k`'s continued growth implies its design is better suited for scaling. The significant gap between all methods and the Oracle line indicates there is still substantial room for improvement in algorithmic efficiency for this task. The chart argues for the effectiveness of the authors' `short-3@k` approach as a better trade-off between computational cost and performance.
</details>
(d) R1-670B
Figure 14: HMMT Feb 2025 - thinking compute comparison.
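The short-m@k variants compared in these panels are defined in the abstract: launch k generations in parallel, halt once the first m thinking chains finish, and majority-vote among those m. A minimal sketch of that selection step (the function name and the tuple encoding of chains are illustrative, not the paper's actual code):

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Majority-vote over the m chains that finish first.

    chains: list of (thinking_tokens, answer) tuples, one per parallel
    generation; fewer thinking tokens serves as a proxy for finishing
    earlier.
    """
    # The first m chains to finish are those with the fewest thinking tokens.
    first_done = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_done)
    return votes.most_common(1)[0][0]

# Example: 5 parallel chains; the 3 shortest vote 2-1 for "42".
chains = [(800, "42"), (1200, "17"), (950, "42"), (4000, "17"), (5200, "99")]
print(short_m_at_k(chains, m=3))  # prints 42
```

With m=1 this reduces to taking the single shortest chain's answer; with m=k it degenerates to full majority voting over all chains.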
<details>
<summary>x54.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing the performance of different models or configurations, parameterized by a variable 'k'. It plots **Accuracy** (y-axis) against **Time-to-Answer** (x-axis), measured in thousands of units (likely tokens or steps). The data is segmented into three distinct series, differentiated by color and marker shape, each representing a different model or method. Each data point is explicitly labeled with its corresponding 'k' value.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.575 to 0.775.
* **Major Ticks:** 0.575, 0.600, 0.625, 0.650, 0.675, 0.700, 0.725, 0.750, 0.775.
* **X-Axis (Horizontal):**
* **Label:** `Time-to-Answer (longest thinking in thousands)`
* **Scale:** Linear, ranging from approximately 7 to 18.
* **Major Ticks:** 8, 10, 12, 14, 16, 18.
* **Data Series (Inferred Legend):**
* **Series 1:** Cyan squares (■). Positioned on the left side of the chart.
* **Series 2:** Cyan diamonds (◆). Positioned in the middle of the chart.
* **Series 3:** Red circles (●). Positioned on the right side of the chart.
* **Data Point Annotations:** Each marker is accompanied by a text label indicating the 'k' value (e.g., `k=9`).
### Detailed Analysis
**Data Points (Approximate Coordinates & Labels):**
* **Cyan Square Series (Left Cluster):**
* Point 1: (x ≈ 7.5, y ≈ 0.750), Label: `k=9`
* Point 2: (x ≈ 8.0, y ≈ 0.715), Label: `k=5`
* Point 3: (x ≈ 9.0, y ≈ 0.675), Label: `k=3`
* **Cyan Diamond Series (Middle Cluster):**
* Point 4: (x ≈ 10.0, y ≈ 0.770), Label: `k=9`
* Point 5: (x ≈ 11.5, y ≈ 0.730), Label: `k=5`
* Point 6: (x ≈ 12.0, y ≈ 0.570), Label: `k=1`
* Point 7: (x ≈ 15.0, y ≈ 0.685), Label: `k=3`
* **Red Circle Series (Right Cluster):**
* Point 8: (x ≈ 15.0, y ≈ 0.620), Label: `k=3`
* Point 9: (x ≈ 16.5, y ≈ 0.660), Label: `k=5`
* Point 10: (x ≈ 18.0, y ≈ 0.705), Label: `k=9`
**Visual Trends per Series:**
* **Cyan Squares:** Shows a clear **downward trend**. As Time-to-Answer increases from ~7.5 to ~9, Accuracy decreases from ~0.750 to ~0.675.
* **Cyan Diamonds:** Shows a **non-monotonic trend**. Accuracy peaks at the highest 'k' value (k=9, y ≈ 0.770) at a moderate time (x ≈ 10). It then drops significantly for k=5 and k=3, with a severe outlier at k=1 (lowest accuracy, y ≈ 0.570) at x ≈ 12.
* **Red Circles:** Shows a clear **upward trend**. As Time-to-Answer increases from ~15 to ~18, Accuracy increases from ~0.620 to ~0.705.
### Key Observations
1. **Performance Clusters:** The three series occupy distinct regions of the time-accuracy space. Cyan squares are fast but mid-accuracy, cyan diamonds are mid-speed with high variance, and red circles are slow but show improving accuracy.
2. **Impact of 'k':** Within each series, higher 'k' values generally correlate with higher accuracy, with the notable exception of the cyan diamond series where k=1 is a drastic outlier.
3. **Trade-off Visualization:** The chart illustrates a complex trade-off. The fastest method (cyan squares) sacrifices peak accuracy. The method with the highest observed accuracy (cyan diamond, k=9) requires moderate time. The slowest method (red circles) starts with lower accuracy but improves with more time.
4. **Outlier:** The data point for the cyan diamond series at `k=1` (x ≈ 12, y ≈ 0.570) is a significant outlier, having the lowest accuracy on the chart despite not having the shortest time.
### Interpretation
This chart likely compares different reasoning or search strategies (parameterized by 'k', possibly the number of candidates or steps) for an AI model. The data suggests:
* **No Single Best Strategy:** There is a Pareto frontier. The choice of optimal 'k' and underlying method depends on the priority: minimizing latency (choose cyan squares with k=9) or maximizing accuracy (choose cyan diamonds with k=9, if the ~10k time cost is acceptable).
* **Method Efficiency:** The cyan square method is the most time-efficient for a given accuracy level in its range. The red circle method appears to be a different paradigm that benefits from more "thinking" time, showing a positive scaling law within its observed range.
* **The 'k=1' Anomaly:** The poor performance of k=1 in the cyan diamond series suggests that some minimal number of candidates (k>1) is crucial for that method's effectiveness. With k=1 only a single chain is sampled, so there is nothing to aggregate and accuracy falls back to single-sample performance.
* **Underlying Mechanism:** The separation of clusters implies the three series represent fundamentally different algorithms or model architectures, not just parameter tweaks. The cyan diamond method has the highest potential ceiling but also the highest variance and risk of failure (as seen with k=1).
In summary, the visualization provides a technical comparison for system designers to select a model configuration based on their specific constraints for speed and accuracy, highlighting that increased computational time does not universally guarantee better performanceβit depends heavily on the chosen method.
</details>
(a) LN-Super-49B
<details>
<summary>x55.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different k Values
### Overview
The image is a scatter plot comparing the performance of different methods or models, parameterized by a variable `k`, across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). The plot displays three distinct series of data points, differentiated by marker shape and color, each containing points for `k=3`, `k=5`, and `k=9`. An additional outlier point for `k=1` is present. The chart suggests a trade-off between accuracy and computational time, with different methods exhibiting different efficiency profiles.
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from approximately 5 to 17, with major tick marks at 6, 8, 10, 12, 14, and 16. The unit is implied to be thousands of some time unit (e.g., milliseconds, steps).
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from 0.72 to 0.84, with major tick marks at 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, and 0.84.
* **Data Series & Legend:** There is no separate legend box. The series are identified by their distinct marker shapes and colors, with each data point labeled directly with its corresponding `k` value.
* **Series 1 (Cyan Diamonds):** Points are labeled `k=1`, `k=3`, `k=5`, `k=9`.
* **Series 2 (Cyan Squares):** Points are labeled `k=3`, `k=5`, `k=9`.
* **Series 3 (Red Circles):** Points are labeled `k=3`, `k=5`, `k=9`.
* **Grid:** A light gray grid is present, aiding in the estimation of data point coordinates.
### Detailed Analysis
**Data Point Extraction (Approximate Coordinates):**
| Series (Marker) | Label | X (Time-to-Answer) | Y (Accuracy) | Spatial Position (Relative) |
| :--- | :--- | :--- | :--- | :--- |
| **Cyan Diamond** | k=1 | ~10.0 | ~0.71 | Bottom-center |
| **Cyan Diamond** | k=3 | ~13.0 | ~0.80 | Center-right |
| **Cyan Diamond** | k=5 | ~9.5 | ~0.82 | Upper-center |
| **Cyan Diamond** | k=9 | ~7.5 | ~0.84 | Top-left |
| **Cyan Square** | k=3 | ~7.0 | ~0.775 | Center-left |
| **Cyan Square** | k=5 | ~6.0 | ~0.79 | Center-left |
| **Cyan Square** | k=9 | ~5.5 | ~0.80 | Left |
| **Red Circle** | k=3 | ~13.0 | ~0.77 | Center-right |
| **Red Circle** | k=5 | ~15.0 | ~0.81 | Right |
| **Red Circle** | k=9 | ~16.5 | ~0.83 | Top-right |
**Trend Verification per Series:**
* **Cyan Diamonds:** The trend is **non-linear and concave**. Accuracy increases sharply from `k=1` to `k=3`, then more gradually to `k=9`. Time-to-Answer is lowest for `k=9` and highest for `k=3`, with `k=1` and `k=5` in between. This series achieves the highest overall accuracy (`k=9`).
* **Cyan Squares:** The trend shows **modest, linear improvement**. Accuracy increases slightly as `k` increases from 3 to 9. Time-to-Answer **decreases** as `k` increases, showing improved efficiency with higher `k` for this method.
* **Red Circles:** The trend shows **steady, linear improvement**. Both Accuracy and Time-to-Answer **increase** as `k` increases from 3 to 9. This series shows the highest Time-to-Answer for each corresponding `k` value compared to the other series.
### Key Observations
1. **Performance Hierarchy:** For a given `k` value (e.g., `k=9`), the **Cyan Diamond** method achieves the highest accuracy, followed by the **Red Circle** method, and then the **Cyan Square** method.
2. **Efficiency Trade-off:** The **Cyan Square** method is the most time-efficient (lowest Time-to-Answer for each `k`), but at the cost of lower accuracy. The **Red Circle** method is the least time-efficient.
3. **The `k=1` Outlier:** The single **Cyan Diamond** point for `k=1` is a significant outlier, showing very low accuracy (~0.71) compared to all other points, which are above 0.77. Its Time-to-Answer (~10) is moderate.
4. **Convergence at High `k`:** At `k=9`, the accuracy gap between the **Cyan Diamond** (~0.84) and **Red Circle** (~0.83) methods narrows, while the **Cyan Square** method lags behind (~0.80).
### Interpretation
This chart visualizes the **Pareto frontier** of a multi-objective optimization problem: maximizing accuracy while minimizing computational cost (time). The three series likely represent three different algorithms, model architectures, or search strategies.
* The **Cyan Diamond** series demonstrates a **highly effective but non-scalable strategy**. It achieves peak accuracy with moderate `k` values (`k=5,9`) but suffers a catastrophic drop at `k=1`, suggesting a minimum complexity threshold for it to function. Its best performance (`k=9`) is also the fastest, indicating a potential "sweet spot."
* The **Cyan Square** series represents a **fast, lightweight, but less accurate** approach. Its time cost *decreases* as `k` increases, which is consistent with a first-to-finish scheme: among more parallel chains, the shortest one completes sooner, so higher `k` lowers the time-to-answer.
* The **Red Circle** series shows a **classic, predictable scaling law**: investing more time (higher `k`) yields proportionally better accuracy. It is a reliable but computationally expensive method.
The overarching insight is that the choice of method depends on the operational constraints. If maximum accuracy is critical and compute time is secondary, the **Cyan Diamond** method at `k=9` is optimal. If speed is paramount and a slight accuracy loss is acceptable, the **Cyan Square** method is preferable. The **Red Circle** method offers a predictable, linear trade-off but is dominated in efficiency by the other two. The `k=1` point for the diamond method serves as a crucial warning about its minimum viable operating point.
</details>
(b) R1-32B
<details>
<summary>x56.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing the performance of different models or configurations, labeled by a parameter 'k', across two metrics: Accuracy (y-axis) and Time-to-Answer (x-axis). The plot uses three distinct marker shapes and colors to represent different series, with each data point annotated with its specific 'k' value.
### Components/Axes
* **X-Axis:**
* **Title:** `Time-to-Answer (longest thinking in thousands)`
* **Scale:** Linear, ranging from approximately 9 to 19.
* **Major Tick Marks:** 10, 12, 14, 16, 18.
* **Y-Axis:**
* **Title:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.795 to 0.875.
* **Major Tick Marks:** 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87.
* **Data Series (Inferred from Marker Shape/Color):**
1. **Cyan Diamonds:** A series of data points.
2. **Blue Squares:** A series of data points.
3. **Red Circles:** A series of data points.
4. **Cyan Star/Explosion:** A single, distinct data point.
* **Data Point Labels:** Each marker is accompanied by a text label indicating its 'k' value (e.g., `k=9`, `k=5`).
### Detailed Analysis
**Data Points and Approximate Coordinates:**
The following table lists each data point by its inferred series, label, and approximate (x, y) coordinates based on visual estimation.
| Series (Marker) | Label | Approx. Time-to-Answer (x) | Approx. Accuracy (y) |
| :--- | :--- | :--- | :--- |
| Cyan Diamond | k=9 | 11.5 | 0.875 |
| Cyan Diamond | k=5 | 13.5 | 0.860 |
| Cyan Diamond | k=3 | 16.5 | 0.846 |
| Blue Square | k=9 | 9.5 | 0.860 |
| Blue Square | k=5 | 10.2 | 0.850 |
| Blue Square | k=3 | 11.0 | 0.840 |
| Red Circle | k=9 | 19.0 | 0.850 |
| Red Circle | k=5 | 17.5 | 0.840 |
| Red Circle | k=3 | 16.5 | 0.825 |
| Cyan Star | k=1 | 13.7 | 0.797 |
**Trend Verification by Series:**
* **Cyan Diamonds:** The trend slopes **downward** from left to right. As Time-to-Answer increases (from ~11.5 to ~16.5), Accuracy decreases (from ~0.875 to ~0.846).
* **Blue Squares:** The trend slopes **downward** from left to right. As Time-to-Answer increases (from ~9.5 to ~11.0), Accuracy decreases (from ~0.860 to ~0.840).
* **Red Circles:** The trend slopes **upward** from left to right. As Time-to-Answer increases (from ~16.5 to ~19.0), Accuracy increases (from ~0.825 to ~0.850).
* **Cyan Star (k=1):** This is a single outlier point with the lowest accuracy (~0.797) at a moderate Time-to-Answer (~13.7).
### Key Observations
1. **Performance Trade-off:** For the Cyan Diamond and Blue Square series, there is a clear trade-off: higher accuracy is associated with lower time-to-answer (faster response). The Red Circle series shows the inverse relationship.
2. **Speed vs. Accuracy Clusters:** The Blue Square points are the fastest (lowest x-values, ~9.5-11.0) but have mid-range accuracy. The Red Circle points are the slowest (highest x-values, ~16.5-19.0) with variable accuracy. The Cyan Diamond points span the middle of the time range.
3. **Impact of 'k':** Within each colored series, a higher 'k' value generally corresponds to higher accuracy. For example, in the Cyan Diamonds: k=9 (0.875) > k=5 (0.860) > k=3 (0.846). This pattern holds for the Blue Squares and Red Circles as well.
4. **Outlier:** The single Cyan Star point labeled `k=1` is a significant outlier, having the worst accuracy by a large margin despite a moderate time cost.
### Interpretation
This chart likely evaluates different algorithmic approaches or model configurations (represented by the three marker series) for a task requiring both speed and correctness. The parameter 'k' appears to be a complexity or resource parameter (e.g., number of candidates considered, ensemble size).
* The **Blue Square** method is optimized for **speed**, delivering the fastest answers but with a ceiling on accuracy that improves with higher 'k'.
* The **Red Circle** method is optimized for **accuracy**, but this comes at a significant time cost, and its accuracy also improves with higher 'k'.
* The **Cyan Diamond** method represents a **middle ground**, offering the highest peak accuracy (at k=9) for a moderate time investment.
* The `k=1` point suggests that a minimal configuration of the Cyan method is ineffective, indicating that a certain threshold of complexity ('k') is necessary for viable performance.
The data demonstrates that there is no single "best" configuration; the optimal choice depends on whether the priority is minimizing response time, maximizing accuracy, or balancing both. The consistent positive correlation between 'k' and accuracy within each series suggests that increasing this parameter reliably improves performance at the cost of increased time-to-answer.
</details>
(c) QwQ-32B
<details>
<summary>x57.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods
### Overview
The image is a scatter plot comparing the performance of three different methods ("majority@k", "short-1@k (Ours)", and "short-3@k (Ours)") across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). Each data point is labeled with a specific "k" value (k=1, 3, 5, 9), representing a parameter for the method. The chart illustrates the trade-off between computational time (thinking time) and answer accuracy.
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from approximately 12 to 22 (in thousands). Major gridlines are at 12, 14, 16, 18, 20.
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from approximately 0.84 to 0.92. Major gridlines are at 0.84, 0.86, 0.88, 0.90, 0.92.
* **Legend:** Located in the **bottom-right quadrant** of the chart area.
* **Red Circle:** `majority@k`
* **Blue Square:** `short-1@k (Ours)`
* **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Labels:** Each marker is annotated with text indicating its "k" value (e.g., "k=9").
### Detailed Analysis
The plot contains nine distinct data points, three for each method.
**1. `short-1@k (Ours)` - Blue Squares**
* **Trend:** This series is clustered on the **left side** of the chart, indicating consistently lower Time-to-Answer. Accuracy varies moderately.
* **Data Points:**
* **k=9:** Positioned at approximately **(12.2, 0.875)**.
* **k=5:** Positioned at approximately **(13.2, 0.881)**.
* **k=3:** Positioned at approximately **(14.2, 0.875)**.
**2. `short-3@k (Ours)` - Cyan Diamonds**
* **Trend:** This series shows a **clear downward trend** in Accuracy as Time-to-Answer increases. The highest accuracy point is also the fastest for this method.
* **Data Points:**
* **k=9:** Positioned at approximately **(15.2, 0.922)**. This is the highest accuracy point on the entire chart.
* **k=5:** Positioned at approximately **(17.2, 0.913)**.
* **k=3:** Positioned at approximately **(19.2, 0.894)**.
* **k=1:** Positioned at approximately **(16.8, 0.838)**. This is the lowest accuracy point on the chart and an outlier for this series, breaking the smooth downward trend.
**3. `majority@k` - Red Circles**
* **Trend:** This series shows a **clear upward trend** in Accuracy as Time-to-Answer increases.
* **Data Points:**
* **k=9:** Positioned at approximately **(21.2, 0.919)**. This is the point with the highest Time-to-Answer.
* **k=5:** Positioned at approximately **(20.2, 0.886)**.
* **k=3:** Positioned at approximately **(19.2, 0.863)**.
### Key Observations
1. **Performance Trade-off:** The `short-3@k` and `majority@k` series trend in opposite directions. `short-3@k` achieves higher accuracy with *less* time at larger k (k=9, 5), while `majority@k` requires significantly more time to reach comparable accuracy levels.
2. **Efficiency Leader:** The `short-3@k (k=9)` point is the most efficient, achieving the highest overall accuracy (~0.922) with a moderate Time-to-Answer (~15.2k).
3. **Speed Leader:** The `short-1@k` methods are the fastest, all with Time-to-Answer below 15k, but their accuracy is capped around 0.88.
4. **Outlier:** The `short-3@k (k=1)` point is a significant outlier. It has very low accuracy (~0.838) despite a moderate Time-to-Answer (~16.8k), suggesting the method fails or performs poorly with this parameter setting.
5. **Parameter Sensitivity:** All methods show sensitivity to the 'k' parameter, but the direction of the effect on accuracy differs between methods.
### Interpretation
This chart likely evaluates different strategies for a multi-step reasoning or verification task (e.g., in AI or machine learning), where 'k' could represent the number of reasoning paths, votes, or attempts.
* **`short-1@k` and `short-3@k (Ours)`** appear to be proposed, more efficient methods. `short-3@k` in particular demonstrates a superior accuracy-time Pareto frontier for k=5 and k=9, suggesting it is a more effective strategy than the baseline `majority@k` when given a moderate time budget.
* **`majority@k`** represents a baseline, possibly a simple voting or ensemble method. Its upward trend indicates that throwing more computation (time) at it reliably improves accuracy, but it is inefficient compared to the proposed methods.
* The **`short-3@k (k=1)` outlier** is informative. With k=1 only a single chain is sampled, so there is no set of finished chains to vote over, and accuracy reverts to single-sample performance.
* **Overall Implication:** The data suggests the authors' `short-3@k` method offers a better balance, achieving state-of-the-art accuracy with lower computational cost than a majority-vote baseline, provided the parameter 'k' is set appropriately (k > 1). The choice between `short-1@k` and `short-3@k` would depend on whether the priority is absolute speed (`short-1`) or higher accuracy within a reasonable time (`short-3`).
</details>
(d) R1-670B
Figure 15: AIME 2024 - time-to-answer comparison.
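A recurring pattern across these time-to-answer panels is that the short-m@k points move left (faster) as k grows: because the method halts at the first m of k parallel chains, sampling more chains makes the m-th finisher arrive earlier. That order-statistics intuition can be checked with a small Monte Carlo sketch; the lognormal length distribution below is an illustrative assumption, not fitted to the paper's data:

```python
import random

def expected_mth_shortest(k, m, trials=2000, seed=0):
    """Monte Carlo estimate of the m-th shortest thinking length among
    k i.i.d. chains (a proxy for short-m@k time-to-answer).

    Chain lengths are drawn from a lognormal distribution purely for
    illustration; real length distributions are task- and model-specific.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lengths = sorted(rng.lognormvariate(9.0, 0.5) for _ in range(k))
        total += lengths[m - 1]  # the m-th chain to finish gates the answer
    return total / trials

# Time-to-answer for short-1@k shrinks as k grows: the minimum of more
# parallel draws is smaller in expectation.
for k in (3, 5, 9):
    print(k, round(expected_mth_shortest(k, m=1)))
```

The same simulation with m=3 shows the effect for short-3@k, only shifted upward, since the third-shortest of k chains still shortens as k grows.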
<details>
<summary>x58.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different k-Values
### Overview
The image is a scatter plot comparing the performance of different models or configurations, parameterized by a variable `k`. It plots "Accuracy" on the vertical axis against "Time-to-Answer" on the horizontal axis. The data points are distinguished by three different marker shapes and colors, each associated with specific `k` values. The chart illustrates a trade-off between accuracy and computational time.
### Components/Axes
* **X-Axis:** Labeled "Time-to-Answer (longest thinking in thousands)". The scale runs from approximately 8 to 18, with major grid lines at intervals of 2 (8, 10, 12, 14, 16, 18). The unit is "thousands," implying the values represent thousands of some time unit (e.g., milliseconds, steps).
* **Y-Axis:** Labeled "Accuracy". The scale runs from approximately 0.52 to 0.66, with major grid lines at intervals of 0.02 (0.52, 0.54, 0.56, 0.58, 0.60, 0.62, 0.64, 0.66).
* **Data Series & Legend:** There is no separate legend box. The series are identified by marker shape, color, and direct labels (`k=1`, `k=3`, `k=5`, `k=9`) placed next to each point.
* **Series 1 (Cyan Squares):** Labeled with `k=3`, `k=5`, `k=9`.
* **Series 2 (Cyan Diamonds):** Labeled with `k=1`, `k=3`, `k=5`, `k=9`.
* **Series 3 (Red Circles):** Labeled with `k=3`, `k=5`, `k=9`.
### Detailed Analysis
**Data Point Extraction (Approximate Coordinates):**
* **Cyan Square Series:**
* `k=9`: Position ~ (8.0, 0.627). Top-left quadrant.
* `k=5`: Position ~ (8.5, 0.615). Left of center.
* `k=3`: Position ~ (9.5, 0.593). Left of center, lower than k=5.
* **Cyan Diamond Series:**
* `k=9`: Position ~ (10.2, 0.652). Top-center. Highest accuracy point for cyan markers.
* `k=5`: Position ~ (12.0, 0.635). Center.
* `k=3`: Position ~ (15.0, 0.600). Right of center.
* `k=1`: Position ~ (12.2, 0.505). Bottom-center. This is the lowest accuracy point on the entire chart.
* **Red Circle Series:**
* `k=9`: Position ~ (17.8, 0.658). Top-right corner. Highest accuracy point on the chart.
* `k=5`: Position ~ (16.5, 0.622). Right of center.
* `k=3`: Position ~ (15.0, 0.558). Right of center, significantly lower than its k=5 and k=9 points.
**Trend Verification:**
* **Cyan Squares:** As `k` increases from 3 to 9, Accuracy increases (from ~0.593 to ~0.627) while Time-to-Answer *decreases* (from ~9.5 to ~8.0): the point labeled k=9 is the leftmost. Falling time at higher `k` is consistent with a first-to-finish scheme, where more parallel chains make the earliest finisher arrive sooner.
* **Cyan Diamonds:** As `k` increases from 1 to 9, Accuracy shows a clear upward trend (from ~0.505 to ~0.652). Time-to-Answer also generally increases, with the `k=1` point being an outlier at a moderate time (~12.2) but very low accuracy.
* **Red Circles:** As `k` increases from 3 to 9, Accuracy increases sharply (from ~0.558 to ~0.658). Time-to-Answer also increases consistently (from ~15.0 to ~17.8).
### Key Observations
1. **Accuracy vs. k:** For both the Cyan Diamond and Red Circle series, higher `k` values are strongly associated with higher accuracy.
2. **Time Cost:** Higher `k` values generally require more Time-to-Answer, most clearly seen in the Red Circle series. The Cyan Square series is an exception, where the highest accuracy point (`k=9`) has the lowest time.
3. **Performance Tiers:** The Red Circle series at `k=9` achieves the highest overall accuracy (~0.658) but at the highest time cost (~17.8). The Cyan Diamond series at `k=9` is a close second in accuracy (~0.652) with significantly lower time (~10.2).
4. **Outlier:** The Cyan Diamond `k=1` point is a major outlier, showing drastically lower accuracy (~0.505) than all other points, despite having a moderate Time-to-Answer (~12.2).
5. **Clustering:** Points with the same `k` value but different markers (shapes/colors) are often far apart. For example, the three `k=9` points are in completely different regions of the plot, indicating that the marker type represents a fundamental difference in model or method, not just the `k` parameter.
### Interpretation
This chart visualizes the **efficiency-accuracy frontier** for different algorithmic approaches (represented by marker shape/color). The data suggests:
* **Method Comparison:** The method represented by **Cyan Diamonds** appears to be the most efficient for high accuracy. It reaches near-peak accuracy (`k=9`) at a Time-to-Answer of ~10.2, which is much faster than the Red Circle method's peak.
* **Scalability:** The **Red Circle** method scales poorly with `k` in terms of time; increasing `k` yields accuracy gains but at a steep, linear increase in computational cost.
* **The `k` Parameter:** Within each method, increasing `k` (which could represent model size, number of reasoning steps, or ensemble size) reliably improves accuracy, confirming its role as a key performance lever.
* **Anomaly Investigation:** The Cyan Square `k=9` point's position (high accuracy, low time) is consistent with a first-to-finish scheme: with more parallel chains, the shortest chain completes sooner, so time falls as `k` rises. The catastrophic drop of the Cyan Diamond method at `k=1` likewise reflects a degenerate setting with a single chain and no vote to aggregate.
* **Practical Implication:** A user must choose a method and `k` value based on their priority. For real-time applications with strict time limits, the Cyan Square or lower-`k` Cyan Diamond methods are preferable. For offline tasks where maximum accuracy is critical, the Red Circle `k=9` or Cyan Diamond `k=9` are the best candidates, with the latter being more time-efficient.
</details>
(a) LN-Super-49B
<details>
<summary>x59.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, ~0.54-0.65) against Time-to-Answer in thousands of thinking tokens (x-axis, ~7-19). Each point is labeled with its k value; the three marker styles correspond to the three methods (consistent with the legend shown in panel (d): red/maroon circles for majority@k, squares for short-1@k, diamonds for short-3@k).
* **Squares (short-1@k):** fastest cluster; larger k lowers time and raises accuracy, from ~0.601 (k=3, ~9.0) to ~0.623 (k=9, ~7.5).
* **Diamonds (short-3@k):** moderate time, same pattern: ~0.621 (k=3, ~15.8) up to ~0.639 (k=9, ~9.8). The k=1 point (~12.5, ~0.543) is a clear low-accuracy outlier.
* **Circles (majority@k):** slowest; accuracy climbs with k from ~0.594 (k=3, ~15.8) to ~0.650 (k=9, ~19.0), the highest value on the chart at the highest time cost.
Overall, the diamonds offer the best accuracy per unit of time, the squares trade some accuracy for the lowest latency, and the circles buy the top accuracy with the longest thinking time.
</details>
(b) R1-32B
<details>
<summary>x60.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, ~0.71-0.80) against Time-to-Answer in thousands of thinking tokens (x-axis, ~11-21), with points labeled by k. Marker styles follow the figure's convention (consistent with the legend in panel (d): red circles for majority@k, squares for short-1@k, diamonds for short-3@k).
* **Squares (short-1@k):** fastest cluster (~11.5-13.0); here larger k shortens time but lowers accuracy, from ~0.756 (k=3) to ~0.715 (k=9).
* **Diamonds (short-3@k):** moderate time, with an accuracy peak near k=5 (~15.5, ~0.790); k=9 at (~13.8, ~0.784), k=3 at (~18.5, ~0.780).
* **Circles (majority@k):** slowest cluster; accuracy rises with k from ~0.773 (k=3, ~18.5) to ~0.800 (k=9, ~21.0), the highest on the chart.
* The k=1 point (~15.8, ~0.723) is an outlier with among the lowest accuracies at a moderate time cost.
Overall, the diamonds again sit near the efficiency sweet spot, while the circles reach slightly higher accuracy only at the longest thinking times.
</details>
(c) QwQ-32B
<details>
<summary>x61.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, 0.83-0.88) against Time-to-Answer in thousands of thinking tokens (x-axis, ~15-23). A legend identifies the series: majority@k (dark red circles), short-1@k (blue squares), and short-3@k (cyan diamonds); each point is labeled with its k value.
* **short-1@k:** fastest and least accurate: ~0.843-0.846 at ~15.0-16.2 for k=3/5/9, with k=1 an outlier at (~18.3, ~0.825).
* **short-3@k:** highest accuracy at moderate time, improving with k: ~0.868 (k=3, ~20.8) up to ~0.878 (k=9, ~17.0), the best point on the chart.
* **majority@k:** slowest, with accuracy rising roughly linearly in k from ~0.854 (k=3, ~20.8) to ~0.874 (k=9, ~22.5).
At matched time (~20.8, k=3), short-3@k beats majority@k by ~1.4 accuracy points, and short-3@9 exceeds majority@9 while finishing substantially earlier. The collapse of short-1@k at k=1 suggests a minimum number of sampled chains is required for the method to be effective.
</details>
(d) R1-670B
Figure 16: AIME 2025 - time-to-answer comparison.
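The short-m@k selection evaluated in these plots can be sketched in a few lines. The following is a minimal simulation, not the paper's implementation: in real inference all k generations run in parallel and are halted once the m-th thinking process finishes, whereas here each chain's thinking-token count simply stands in for its completion order, and the tie-breaking rule is an assumption of this sketch.

```python
from collections import Counter

def short_m_at_k(generations, m):
    """Majority-vote over the m chains (out of k) that finish thinking first.

    `generations` is a list of (thinking_tokens, answer) pairs; the chain
    with the fewest thinking tokens is taken to be the first to finish.
    """
    # The m "earliest" chains are the m shortest ones.
    finished_first = sorted(generations, key=lambda g: g[0])[:m]
    votes = Counter(answer for _, answer in finished_first)
    # In CPython, most_common breaks count ties by first insertion,
    # i.e. toward the answer of the shorter chain.
    return votes.most_common(1)[0][0]

# k=5 sampled chains: (thinking tokens, final answer)
chains = [(9800, "42"), (7200, "42"), (15500, "17"), (8100, "35"), (12000, "42")]
print(short_m_at_k(chains, m=1))  # short-1@5: answer of the shortest chain -> "42"
print(short_m_at_k(chains, m=3))  # short-3@5: majority of the 3 shortest   -> "42"
```

With m=1 this reduces to "take the first chain to finish"; m=3 adds a small majority vote among early finishers, which is what the short-3@k points in the plots correspond to.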
<details>
<summary>x62.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, 0.32-0.44) against Time-to-Answer in thousands of thinking tokens (x-axis, ~9-20), with points labeled by k. Marker styles follow the figure's convention (red circles for majority@k, squares for short-1@k, diamonds for short-3@k).
* **Squares (short-1@k):** fastest; larger k lowers time and raises accuracy, from ~0.387 (k=3, ~10.8) to ~0.420 (k=9, ~9.0).
* **Diamonds (short-3@k):** moderate time, same pattern: ~0.392 (k=3, ~16.8) up to ~0.415 (k=9, ~11.5).
* **Circles (majority@k):** slowest; accuracy rises with k from ~0.357 (k=3, ~16.8) to ~0.440 (k=9, ~19.8), the highest point on the chart.
* The k=1 point (~13.8, ~0.325) is the lowest-accuracy outlier.
Overall, the short-selection variants reach competitive accuracy at far lower time, while majority@k buys its peak accuracy with roughly twice the thinking time of short-1@9.
</details>
(a) LN-Super-49B
<details>
<summary>x63.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, ~0.37-0.47) against Time-to-Answer in thousands of thinking tokens (x-axis, ~7-21), with points labeled by k. Marker styles follow the figure's convention (red circles for majority@k, squares for short-1@k, diamonds for short-3@k).
* **Squares (short-1@k):** fastest cluster (~7.5-9.5) with nearly flat accuracy (~0.412-0.418).
* **Diamonds (short-3@k):** accuracy rises with k as time falls, from ~0.428 (k=3, ~16.5) to ~0.456 (k=9, ~10.5); the k=1 point (~13.0, ~0.370) is a low-accuracy outlier.
* **Circles (majority@k):** slowest; accuracy climbs steeply with k from ~0.400 (k=3, ~17.0) to ~0.470 (k=9, ~20.5), the highest point on the chart.
Overall, short-3@9 captures most of majority@9's accuracy at roughly half the time-to-answer, while short-1@k trades further accuracy for the lowest latency.
</details>
(b) R1-32B
<details>
<summary>x64.png Details</summary>

### Visual Description
Scatter plot of Accuracy (y-axis, 0.48-0.60) against Time-to-Answer in thousands of thinking tokens (x-axis, 12-24), with points labeled by k. Marker styles follow the figure's convention (red circles for majority@k, squares for short-1@k, diamonds for short-3@k).
* **Squares (short-1@k):** fastest; accuracy rises with k as time falls, from ~0.535 (k=3, ~14.8) to ~0.572 (k=9, ~12.5).
* **Diamonds (short-3@k):** same pattern at moderate time: ~0.544 (k=3, ~21.0) up to ~0.595 (k=9, ~15.8), the highest accuracy on the chart.
* **Circles (majority@k):** slowest; accuracy climbs with k from ~0.505 (k=3, ~21.0) to ~0.582 (k=9, ~23.5).
* The k=1 point (~18.0, ~0.478) is the lowest-accuracy outlier.
Here short-3@9 tops majority@9 on accuracy while using roughly two-thirds of its time-to-answer.
</details>
(c) QwQ-32B
<details>
<summary>x65.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods
### Overview
The image is a scatter plot comparing the performance of three different methods or models on a task. The plot visualizes the trade-off between **Accuracy** (y-axis) and **Time-to-Answer** (x-axis, measured in thousands of thinking steps). Each data point represents a specific configuration of a method, labeled with a parameter `k`. The chart suggests an analysis of efficiency versus effectiveness for different algorithmic approaches.
### Components/Axes
* **Chart Type:** Scatter plot with three distinct data series.
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from approximately 0.675 to 0.850, with major gridlines at intervals of 0.025 (0.675, 0.700, 0.725, etc.).
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from 16 to 26, with major gridlines at intervals of 2 (16, 18, 20, etc.).
* **Legend:** Located in the **bottom-right quadrant** of the chart area. It defines three series:
1. **Red Circle:** `majority@k`
2. **Blue Square:** `short-1@k (Ours)`
3. **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Labels:** Each marker is annotated with text indicating its `k` value (e.g., `k=9`, `k=5`, `k=3`, `k=1`).
### Detailed Analysis
The plot contains ten distinct data points: three each for `majority@k` and `short-1@k`, and four for `short-3@k`.
**1. Series: `majority@k` (Red Circles)**
* **Trend:** Shows a **positive correlation**; as Time-to-Answer increases, Accuracy generally increases.
* **Data Points:**
* **Point 1:** `k=3`. Position: ~24.5 on x-axis, ~0.725 on y-axis.
* **Point 2:** `k=5`. Position: ~25.5 on x-axis, ~0.765 on y-axis.
* **Point 3:** `k=9`. Position: ~26.5 on x-axis, ~0.805 on y-axis.
**2. Series: `short-1@k (Ours)` (Blue Squares)**
* **Trend:** Shows a **negative correlation**; as Time-to-Answer increases, Accuracy decreases.
* **Data Points:**
* **Point 1:** `k=9`. Position: ~16.0 on x-axis, ~0.830 on y-axis.
* **Point 2:** `k=5`. Position: ~17.5 on x-axis, ~0.805 on y-axis.
* **Point 3:** `k=3`. Position: ~18.5 on x-axis, ~0.775 on y-axis.
**3. Series: `short-3@k (Ours)` (Cyan Diamonds)**
* **Trend:** Shows a **non-linear, peaked relationship**. Accuracy increases steadily from `k=1` to a peak at `k=9`, while Time-to-Answer first rises and then falls.
* **Data Points:**
* **Point 1:** `k=1`. Position: ~21.5 on x-axis, ~0.675 on y-axis (lowest accuracy on the chart).
* **Point 2:** `k=3`. Position: ~24.5 on x-axis, ~0.780 on y-axis.
* **Point 3:** `k=5`. Position: ~22.0 on x-axis, ~0.825 on y-axis.
* **Point 4:** `k=9`. Position: ~20.0 on x-axis, ~0.860 on y-axis (highest accuracy on the chart).
### Key Observations
1. **Performance Frontier:** The `short-3@k` method (cyan diamonds) defines the upper-left performance frontier for higher `k` values (`k=5,9`), achieving the highest accuracies (~0.825, ~0.860) at moderate time costs (~22.0, ~20.0).
2. **Speed vs. Accuracy Trade-off:** The `short-1@k` method (blue squares) is the fastest (lowest Time-to-Answer, 16-18.5) but shows a clear trade-off: higher `k` yields higher accuracy but at the cost of increased time.
3. **Inefficiency of `majority@k`:** The `majority@k` method (red circles) is consistently the slowest (highest Time-to-Answer, 24.5-26.5) for its given accuracy levels. For example, at an accuracy of ~0.805, `majority@k` (`k=9`) requires ~26.5 time units, while `short-1@k` (`k=5`) requires only ~17.5.
4. **Parameter `k` Impact:** For the "Ours" methods, increasing `k` generally improves accuracy but has a complex effect on time. For `short-1@k`, higher `k` increases time. For `short-3@k`, the relationship is non-monotonic; the highest accuracy (`k=9`) occurs at a *lower* time (~20.0) than the `k=3` point (~24.5).
### Interpretation
This chart likely comes from a research paper evaluating novel methods (`short-1@k` and `short-3@k`, labeled "Ours") against a baseline (`majority@k`) for a reasoning or question-answering task where "thinking time" is a measurable resource.
* **The data suggests** that the proposed `short-3@k` method is the most effective, capable of reaching peak accuracy. Its non-linear trend implies an optimal operating point (around `k=5` or `k=9`) where it maximizes accuracy without a proportional increase in computational cost.
* **The `short-1@k` method** appears optimized for speed, offering a fast but less accurate solution. Its negative trend indicates that forcing it to consider more candidates (`k`) degrades its efficiency without a net accuracy gain in this time-accuracy view.
* **The `majority@k` baseline** is shown to be computationally expensive. The positive trend suggests it benefits from more "thinking" time, but it is outperformed in both speed and peak accuracy by the new methods.
* **The overarching narrative** is one of algorithmic improvement: the new methods (`short-*`) achieve better accuracy-time Pareto fronts than the baseline. `short-3@k` is the high-accuracy specialist, while `short-1@k` is the low-latency specialist. The choice between them would depend on whether the application prioritizes maximum correctness or minimal response time.
</details>
(d) R1-670B
Figure 17: HMMT Feb 2025 - time-to-answer comparison.
## Appendix C Ablation studies
We investigate two axes of short-m@k: the value of $m$ and the tie-breaking method. For all experiments we use LN-Super-49B, reporting results over the three benchmarks described in Section 3.1. For the ablation studies we focus on controlling thinking compute.
We start by ablating different $m\in\{1,3,4,5,7,9\}$ for short-m@k. Results are shown in Figure 18(a). As observed in our main results, short-1@k outperforms the others in low-compute regimes, while being less effective for larger compute budgets. Larger $m$ values perform similarly to one another, with higher values yielding slightly better results in high-compute scenarios.
Next, we analyze the tie-breaking rule of short-m@k. Our method selects the shortest reasoning chain among the vote-leading options. We compare this strategy to random tie-breaking, and to tie-breaking by the longest reasoning chain among the options. As shown in Figure 18(b), the short-m@k strategy outperforms random tie-breaking, while choosing the option with the longest reasoning chain yields inferior results.
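The selection rule ablated here can be sketched as follows. This is a minimal sketch, not the paper's code: it assumes the $k$ sampled chains arrive as hypothetical `(answer, thinking_tokens)` pairs, keeps the $m$ chains that would finish first under parallel decoding, majority-votes among them, and breaks ties by the shortest chain among the vote leaders.

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Select a final answer from k sampled chains, short-m@k style.

    chains: list of (answer, thinking_tokens) pairs (illustrative layout).
    Keeps the m chains with the fewest thinking tokens (i.e., those that
    finish first in parallel decoding), majority-votes among them, and
    breaks ties by the shortest chain among the vote-leading answers.
    """
    # The m chains that would finish first under parallel decoding.
    finished = sorted(chains, key=lambda c: c[1])[:m]
    votes = Counter(ans for ans, _ in finished)
    top = max(votes.values())
    leaders = {ans for ans, n in votes.items() if n == top}
    # Tie-break: shortest thinking chain among vote-leading answers.
    return min((c for c in finished if c[0] in leaders), key=lambda c: c[1])[0]
```

With $m=k$ and the tie-break removed, this reduces to standard majority voting, which is exactly the axis the ablation varies.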
<details>
<summary>x66.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different "Short" Configurations
### Overview
The image displays a line chart plotting model accuracy against computational cost, measured in thinking tokens. It compares six different configurations labeled "short-1" through "short-9". The chart demonstrates a clear positive correlation between increased compute and accuracy, with all lines showing a logarithmic-like growth pattern that begins to plateau at higher compute levels.
### Components/Axes
* **Chart Type:** Multi-line chart with markers.
* **X-Axis:**
* **Title:** `Thinking Compute (thinking tokens in thousands)`
* **Scale:** Linear, ranging from approximately 10 to 120 (thousand tokens).
* **Major Tick Marks:** 20, 40, 60, 80, 100, 120.
* **Y-Axis:**
* **Title:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.47 to 0.63.
* **Major Tick Marks:** 0.48, 0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Content:** Six entries, each with a unique color, line style, and marker.
1. `short-1`: Solid, bright cyan line with circular markers.
2. `short-3`: Solid, medium turquoise line with circular markers.
3. `short-4`: Dashed, medium sea green line with circular markers.
4. `short-5`: Dashed, dark green line with circular markers.
5. `short-7`: Dashed, olive/brown line with circular markers.
6. `short-9`: Dashed, dark olive green line with circular markers.
* **Grid:** Light gray grid lines are present for both major x and y ticks.
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
All six data series follow a similar trajectory: a steep initial rise in accuracy as compute increases from ~10k to ~40k tokens, followed by a gradual flattening of the curve (diminishing returns) as compute approaches 120k tokens.
1. **`short-1` (Cyan, Solid):**
* **Trend:** The leftmost and initially highest-performing line. It rises steeply and then plateaus earlier than others.
* **Data Points:** (~10k, 0.47), (~20k, 0.525), (~30k, 0.55), (~40k, 0.57), (~50k, 0.58), (~60k, 0.59), (~70k, 0.595), (~80k, 0.60). It ends near 80k tokens.
2. **`short-3` (Turquoise, Solid):**
* **Trend:** Follows a path very close to, but slightly below, `short-1` in the mid-range. It extends further along the x-axis.
* **Data Points:** (~10k, 0.47), (~20k, 0.52), (~30k, 0.55), (~40k, 0.57), (~50k, 0.58), (~60k, 0.59), (~70k, 0.60), (~80k, 0.605), (~90k, 0.61), (~100k, 0.615).
3. **`short-4` (Sea Green, Dashed):**
* **Trend:** Starts lower than the solid lines but crosses above them in the high-compute region (>80k tokens), achieving one of the highest final accuracies.
* **Data Points:** (~10k, 0.47), (~20k, 0.51), (~30k, 0.54), (~40k, 0.56), (~50k, 0.575), (~60k, 0.59), (~70k, 0.60), (~80k, 0.61), (~90k, 0.615), (~100k, 0.62), (~110k, 0.625).
4. **`short-5` (Dark Green, Dashed):**
* **Trend:** Very similar path to `short-4`, running just below it for most of the range.
* **Data Points:** (~10k, 0.47), (~20k, 0.51), (~30k, 0.54), (~40k, 0.56), (~50k, 0.575), (~60k, 0.585), (~70k, 0.595), (~80k, 0.605), (~90k, 0.61), (~100k, 0.615), (~110k, 0.62).
5. **`short-7` (Olive, Dashed):**
* **Trend:** The lowest-performing line for most of the compute range, but it continues to rise steadily and ends at a high point.
* **Data Points:** (~10k, 0.47), (~20k, 0.51), (~30k, 0.535), (~40k, 0.555), (~50k, 0.57), (~60k, 0.58), (~70k, 0.59), (~80k, 0.60), (~90k, 0.605), (~100k, 0.61), (~110k, 0.615), (~120k, 0.62).
6. **`short-9` (Dark Olive, Dashed):**
* **Trend:** Closely follows `short-7`, often overlapping or running just below it.
* **Data Points:** (~10k, 0.47), (~20k, 0.51), (~30k, 0.535), (~40k, 0.555), (~50k, 0.57), (~60k, 0.58), (~70k, 0.59), (~80k, 0.60), (~90k, 0.605), (~100k, 0.61), (~110k, 0.615), (~120k, 0.62).
### Key Observations
1. **Diminishing Returns:** All configurations show that the gain in accuracy per additional thousand thinking tokens decreases significantly after approximately 40-60k tokens.
2. **Performance Hierarchy:** At low-to-mid compute (20k-60k tokens), the solid-line configurations (`short-1`, `short-3`) outperform the dashed-line ones. However, at high compute (>80k tokens), the dashed-line configurations (`short-4`, `short-5`) achieve equal or slightly higher accuracy.
3. **Convergence:** The performance gap between the best and worst configurations narrows as compute increases. At ~120k tokens, the spread in accuracy is much smaller than at ~40k tokens.
4. **Endpoint Variation:** The configurations terminate at different maximum compute levels (e.g., `short-1` at ~80k, `short-4` at ~110k, `short-7/9` at ~120k), suggesting different computational budgets or stopping criteria.
### Interpretation
This chart illustrates a fundamental trade-off in AI model inference: **accuracy versus computational cost**. The data suggests that:
* **Investing compute yields better results, but with diminishing returns.** The most cost-effective gains are made in the early phase (up to ~40k thinking tokens). Pushing beyond this requires substantially more compute for smaller improvements.
* **The optimal configuration depends on the available budget.** For applications with strict latency or cost constraints (low compute budget), `short-1` or `short-3` are superior choices. For applications where maximum accuracy is paramount and compute is less constrained, `short-4` or `short-5` become the better options, as they continue to improve at higher token counts.
* **The "short-X" label likely represents a model or prompting configuration parameter.** The consistent ordering and behavior suggest that lower numbers (1, 3) may be optimized for efficiency, while higher numbers (4, 5, 7, 9) may be optimized for peak performance, trading off initial efficiency for higher ultimate accuracy. The near-identical performance of `short-7` and `short-9` indicates a potential performance ceiling for this family of configurations.
</details>
(a) $m$ values ablation of short-m@k
<details>
<summary>x67.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Tie-Breaking Strategies
### Overview
The image is a line chart plotting model accuracy against computational effort ("Thinking Compute") for three different experimental conditions or strategies. The chart demonstrates how accuracy scales with increased compute for each strategy, showing distinct performance trajectories.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** `Thinking Compute (thinking tokens in thousands)`
* **Scale:** Linear scale from approximately 10 to 90 (thousands of tokens).
* **Major Tick Markers:** 20, 40, 60, 80.
* **Y-Axis (Vertical):**
* **Label:** `Accuracy`
* **Scale:** Linear scale from 0.425 to 0.625.
* **Major Tick Markers:** 0.425, 0.450, 0.475, 0.500, 0.525, 0.550, 0.575, 0.600, 0.625.
* **Legend:**
* **Position:** Bottom-right corner of the chart area.
* **Entries (from top to bottom):**
1. **Solid cyan line with circle markers:** `short-3 - tie - short`
2. **Dashed dark teal line with circle markers:** `short-3 - tie - random`
3. **Dotted gray line with circle markers:** `short-3 - tie - long`
### Detailed Analysis
**1. Data Series: `short-3 - tie - short` (Solid Cyan Line)**
* **Trend:** This line shows a strong, consistent upward trend that begins to plateau at higher compute levels. It is the top-performing strategy across the entire range.
* **Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.470
* At ~25k tokens: Accuracy ≈ 0.525
* At ~40k tokens: Accuracy ≈ 0.560
* At ~55k tokens: Accuracy ≈ 0.590
* At ~70k tokens: Accuracy ≈ 0.605
* At ~85k tokens: Accuracy ≈ 0.615
* At ~90k tokens: Accuracy ≈ 0.618
**2. Data Series: `short-3 - tie - random` (Dashed Dark Teal Line)**
* **Trend:** This line shows an initial dip, followed by a steady upward trend. It consistently performs below the `short` strategy but above the `long` strategy for most of the range.
* **Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.470 (similar starting point to `short`)
* At ~25k tokens: Accuracy ≈ 0.465 (slight dip)
* At ~40k tokens: Accuracy ≈ 0.510
* At ~55k tokens: Accuracy ≈ 0.545
* At ~70k tokens: Accuracy ≈ 0.575
* At ~85k tokens: Accuracy ≈ 0.595
* At ~90k tokens: Accuracy ≈ 0.602
**3. Data Series: `short-3 - tie - long` (Dotted Gray Line)**
* **Trend:** This line exhibits a significant initial drop, followed by a strong recovery and upward trend. It is the lowest-performing strategy until very high compute levels, where it nearly converges with the `random` strategy.
* **Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.470
* At ~25k tokens: Accuracy ≈ 0.410 (sharp dip, the lowest point on the chart)
* At ~40k tokens: Accuracy ≈ 0.480
* At ~55k tokens: Accuracy ≈ 0.520
* At ~70k tokens: Accuracy ≈ 0.560
* At ~85k tokens: Accuracy ≈ 0.590
* At ~90k tokens: Accuracy ≈ 0.598
### Key Observations
1. **Performance Hierarchy:** The `short-3 - tie - short` strategy is unambiguously superior, achieving the highest accuracy at every measured point beyond the initial value.
2. **The 25k Token Dip:** Both the `random` and `long` strategies show a performance decrease or stagnation at approximately 25,000 thinking tokens, with the `long` strategy experiencing a dramatic drop. The `short` strategy does not exhibit this dip.
3. **Convergence at High Compute:** As thinking compute increases towards 90k tokens, the performance gap between the `random` and `long` strategies narrows significantly, though both remain below the `short` strategy.
4. **Diminishing Returns:** All three curves show signs of diminishing returns; the slope (rate of accuracy improvement per additional token) decreases as compute increases, particularly for the top-performing `short` strategy after ~60k tokens.
### Interpretation
This chart likely comes from research on large language model (LLM) reasoning or "chain-of-thought" processing, where "thinking tokens" represent internal computation before generating a final answer. The labels suggest an experiment on how to break ties or handle ambiguity (`tie`) in a model's internal reasoning process (`short-3` likely refers to a specific model or prompt configuration).
The data suggests a clear conclusion: **employing a "short" tie-breaking strategy is more effective and compute-efficient than "random" or "long" strategies.** The "long" strategy's severe performance penalty at low-to-mid compute (25k tokens) indicates it may introduce noise or confusion that requires substantially more computational resources to overcome. The eventual convergence of `random` and `long` at high compute implies that with enough processing power, the negative effects of a poor tie-breaking heuristic can be mitigated, but never fully erased, as the `short` strategy maintains its lead. For practical applications, this indicates that careful design of internal reasoning heuristics (like tie-breaking) is crucial for achieving high accuracy without requiring excessive computational resources.
</details>
(b) Tie-breaking ablation
Figure 18: Ablation studies over different $m$ values for short-m@k, and different tie-breaking methods. Both figures show the model's average accuracy across benchmarks as a function of the length of its thinking trajectories (measured in thousands of tokens).
## Appendix D Small models results
We present our main results (Sections 3 and 4) using smaller models. We use Llama-3.1-Nemotron-Nano-8B-v1 [LN-Nano-8B; Bercovich et al., 2025] and R1-Distill-Qwen-7B [R1-7B; Guo et al., 2025]. Table 7 (corresponding to Table 1) presents a comparison between taking the shortest/longest/random generation per example for the smaller models. As observed for the larger models, using the shortest answer outperforms both random and longest answers across all benchmarks and models.
Table 7: Comparison between taking the shortest/longest/random generation per example.
| | GPQA-D | | AIME 2024 | | AIME 2025 | | HMMT | | Math Average | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Thinking Tokens $\downarrow$ | Acc. $\uparrow$ | Thinking Tokens $\downarrow$ | Acc. $\uparrow$ | Thinking Tokens $\downarrow$ | Acc. $\uparrow$ | Thinking Tokens $\downarrow$ | Acc. $\uparrow$ | Thinking Tokens $\downarrow$ | Acc. $\uparrow$ |
| **LN-Nano-8B** | | | | | | | | | | |
| random | 7003 | 52.2 | 10380 | 62.1 | 11869 | 46.5 | 12693 | 34.0 | 11647 | 47.5 |
| longest | 10594 (+51%) | 41.4 | 16801 | 40.0 | 17140 | 33.3 | 18516 | 23.3 | 17486 (+50%) | 32.2 |
| shortest | 3937 (-44%) | 55.1 | 6047 | 70.0 | 7127 | 46.7 | 7508 | 50.0 | 6894 (-41%) | 55.6 |
| **R1-7B** | | | | | | | | | | |
| random | 7015 | 35.5 | 11538 | 57.8 | 12377 | 42.2 | 14693 | 25.0 | 12869 | 41.7 |
| longest | 11863 (+69%) | 29.8 | 21997 | 26.7 | 21029 | 26.7 | 23899 | 13.3 | 22308 (+73%) | 22.2 |
| shortest | 3438 (-51%) | 46.5 | 5217 | 76.7 | 6409 | 53.3 | 6950 | 43.3 | 6192 (-52%) | 57.8 |
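The per-example selection behind Table 7 can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code; the data layout (each question mapped to its sampled generations with thinking-token counts and a correctness flag) is assumed for clarity.

```python
import random

def select_and_score(generations, mode, seed=0):
    """Pick one generation per question by thinking length, then score.

    generations: {question_id: [{"tokens": int, "correct": bool}, ...]}
                 (hypothetical layout, not the paper's data format)
    mode: "shortest", "longest", or "random".
    Returns (accuracy, mean thinking tokens of the chosen generations).
    """
    rng = random.Random(seed)
    pickers = {
        "shortest": lambda gens: min(gens, key=lambda g: g["tokens"]),
        "longest": lambda gens: max(gens, key=lambda g: g["tokens"]),
        "random": lambda gens: rng.choice(gens),
    }
    chosen = [pickers[mode](gens) for gens in generations.values()]
    accuracy = sum(g["correct"] for g in chosen) / len(chosen)
    mean_tokens = sum(g["tokens"] for g in chosen) / len(chosen)
    return accuracy, mean_tokens
```

Running this over the per-question samples would yield one (accuracy, mean-tokens) pair per row of the table above.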
Next, we analyze the performance of short-m@k using smaller models (see details in Section 4). Figures 19, 20 and 21 present the sample-size, compute, and time-to-answer results for the small models over the math benchmarks, respectively; Figures 22, 23 and 24 present the corresponding results for GPQA-D.
The performance of short-m@k using small models remains consistent with that observed in larger ones: short-1@k demonstrates a performance advantage over majority voting in low-compute regimes, while short-3@k dominates it across all compute budgets.
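The compute and time-to-answer axes used in these figures follow directly from the halting rule: all $k$ chains decode in parallel and stop when the $m$-th one finishes. A minimal sketch of that accounting (lengths in thinking tokens; illustrative, not the paper's measurement code):

```python
def short_m_cost(lengths, m):
    """Cost of short-m@k given the thinking lengths of k parallel chains.

    Wall time (time-to-answer) is the length of the m-th chain to finish;
    every chain contributes at most that many tokens to total thinking
    compute, since unfinished chains are halted at that point.
    """
    halt = sorted(lengths)[m - 1]                 # m-th shortest length
    compute = sum(min(l, halt) for l in lengths)  # total thinking tokens
    return halt, compute
```

This is why short-1@k is cheapest on both axes, and why short-m@k can beat majority@k in wall time even at the same $k$: majority voting must wait for the longest chain, while short-m@k stops at the $m$-th shortest.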
<details>
<summary>x68.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Different Classification Methods
### Overview
The image is a line chart comparing the classification accuracy of four different methods as a function of increasing sample size (k). The chart demonstrates how the performance of two baseline methods and two proposed methods (each using either k-Nearest Neighbors (kNN) or Support Vector Machine (SVM) classifiers) scales with more training data.
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **X-Axis:** Labeled **"Sample Size (k)"**. It has a linear scale with major tick marks at integer values from 1 to 10.
* **Y-Axis:** Labeled **"Accuracy"**. It has a linear scale ranging from approximately 0.47 to 0.70, with major grid lines at intervals of 0.05 (0.50, 0.55, 0.60, 0.65, 0.70).
* **Legend:** Positioned in the **top-left corner** of the chart area. It contains four entries, each with a unique line style, color, and marker:
1. **Baseline (kNN):** Solid blue line with square markers (■).
2. **Baseline (SVM):** Solid red line with circle markers (●).
3. **Proposed Method (kNN):** Solid cyan line with diamond markers (◆).
4. **Proposed Method (SVM):** Dotted black line with upward-pointing triangle markers (▲).
* **Grid:** A light gray grid is present for both major x and y ticks, aiding in value estimation.
### Detailed Analysis
**Data Series and Approximate Values:**
All series start at the same point for k=1 (Accuracy ≈ 0.47). The following table lists the approximate accuracy values for each method at each sample size (k). Values are estimated from the chart's grid.
| k | Baseline (kNN) [Blue, ■] | Baseline (SVM) [Red, ●] | Proposed Method (kNN) [Cyan, ◆] | Proposed Method (SVM) [Black, ▲] |
| :--- | :--- | :--- | :--- | :--- |
| **1** | 0.47 | 0.47 | 0.47 | 0.47 |
| **2** | 0.52 | 0.49 | 0.52 | 0.57 |
| **3** | 0.545 | 0.51 | 0.545 | 0.615 |
| **4** | 0.55 | 0.535 | 0.56 | 0.64 |
| **5** | 0.555 | 0.55 | 0.575 | 0.66 |
| **6** | 0.56 | 0.56 | 0.58 | 0.67 |
| **7** | 0.56 | 0.565 | 0.585 | 0.68 |
| **8** | 0.56 | 0.57 | 0.59 | 0.69 |
| **9** | 0.56 | 0.575 | 0.59 | 0.70 |
| **10** | 0.56 | 0.58 | 0.59 | 0.70 |
**Trend Verification:**
* **Baseline (kNN) [Blue, ■]:** Shows a moderate initial increase from k=1 to k=3, then **plateaus sharply** from k=6 onward, showing no further improvement.
* **Baseline (SVM) [Red, ●]:** Exhibits a **steady, near-linear increase** in accuracy across the entire range of k. It starts as the lowest-performing method after k=1 but surpasses the Baseline (kNN) method around k=6.
* **Proposed Method (kNN) [Cyan, ◆]:** Shows a **strong, consistent upward trend** that begins to **saturate** (the rate of increase slows) after k=6. It consistently outperforms both baseline methods.
* **Proposed Method (SVM) [Black, ▲]:** Displays the **steepest and most significant upward trend**. It diverges sharply from the other methods starting at k=2 and maintains a high rate of improvement, showing only slight saturation towards k=9 and k=10. It is the top-performing method by a large margin.
### Key Observations
1. **Performance Hierarchy:** There is a clear and consistent performance hierarchy for all k > 1: Proposed Method (SVM) > Proposed Method (kNN) > Baseline (SVM) β Baseline (kNN). The gap between the best and worst methods widens significantly as k increases.
2. **Impact of the Proposed Method:** The proposed method provides a substantial accuracy boost for both underlying classifiers (kNN and SVM). The improvement is most dramatic for the SVM classifier.
3. **Classifier Behavior:** The SVM-based methods (both baseline and proposed) show more sustained improvement with larger sample sizes compared to their kNN counterparts, which tend to plateau earlier.
4. **Convergence Point:** All methods have identical performance at the smallest sample size (k=1), suggesting the advantage of the proposed method and the benefit of larger datasets only become apparent with more data.
### Interpretation
This chart provides strong evidence for the efficacy of the "Proposed Method." The data suggests that the proposed technique is not merely a minor improvement but a fundamental enhancement that significantly boosts the learning capacity of standard classifiers, especially SVM.
The **key insight** here is that the relationship is not just about "more data is better." The chart reveals an **interaction effect**: the *type of method* determines *how effectively* additional data is utilized. The Proposed Method (SVM) demonstrates a superior "data efficiency": it translates each incremental increase in sample size into a larger gain in accuracy compared to all other methods.
The plateauing of the kNN-based methods (both baseline and proposed) might indicate an inherent limitation of the kNN algorithm or the feature space for this specific task, where beyond a certain point (kβ6), additional samples provide diminishing returns. In contrast, the continued rise of the SVM-based methods, particularly the proposed one, suggests they are better at leveraging complex patterns in larger datasets.
**Notable Anomaly:** The crossing of the two baseline lines (Red and Blue) around k=6 is interesting. It indicates that while kNN may be more data-efficient for very small samples, SVM's capacity to model complex boundaries allows it to eventually surpass kNN as the dataset grows, even without the proposed enhancement. The proposed method amplifies this inherent advantage of SVM.
</details>
(a) LN-Nano-8B
<details>
<summary>x69.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Different Selection Methods
### Overview
The image is a line chart comparing the performance (accuracy) of four different methods as the sample size (k) increases. The chart demonstrates how accuracy scales with more samples for an ideal "Oracle" method, a baseline "majority" voting method, and two proposed methods labeled "Ours".
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Sample Size (k)"
* **Scale:** Linear, from 1 to 10.
* **Markers:** Integers 1 through 10.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, from 0.40 to 0.65.
* **Gridlines:** Horizontal gridlines at intervals of 0.05 (0.40, 0.45, 0.50, 0.55, 0.60, 0.65).
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Position:** Bottom-right, inside the plot area.
* **Content:**
1. `pass@k (Oracle)`: Black dotted line with upward-pointing triangle markers.
2. `majority@k`: Dark red solid line with circle markers.
3. `short-1@k (Ours)`: Blue solid line with square markers.
4. `short-3@k (Ours)`: Cyan solid line with diamond markers.
### Detailed Analysis
**Data Series Trends and Approximate Values:**
1. **pass@k (Oracle) - Black Dotted Line with Triangles:**
* **Trend:** Shows the steepest and highest growth. It starts at the lowest point among all series at k=1 but quickly surpasses all others, maintaining a strong upward slope that begins to flatten slightly after k=6.
* **Approximate Data Points:**
* k=1: ~0.39
* k=2: ~0.49
* k=3: ~0.535
* k=4: ~0.565
* k=5: ~0.585
* k=6: ~0.605
* k=7: ~0.62
* k=8: ~0.63
* k=9: ~0.64
* k=10: ~0.65
2. **majority@k - Dark Red Solid Line with Circles:**
* **Trend:** Shows a steady, moderate upward slope. It is consistently the lowest-performing method for k > 1.
* **Approximate Data Points:**
* k=1: ~0.395
* k=2: ~0.42
* k=3: ~0.435
* k=4: ~0.46
* k=5: ~0.48
* k=6: ~0.495
* k=7: ~0.505
* k=8: ~0.51
* k=9: ~0.515
* k=10: ~0.52
3. **short-1@k (Ours) - Blue Solid Line with Squares:**
* **Trend:** Shows a strong upward slope, closely following but slightly below the `short-3@k` line. It performs significantly better than `majority@k` but worse than the `Oracle`.
* **Approximate Data Points:**
* k=1: ~0.395
* k=2: ~0.45
* k=3: ~0.475
* k=4: ~0.49
* k=5: ~0.505
* k=6: ~0.515
* k=7: ~0.52
* k=8: ~0.53
* k=9: ~0.535
* k=10: ~0.54
4. **short-3@k (Ours) - Cyan Solid Line with Diamonds:**
* **Trend:** Very similar to `short-1@k`, but maintains a slight, consistent advantage across all sample sizes. It is the best-performing of the non-Oracle methods.
* **Approximate Data Points:**
* k=1: ~0.395
* k=2: ~0.45
* k=3: ~0.48
* k=4: ~0.50
* k=5: ~0.51
* k=6: ~0.52
* k=7: ~0.525
* k=8: ~0.535
* k=9: ~0.54
* k=10: ~0.545
### Key Observations
1. **Performance Hierarchy:** A clear and consistent hierarchy is established for k > 1: `Oracle` ≫ `short-3@k` ≥ `short-1@k` > `majority@k`.
2. **Convergence at k=1:** All four methods start at nearly the same accuracy point (~0.39-0.395) when the sample size is 1.
3. **Diminishing Returns:** All curves show diminishing returns; the gain in accuracy per additional sample (k) decreases as k increases. This is most pronounced for the `Oracle` curve.
4. **Proximity of Proposed Methods:** The two methods labeled "(Ours)" perform very similarly, with `short-3@k` having a marginal but consistent edge over `short-1@k`.
5. **Significant Gap:** There is a substantial and growing gap between the ideal `Oracle` performance and the practical methods, especially at larger sample sizes.
### Interpretation
This chart likely comes from a research paper on machine learning or code generation, evaluating methods for selecting the best output from multiple samples (k). The "Oracle" represents an ideal upper bound where the correct answer is always selected if present among the k samples. The "majority@k" is a common baseline that picks the most frequent output.
The key finding is that the authors' proposed methods (`short-1@k` and `short-3@k`) significantly outperform the simple majority voting baseline. This suggests their selection strategy is more effective at identifying correct outputs. However, the persistent and large gap to the Oracle line indicates there is still substantial room for improvement in selection algorithms, as the theoretical potential (if the correct answer is in the set) is much higher than what current methods achieve. The near-identical performance of `short-1` and `short-3` might imply that the specific parameter (perhaps the length of a "short" list considered) has a minor impact compared to the core selection mechanism itself.
</details>
(b) R1-7B
Figure 19: Small models - sample size ( $k$ ) comparison over math benchmarks.
<details>
<summary>x70.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Reasoning Methods
### Overview
The image is a line chart comparing the performance (accuracy) of four different AI reasoning methods as a function of computational resources ("Thinking Compute"). The chart demonstrates how each method's accuracy scales with an increasing budget of "thinking tokens."
### Components/Axes
* **Chart Type:** Multi-series line chart with markers.
* **X-Axis:**
* **Title:** `Thinking Compute (thinking tokens in thousands)`
* **Scale:** Linear, ranging from approximately 10 to 120 (representing 10,000 to 120,000 thinking tokens).
* **Major Tick Marks:** 20, 40, 60, 80, 100, 120.
* **Y-Axis:**
* **Title:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.47 to 0.71.
* **Major Tick Marks:** 0.50, 0.55, 0.60, 0.65, 0.70.
* **Legend:** Located in the top-left corner of the plot area. It contains four entries:
1. **Chain-of-Thought (CoT):** Black dotted line with upward-pointing triangle markers.
2. **Self-Consistency (SC):** Cyan (light blue) solid line with diamond markers.
3. **Tree of Thoughts (ToT):** Cyan (light blue) solid line with square markers.
4. **Reflexion:** Red solid line with circle markers.
* **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
**Data Series Trends and Approximate Points:**
1. **Chain-of-Thought (CoT) - Black Dotted Line, Triangle Markers:**
* **Trend:** Shows the steepest and most consistent upward slope, indicating the highest marginal gain in accuracy per additional thinking token. It does not appear to plateau within the displayed range.
* **Approximate Data Points:**
* (10, 0.47)
* (20, 0.57)
* (30, 0.615)
* (40, 0.64)
* (50, 0.66)
* (60, 0.68)
* (70, 0.69)
* (80, 0.705)
2. **Self-Consistency (SC) - Cyan Line, Diamond Markers:**
* **Trend:** Shows a strong initial increase that begins to decelerate (curve flattens) after approximately 40-50k tokens. It achieves the second-highest accuracy.
* **Approximate Data Points:**
* (10, 0.47)
* (20, 0.52)
* (30, 0.545)
* (40, 0.56)
* (50, 0.575)
* (60, 0.58)
* (70, 0.585)
* (80, 0.59)
3. **Tree of Thoughts (ToT) - Cyan Line, Square Markers:**
* **Trend:** Increases initially but plateaus relatively early, showing minimal gains after approximately 50-60k tokens. Its final accuracy is very close to that of the SC method at the same compute level (~60k tokens).
* **Approximate Data Points:**
* (10, 0.47)
* (20, 0.52)
* (30, 0.55)
* (40, 0.555)
* (50, 0.56)
* (60, 0.56)
* (70, 0.56)
4. **Reflexion - Red Line, Circle Markers:**
* **Trend:** Shows the most gradual, nearly linear increase. It starts at the same point as the others but is consistently outperformed by CoT and SC for most of the range. It continues to improve slowly even at high compute levels (120k tokens).
* **Approximate Data Points:**
* (10, 0.47)
* (20, 0.49)
* (30, 0.51)
* (40, 0.535)
* (50, 0.55)
* (60, 0.56)
* (70, 0.565)
* (80, 0.565)
* (90, 0.57)
* (100, 0.575)
* (110, 0.578)
* (120, 0.58)
### Key Observations
* **Common Starting Point:** All four methods begin at approximately the same accuracy (~0.47) with minimal compute (~10k tokens).
* **Performance Hierarchy:** For nearly all compute budgets above 20k tokens, the performance order is consistent: CoT > SC β ToT > Reflexion. The gap between CoT and the others widens significantly as compute increases.
* **Diminishing Returns:** SC and ToT exhibit clear diminishing returns (plateauing), while CoT shows sustained strong returns. Reflexion shows very slow but steady returns.
* **Convergence:** The SC (diamond) and ToT (square) lines converge around 60k tokens at an accuracy of ~0.56.
### Interpretation
This chart provides a comparative efficiency analysis of different AI reasoning strategies. The data suggests that the **Chain-of-Thought (CoT)** method is the most "compute-efficient" for scaling accuracy; it translates additional thinking tokens into performance gains most effectively within the tested range. This could imply its underlying process is better structured to utilize extended deliberation.
**Self-Consistency (SC)** and **Tree of Thoughts (ToT)** offer a middle ground, matching the other methods at the lowest compute budgets but hitting a performance ceiling sooner. Their similar trajectories suggest they may share fundamental limitations in how they explore the solution space.
**Reflexion**, while improving steadily, is the least efficient in this context. Its linear growth might indicate a more iterative, less parallelizable process that benefits from more tokens but at a lower rate.
The key takeaway is that the choice of reasoning method has a profound impact on the return-on-investment for computational resources (thinking tokens). For tasks where high accuracy is critical and compute is available, CoT appears superior. For resource-constrained environments, SC or ToT might offer a better accuracy/compute trade-off at lower budgets. The chart does not show an upper bound for CoT, leaving open the question of where its performance might eventually plateau.
</details>
(a) LN-Nano-8B
<details>
<summary>x71.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart comparing the performance (Accuracy) of four different methods as a function of computational effort (Thinking Compute). The chart demonstrates how accuracy scales with increased compute for an "Oracle" method and three alternative approaches, two of which are labeled as "(Ours)".
### Components/Axes
* **Y-Axis:** Labeled "Accuracy". The scale ranges from 0.40 to 0.65, with major grid lines at intervals of 0.05.
* **X-Axis:** Labeled "Thinking Compute (thinking tokens in thousands)". The scale ranges from 20 to 140, with major grid lines at intervals of 20 (20, 40, 60, 80, 100, 120, 140).
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries:
1. `pass@k (Oracle)`: Represented by a black dotted line with upward-pointing triangle markers.
2. `majority@k`: Represented by a solid dark red line with circle markers.
3. `short-1@k (Ours)`: Represented by a solid blue line with square markers.
4. `short-3@k (Ours)`: Represented by a solid cyan line with diamond markers.
### Detailed Analysis
All four data series originate from the same approximate starting point at the lowest compute value shown.
**1. pass@k (Oracle)**
* **Trend:** Exhibits the steepest, near-linear upward slope. It demonstrates the highest accuracy for any given compute level above the starting point.
* **Data Points (Approximate):**
* (20, 0.40)
* (30, 0.485)
* (40, 0.535)
* (50, 0.565)
* (60, 0.59)
* (70, 0.615)
* (80, 0.65)
**2. majority@k**
* **Trend:** Shows the most gradual, concave upward slope. It has the lowest accuracy of all methods for compute values above ~25.
* **Data Points (Approximate):**
* (20, 0.40)
* (40, 0.43)
* (60, 0.46)
* (80, 0.49)
* (100, 0.505)
* (120, 0.515)
* (140, 0.52)
**3. short-1@k (Ours)**
* **Trend:** Shows a moderate, concave upward slope, positioned between the Oracle and majority methods.
* **Data Points (Approximate):**
* (20, 0.40)
* (30, 0.475)
* (40, 0.49)
* (50, 0.51)
* (60, 0.525)
* (70, 0.54)
**4. short-3@k (Ours)**
* **Trend:** Follows a very similar trajectory to `short-1@k (Ours)`, with a nearly identical slope, but is consistently positioned slightly to the right (requiring more compute for similar accuracy) or slightly below (lower accuracy for similar compute).
* **Data Points (Approximate):**
* (20, 0.40)
* (30, 0.45)
* (40, 0.48)
* (50, 0.50)
* (60, 0.515)
* (70, 0.525)
* (80, 0.535)
* (90, 0.54)
### Key Observations
1. **Universal Starting Point:** All methods begin at approximately 0.40 accuracy with 20k thinking tokens.
2. **Performance Hierarchy:** A clear and consistent hierarchy is established: `pass@k (Oracle)` >> `short-1@k (Ours)` β `short-3@k (Ours)` > `majority@k`.
3. **Diminishing Returns:** All curves show signs of diminishing returns (concavity), but the degree varies drastically. The Oracle method's returns diminish the least within the plotted range.
4. **Proximity of "Ours" Methods:** The two proposed methods (`short-1` and `short-3`) perform very similarly, with `short-1` having a slight edge in efficiency (achieving the same accuracy with less compute).
5. **Compute Range:** The Oracle method is only plotted up to 80k tokens, while `majority@k` extends to 140k, suggesting the Oracle may not require or was not tested at higher compute levels.
### Interpretation
This chart likely comes from research on scaling inference compute for language models or reasoning systems. The "Thinking Compute" axis represents the resource (in tokens) allocated to a problem-solving process.
* **The "Oracle" as an Upper Bound:** The `pass@k (Oracle)` line represents a theoretical or idealized best-case scenario (perhaps using ground-truth information or an unbounded verifier). It serves as a performance ceiling, showing the maximum achievable accuracy for a given compute budget under perfect conditions.
* **Efficiency of Proposed Methods:** The core message is that the authors' methods (`short-1@k` and `short-3@k`) offer a significant efficiency improvement over the `majority@k` baseline. They achieve substantially higher accuracy for the same compute, or the same accuracy with much less compute. For example, to reach 0.50 accuracy, `majority@k` requires ~100k tokens, while `short-1@k` requires only ~45k tokens.
* **The Cost of "Short" Strategies:** The names `short-1` and `short-3` refer to answering with the m = 1 or m = 3 thinking chains that finish first out of k parallel generations. The chart quantifies the trade-off: these strategies fall short of the idealized Oracle but are far more compute-efficient than a simple majority vote over all k chains, striking a practical balance for real-world applications where compute is limited.
* **Scalability Insight:** The steep slope of the Oracle line suggests that with perfect verification, accuracy scales very favorably with compute. The flatter slopes of the other methods indicate they hit practical limits or inefficiencies in how they utilize additional compute. The research likely aims to close the gap between practical methods and the Oracle bound.
</details>
(b) R1-7B
Figure 20: Small models - thinking compute comparison over math benchmarks.
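For reference, the inference rules compared in these figures can be simulated in a few lines. This is a minimal sketch, assuming each sampled generation is reduced to a `(thinking_tokens, answer)` pair; the function names and the tie-breaking rule are illustrative assumptions, not taken from the paper's code:

```python
from collections import Counter

def majority_at_k(chains):
    """majority@k baseline: wait for all k chains, majority-vote over answers."""
    votes = Counter(answer for _, answer in chains)
    return votes.most_common(1)[0][0]

def short_m_at_k(chains, m):
    """short-m@k: run k chains in parallel, keep the first m to finish
    (equivalently, the m shortest thinking chains), and majority-vote among
    them. Vote ties are broken toward the chain that finished first (one
    reasonable choice; the paper only specifies majority voting)."""
    first_done = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_done)
    top = votes.most_common(1)[0][1]
    for _, answer in first_done:  # first_done is ordered by finish time
        if votes[answer] == top:
            return answer

# Example: five sampled chains as (thinking_tokens, answer)
chains = [(12_000, "A"), (4_000, "B"), (7_000, "A"), (20_000, "C"), (5_000, "A")]
print(majority_at_k(chains))      # -> A (3 of 5 votes, after all chains finish)
print(short_m_at_k(chains, m=3))  # -> A (2 of the 3 shortest chains agree)
print(short_m_at_k(chains, m=1))  # -> B (the single shortest chain's answer)
```

The point of the comparison in Figure 20 is that the `short_m_at_k` rule can stop decoding long before `majority_at_k` would, while voting over chains that are, per the paper's analysis, more likely to be correct.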
<details>
<summary>x72.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different k-values
### Overview
The image is a scatter plot comparing the performance of three distinct methods or models (differentiated by marker shape and color) across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). Each data point is labeled with a parameter `k`, which appears to be a variable (e.g., number of neighbors, steps, or candidates) ranging from 1 to 9. The plot reveals a trade-off between speed and accuracy, with different methods occupying different regions of the performance space.
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from approximately 7.5 to 17.5, with major tick marks at 8, 10, 12, 14, and 16. The unit is thousands of thinking tokens, i.e. the length of the longest chain that must finish before an answer can be returned.
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from approximately 0.47 to 0.59, with major tick marks at 0.48, 0.50, 0.52, 0.54, 0.56, and 0.58.
* **Data Series (Inferred from Marker Style):** There is no explicit legend. The series are distinguished by marker shape and color:
1. **Cyan Squares:** Clustered on the left side of the plot (lower Time-to-Answer).
2. **Cyan Diamonds:** Distributed across the middle of the plot.
3. **Red Circles:** Clustered on the right side of the plot (higher Time-to-Answer).
* **Data Point Labels:** Each marker is annotated with text indicating the `k` value (e.g., "k=9", "k=5", "k=3", "k=1").
### Detailed Analysis
**Data Points by Series:**
* **Cyan Square Series (Left Cluster):**
* `k=9`: Accuracy β 0.563, Time-to-Answer β 7.8
* `k=5`: Accuracy β 0.555, Time-to-Answer β 8.5
* `k=3`: Accuracy β 0.542, Time-to-Answer β 9.5
* *Trend:* As `k` decreases, Accuracy decreases slightly while Time-to-Answer increases moderately.
* **Cyan Diamond Series (Middle Distribution):**
* `k=9`: Accuracy β 0.590, Time-to-Answer β 10.0
* `k=5`: Accuracy β 0.575, Time-to-Answer β 11.5
* `k=3`: Accuracy β 0.545, Time-to-Answer β 14.5
* `k=1`: Accuracy β 0.470, Time-to-Answer β 12.0
* *Trend:* For `k=9, 5, 3`, as `k` decreases, Accuracy decreases and Time-to-Answer increases. The `k=1` point is a significant outlier, showing the lowest accuracy but not the highest time.
* **Red Circle Series (Right Cluster):**
* `k=9`: Accuracy β 0.575, Time-to-Answer β 17.0
* `k=5`: Accuracy β 0.552, Time-to-Answer β 15.5
* `k=3`: Accuracy β 0.515, Time-to-Answer β 14.5
* *Trend:* As `k` decreases, both Accuracy and Time-to-Answer decrease.
### Key Observations
1. **Performance Clustering:** The three marker types occupy distinct regions. Cyan squares are fast but mid-accuracy. Cyan diamonds offer a range of accuracy (including the highest) at medium speed. Red circles are the slowest but can achieve high accuracy at high `k`.
2. **Impact of `k`:** For the Cyan Diamond and Red Circle series, higher `k` values consistently yield higher accuracy. The relationship with time is less uniform.
3. **Outlier:** The Cyan Diamond `k=1` point (Accuracy β 0.47) is a clear outlier, performing significantly worse in accuracy than all other points, despite having a moderate Time-to-Answer (~12).
4. **Speed-Accuracy Trade-off:** The plot visualizes a classic trade-off. The fastest method (Cyan Square `k=9`) has good accuracy (0.563). The most accurate method (Cyan Diamond `k=9`) is slower (Time=10). The slowest method (Red Circle `k=9`) has high accuracy (0.575) but is not the most accurate.
### Interpretation
This chart likely compares different algorithmic approaches or model configurations for a task requiring both deliberation (thinking time) and correctness. The `k` parameter is a key tuning knob.
* **Cyan Squares** represent a **"Fast but Limited"** approach. It's highly efficient (low time) but its accuracy ceiling is lower, and it doesn't benefit as dramatically from increasing `k`.
* **Cyan Diamonds** represent a **"Balanced and Scalable"** approach. It can achieve the peak accuracy in the chart, and its performance scales well with `k`, though at a cost of increased time. The poor performance at `k=1` suggests this method requires a minimum level of deliberation (`k>=3`) to be effective.
* **Red Circles** represent a **"Deliberate and High-Cost"** approach. It is inherently slower, but can reach high accuracy levels when given a high `k` value. Its performance degrades more steeply as `k` is reduced compared to the other methods.
The choice of method depends on the application's priority: if speed is critical, the Cyan Square method is best. If maximum accuracy is paramount and time is less constrained, the Cyan Diamond method with high `k` is optimal. The Red Circle method may be preferable if its specific characteristics (not shown, e.g., robustness, memory use) are beneficial, despite its higher time cost. The outlier at `k=1` for the Cyan Diamond method indicates a critical failure mode or threshold effect for that configuration.
</details>
(a) LN-Nano-8B
<details>
<summary>x73.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods
### Overview
This image is a scatter plot comparing the performance of three different methods (`majority@k`, `short-1@k`, `short-3@k`) across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). Each data point represents a specific configuration of a method, labeled with its `k` value. The plot illustrates the trade-off between computational time (thinking duration) and output accuracy for these methods.
### Components/Axes
* **X-Axis:** Labeled "Time-to-Answer (longest thinking in thousands)". The scale runs from approximately 7 to 20, with major tick marks at 7, 10, 12, 15, 17, and 20. The unit is thousands of thinking tokens in the longest chain that must complete before answering.
* **Y-Axis:** Labeled "Accuracy". The scale runs from 0.40 to 0.54, with major tick marks at intervals of 0.02 (0.40, 0.42, 0.44, 0.46, 0.48, 0.50, 0.52, 0.54).
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Red Circle:** `majority@k`
* **Blue Square:** `short-1@k (Ours)`
* **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Labels:** Each marker is annotated with a text label indicating its `k` value (e.g., "k=9", "k=5").
### Detailed Analysis
The plot contains nine distinct data points, three for each method.
**1. `majority@k` (Red Circles)**
* **Trend:** Shows a positive correlation. As Time-to-Answer increases, Accuracy generally increases.
* **Data Points:**
* `k=3`: Located at approximately (Time=17, Accuracy=0.43).
* `k=5`: Located at approximately (Time=20, Accuracy=0.48).
* `k=9`: Located at approximately (Time=22, Accuracy=0.515). This is the rightmost and one of the highest-accuracy points on the chart.
**2. `short-1@k (Ours)` (Blue Squares)**
* **Trend:** Shows a negative correlation. As Time-to-Answer increases, Accuracy decreases.
* **Data Points:**
* `k=3`: Located at approximately (Time=10, Accuracy=0.475).
* `k=5`: Located at approximately (Time=8, Accuracy=0.50).
* `k=9`: Located at approximately (Time=7, Accuracy=0.53). This is the leftmost point, indicating the fastest answer time, with accuracy second only to the best point on the chart.
**3. `short-3@k (Ours)` (Cyan Diamonds)**
* **Trend:** No clear monotonic trend. Points are scattered across the middle of the plot.
* **Data Points:**
* `k=1`: Located at approximately (Time=14, Accuracy=0.395). This is the lowest-accuracy point on the chart.
* `k=5`: Located at approximately (Time=13, Accuracy=0.51).
* `k=9`: Located at approximately (Time=11, Accuracy=0.535). This is the highest-accuracy point on the chart.
### Key Observations
1. **Performance Extremes:** The highest accuracy (~0.535) is achieved by `short-3@k` with `k=9` at a moderate time (~11). The fastest time (~7) is achieved by `short-1@k` with `k=9`, which also yields very high accuracy (~0.53).
2. **Method Behavior:** The two "Ours" methods (`short-1` and `short-3`) achieve peak accuracy at lower Time-to-Answer values compared to `majority@k`. `majority@k` requires significantly more time (17-22) to reach comparable accuracy levels (0.48-0.515).
3. **Impact of `k`:** For `short-1@k`, increasing `k` (from 3 to 9) dramatically *reduces* time and *increases* accuracy. For `majority@k`, increasing `k` increases both time and accuracy. For `short-3@k`, the relationship is non-linear.
4. **Outlier:** The `short-3@k, k=1` point is a clear outlier, having both low accuracy and moderate time, suggesting this configuration is ineffective.
### Interpretation
The data suggests a fundamental difference in how these methods utilize computational resources ("thinking time").
* **`short-1@k`** appears to be a highly efficient method. Its best configuration (`k=9`) is both the fastest and among the most accurate: sampling more chains in parallel makes the first chain to finish even shorter, so time drops while accuracy rises. The negative trend suggests that waiting on longer thinking chains reflects overthinking rather than deeper reasoning.
* **`majority@k`** follows a more traditional trade-off: investing more time yields better accuracy. It is a reliable but slower method, requiring 2-3x the time of `short-1@k` to reach similar accuracy.
* **`short-3@k`** shows high potential (peak accuracy) but is inconsistent. Its performance varies widely with `k`, making it less predictable. The `k=1` failure indicates a minimum threshold of complexity (`k` value) is needed for it to function effectively.
**Overall Implication:** The "Ours" methods, particularly `short-1@k`, demonstrate a superior Pareto frontier, offering a better balance of speed and accuracy compared to the `majority@k` baseline. The choice of `k` is a critical hyperparameter that affects each method differently.
</details>
(b) R1-7B
Figure 21: Small models - time-to-answer comparison over math benchmarks.
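Under the x-axis definition used in this figure ("longest thinking" among the chains that must complete), each method's time-to-answer reduces to an order statistic of the sampled chain lengths. A sketch under the assumption that all k chains decode in parallel at the same rate (the paper's exact accounting may differ):

```python
def time_to_answer(chain_lengths, m=None):
    """Wall-clock proxy for the x-axis of Figure 21: with k chains decoded
    in parallel at similar speed, the answer is ready once the required
    chains finish. majority@k must wait for the longest chain, while
    short-m@k halts as soon as the m-th shortest chain completes."""
    lengths = sorted(chain_lengths)
    if m is None:              # majority@k: wait for all k chains
        return lengths[-1]
    return lengths[m - 1]      # short-m@k: m-th chain to finish

lengths = [4_000, 5_000, 7_000, 12_000, 20_000]
print(time_to_answer(lengths))        # 20000 tokens (majority@k)
print(time_to_answer(lengths, m=1))   # 4000 tokens  (short-1@k)
print(time_to_answer(lengths, m=3))   # 7000 tokens  (short-3@k)
```

This is why increasing k can *decrease* time-to-answer for short-1@k: the minimum over more samples only gets smaller, whereas the maximum that majority@k waits on only grows.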
<details>
<summary>x74.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size (k)
### Overview
The image is a line chart plotting model or method accuracy against sample size. It displays three distinct data series, each represented by a colored line with markers, showing how accuracy changes as the sample size increases from 1 to 10. The chart includes a grid for easier reading of values.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Sample Size (k)". It has integer markers from 1 to 10.
* **Y-Axis (Vertical):** Labeled "Accuracy". It has numerical markers from 0.52 to 0.57, with major grid lines at intervals of 0.01.
* **Data Series (Lines):** There are three lines, differentiated by color and marker shape. There is no explicit legend box within the chart area; the series must be identified by their visual properties.
1. **Cyan Line:** Uses diamond-shaped markers (β).
2. **Teal Line:** Uses square markers (β ).
3. **Maroon Line:** Uses circular markers (β).
* **Grid:** A light gray grid is present, with vertical lines at each integer sample size and horizontal lines at each 0.01 accuracy increment.
### Detailed Analysis
**Trend Verification & Data Point Extraction:**
* **Cyan Line (Diamonds):** Shows a steep initial increase, peaks, and then slightly declines.
* k=1: ~0.522
* k=2: ~0.544
* k=3: ~0.554
* k=4: ~0.560
* k=5: ~0.564
* k=6: ~0.567
* k=7: ~0.568 (Peak)
* k=8: ~0.568
* k=9: ~0.567
* k=10: ~0.566
* **Teal Line (Squares):** Increases rapidly at first, then plateaus and shows a very slight downward trend.
* k=1: ~0.522
* k=2: ~0.544
* k=3: ~0.552
* k=4: ~0.556
* k=5: ~0.557
* k=6: ~0.558 (Peak)
* k=7: ~0.5575
* k=8: ~0.557
* k=9: ~0.556
* k=10: ~0.555
* **Maroon Line (Circles):** Shows a steady, consistent increase across the entire range, ending as the highest point.
* k=1: ~0.522
* k=2: ~0.534
* k=3: ~0.546
* k=4: ~0.556
* k=5: ~0.561
* k=6: ~0.564
* k=7: ~0.566
* k=8: ~0.567
* k=9: ~0.568
* k=10: ~0.569 (Highest point on chart)
### Key Observations
1. **Convergent Start:** All three methods start at approximately the same accuracy (~0.522) when the sample size is 1.
2. **Divergent Paths:** The lines diverge significantly between k=2 and k=5. The cyan and teal lines rise faster initially than the maroon line.
3. **Peak and Plateau Behavior:** The cyan line peaks around k=7-8 and then declines. The teal line peaks earlier (k=6) and declines more gradually. The maroon line shows no sign of peaking within the given range.
4. **Final Ranking:** At the largest sample size (k=10), the maroon line has the highest accuracy, followed by the cyan line, with the teal line having the lowest accuracy of the three.
5. **Diminishing Returns:** All lines show diminishing returns; the gain in accuracy per additional sample is much larger for small k (1 to 4) than for large k (7 to 10).
### Interpretation
This chart compares three sampling strategies as the number of generations drawn per question increases. The data suggests:
* **The Maroon method** benefits most consistently from additional samples. Its steady upward trend makes it the best choice when a large sampling budget (k=10) is available.
* **The Cyan method** is highly effective with moderate sample sizes, achieving the highest accuracy in the mid-range (k=7-8). Its slight decline at k=9-10 suggests that extra samples can begin to dilute the vote rather than sharpen it.
* **The Teal method** gains quickly from the first few samples but hits a performance ceiling early (around k=6). Samples beyond that point provide no benefit and may even slightly hurt.
The key takeaway is that more samples are not universally better for every method. There is a clear trade-off between the rapid initial gains of the Cyan/Teal methods and the sustained improvement of the Maroon method; the best choice depends on the cost of sampling additional generations versus the required accuracy.
</details>
(a) LN-Nano-8B
<details>
<summary>x75.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Sample Size for Three Methods
### Overview
The image is a line chart comparing the performance of three different methods as a function of sample size. The chart plots "Accuracy" on the vertical axis against "Sample Size (k)" on the horizontal axis. All three methods show a positive trend, with accuracy increasing as the sample size grows from 1 to 10.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Sample Size (k)"
* **Scale:** Linear, with integer markers from 1 to 10.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, with major gridlines and labels at 0.36, 0.38, 0.40, 0.42, and 0.44.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Items:**
1. `majority@k` - Represented by a brown line with circular markers.
2. `short-1@k (Ours)` - Represented by a blue line with square markers.
3. `short-3@k (Ours)` - Represented by a cyan (light blue) line with diamond markers.
* **Grid:** A light gray grid is present for both axes.
### Detailed Analysis
The chart displays three data series. Below is an approximate reconstruction of the data points based on visual inspection. Values are estimated to the nearest 0.005.
**Trend Verification:**
* **`majority@k` (Brown, Circles):** The line starts at the lowest point and slopes upward consistently, showing steady improvement.
* **`short-1@k (Ours)` (Blue, Squares):** The line starts higher than the brown line and slopes upward, maintaining a consistent lead over `majority@k`.
* **`short-3@k (Ours)` (Cyan, Diamonds):** The line starts at the same point as `short-1@k` but immediately diverges upward, maintaining the highest position throughout and showing the steepest initial slope.
**Data Point Extraction (Approximate Accuracy):**
| Sample Size (k) | `majority@k` (Brown) | `short-1@k (Ours)` (Blue) | `short-3@k (Ours)` (Cyan) |
| :--- | :--- | :--- | :--- |
| 1 | 0.355 | 0.355 | 0.355 |
| 2 | 0.368 | 0.385 | 0.385 |
| 3 | 0.379 | 0.400 | 0.402 |
| 4 | 0.395 | 0.410 | 0.415 |
| 5 | 0.407 | 0.417 | 0.424 |
| 6 | 0.416 | 0.424 | 0.431 |
| 7 | 0.423 | 0.429 | 0.437 |
| 8 | 0.428 | 0.434 | 0.442 |
| 9 | 0.432 | 0.438 | 0.447 |
| 10 | 0.436 | 0.442 | 0.452 |
### Key Observations
1. **Universal Improvement:** All three methods demonstrate a clear, monotonic increase in accuracy as the sample size (k) increases.
2. **Performance Hierarchy:** A consistent performance hierarchy is established from k=2 onward: `short-3@k` > `short-1@k` > `majority@k`. The two methods labeled "(Ours)" consistently outperform the `majority@k` baseline.
3. **Convergence at k=1:** All three methods begin at the exact same accuracy point (~0.355) when the sample size is 1.
4. **Diverging Gap:** The performance gap between the best method (`short-3@k`) and the baseline (`majority@k`) widens as k increases. At k=10, the gap is approximately 0.016 accuracy points.
5. **Relative Gain:** The `short-3@k` method shows a slightly greater relative improvement over `short-1@k` as k increases, suggesting it may scale better with more samples.
### Interpretation
This chart presents empirical evidence for the effectiveness of two proposed methods (`short-1@k` and `short-3@k`) against a baseline (`majority@k`) in a task where accuracy is measured. The key takeaway is that the proposed methods are superior, with `short-3@k` being the most effective.
The data suggests that the "short-3" variant is better at leveraging additional samples to improve its decisions. The fact that all methods start equally at k=1 implies the baseline is competitive with a single sample, but the proposed selection rules extract more value from multiple samples (k > 1). The consistent, smooth curves indicate stable and predictable performance scaling. This type of analysis is common in evaluating inference-time methods, where the goal is to show that a new technique provides a measurable and reliable improvement over existing approaches across a range of sampling budgets.
</details>
(b) R1-7B
Figure 22: Small models - sample size ( $k$ ) comparison over GPQA-D.
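Accuracy-vs-k curves like these are typically estimated by repeatedly drawing k-subsets from a larger pool of generations per question and averaging the vote's correctness. Below is a minimal sketch of such an estimator for a majority-vote rule; the pool construction and helper name are illustrative assumptions, not the paper's evaluation code:

```python
import random
from collections import Counter

def accuracy_at_k(pool, k, correct, trials=2000, seed=0):
    """Monte-Carlo estimate of majority-vote accuracy at sample size k:
    repeatedly draw k chains without replacement from a larger pool of
    (thinking_tokens, answer) samples, vote, and average correctness."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draw = rng.sample(pool, k)
        votes = Counter(answer for _, answer in draw)
        hits += votes.most_common(1)[0][0] == correct
    return hits / trials

# Toy pool: 7 of 10 sampled chains answer correctly ("A")
pool = [(1_000 * i, "A" if i < 7 else "B") for i in range(10)]
print(accuracy_at_k(pool, k=1, correct="A"))  # ~0.7: single-sample accuracy
print(accuracy_at_k(pool, k=9, correct="A"))  # 1.0: a 9-of-10 vote cannot lose here
```

Resampling in this way yields the smooth, monotone-looking curves seen in the figure even when the underlying per-question outcomes are noisy.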
<details>
<summary>x76.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against the amount of "thinking compute" allocated, measured in thousands of thinking tokens. It displays three distinct data series, each represented by a different colored line with unique markers, showing how performance scales with increased computational resources for different approaches or models.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear scale from 10 to 70, with major gridlines and labels at intervals of 10 (10, 20, 30, 40, 50, 60, 70).
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale from 0.52 to 0.57, with major gridlines and labels at intervals of 0.01 (0.52, 0.53, 0.54, 0.55, 0.56, 0.57).
* **Data Series (Lines):**
* **Series 1 (Cyan Line with Diamond Markers):** This line starts at the lowest compute point and shows the steepest initial improvement.
* **Series 2 (Blue Line with Square Markers):** This line also starts low and rises quickly but plateaus earlier than the others.
* **Series 3 (Red/Brown Line with Circle Markers):** This line starts at a similar low point but follows a steadier, more gradual upward trajectory.
* **Legend:** There is no explicit legend box within the chart area. The series are differentiated solely by line color and marker shape.
### Detailed Analysis
**Data Series 1: Cyan Line (Diamond Markers)**
* **Trend:** Shows a rapid, near-linear increase in accuracy from low compute, peaks, and then exhibits a slight decline at the highest compute levels shown for this series.
* **Approximate Data Points:**
* (7k tokens, ~0.522 accuracy)
* (12k tokens, ~0.544)
* (18k tokens, ~0.554)
* (25k tokens, ~0.560)
* (30k tokens, ~0.564)
* (35k tokens, ~0.567)
* (40k tokens, ~0.568) **[Peak]**
* (45k tokens, ~0.567)
* (50k tokens, ~0.566)
**Data Series 2: Blue Line (Square Markers)**
* **Trend:** Rises very steeply at the lowest compute levels, then flattens into a plateau, showing minimal gains and even a slight decrease as compute increases further.
* **Approximate Data Points:**
* (7k tokens, ~0.522 accuracy)
* (12k tokens, ~0.544)
* (16k tokens, ~0.552)
* (20k tokens, ~0.556)
* (25k tokens, ~0.557)
* (30k tokens, ~0.558) **[Plateau Start]**
* (35k tokens, ~0.557)
* (40k tokens, ~0.556)
* (45k tokens, ~0.555)
**Data Series 3: Red/Brown Line (Circle Markers)**
* **Trend:** Demonstrates a consistent, monotonic increase in accuracy across the entire range of compute. Its growth is less steep initially but sustains longer, eventually surpassing the other two series.
* **Approximate Data Points:**
* (7k tokens, ~0.522 accuracy)
* (20k tokens, ~0.546)
* (28k tokens, ~0.556)
* (35k tokens, ~0.561)
* (42k tokens, ~0.564)
* (50k tokens, ~0.566)
* (58k tokens, ~0.567)
* (65k tokens, ~0.568)
* (70k tokens, ~0.569) **[Highest Point on Chart]**
### Key Observations
1. **Convergence at Low Compute:** All three methods start at nearly the same accuracy (~0.522) when given minimal compute (~7k tokens).
2. **Diverging Scaling Laws:** The methods scale very differently. The cyan and blue methods show strong early returns but hit diminishing returns (blue) or a peak followed by slight degradation (cyan). The red method shows a more sustainable scaling law.
3. **Crossover Point:** The red line, which initially lags behind the cyan line, crosses above it at approximately 50k thinking tokens and continues to rise, becoming the highest-performing method at the highest compute levels shown (70k tokens).
4. **Performance Ceiling:** The cyan line suggests a potential performance ceiling or even a slight negative return beyond ~40k tokens for that specific method. The blue line hits a clear ceiling earlier, around 30k tokens.
### Interpretation
This chart illustrates a fundamental trade-off in test-time scaling: the relationship between computational investment (thinking tokens) and performance (accuracy). The data suggests that the different inference strategies represented by the three lines have vastly different **compute-optimal** profiles.
* The **blue method** is highly efficient for low-compute scenarios but cannot leverage additional resources effectively. It would be the best choice under strict compute budgets.
* The **cyan method** offers the best peak performance for a mid-range compute budget (30k-45k tokens) but may be unstable or over-optimize at higher levels, leading to performance regression.
* The **red method** demonstrates the most robust and scalable behavior. While less efficient at the very low end, it continues to improve predictably with more compute, making it the superior choice if high accuracy is the primary goal and computational resources are not a limiting factor.
The chart provides a visual argument for investigating why certain approaches plateau while others scale. It implies that simply adding more compute is not a universal solution; the underlying method must be capable of utilizing that compute productively. The crossover point is particularly important for decision-making, indicating the threshold at which investing in the more scalable (red) method becomes advantageous.
</details>
(a) LN-Nano-8B
<details>
<summary>x77.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Methods
### Overview
The image is a line chart comparing the performance of three different methods in terms of accuracy as a function of computational effort (thinking tokens). The chart demonstrates that two proposed methods ("short-1@k" and "short-3@k") achieve higher accuracy than a baseline method ("majority@k") for equivalent or lower computational cost.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** `Thinking Compute (thinking tokens in thousands)`
* **Scale:** Linear, ranging from approximately 5 to 70.
* **Major Ticks:** 10, 20, 30, 40, 50, 60, 70.
* **Y-Axis:**
* **Label:** `Accuracy`
* **Scale:** Linear, ranging from approximately 0.35 to 0.45.
* **Major Ticks:** 0.36, 0.38, 0.40, 0.42, 0.44.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Red line with circle markers:** `majority@k`
* **Blue line with square markers:** `short-1@k (Ours)`
* **Cyan line with diamond markers:** `short-3@k (Ours)`
* **Grid:** Light gray grid lines are present for both major x and y ticks.
### Detailed Analysis
**Data Series and Trends:**
1. **`majority@k` (Red line, circle markers):**
* **Trend:** Shows a steady, concave-down upward slope. The rate of accuracy improvement slows as compute increases.
* **Approximate Data Points:**
* (10, 0.355)
* (20, 0.378)
* (30, 0.395)
* (40, 0.407)
* (50, 0.416)
* (60, 0.422)
* (70, 0.435)
2. **`short-1@k (Ours)` (Blue line, square markers):**
* **Trend:** Shows a steep, nearly linear upward slope, consistently above the red line. It achieves the highest accuracy values on the chart for a given compute level.
* **Approximate Data Points:**
* (10, 0.355) - Starts at the same point as the other series.
* (20, 0.417)
* (30, 0.429)
* (40, 0.442)
* (50, 0.450) - Highest visible point on the chart.
3. **`short-3@k (Ours)` (Cyan line, diamond markers):**
* **Trend:** Shows a steep upward slope, very close to but slightly below the blue line (`short-1@k`). It is consistently above the red baseline.
* **Approximate Data Points:**
* (10, 0.355) - Starts at the same point as the other series.
* (20, 0.402)
* (30, 0.423)
* (40, 0.438)
* (50, 0.448)
### Key Observations
* **Common Origin:** All three methods begin at approximately the same accuracy (~0.355) when thinking compute is 10,000 tokens.
* **Performance Hierarchy:** For all compute levels >10k tokens, the order of performance from highest to lowest accuracy is: `short-1@k` > `short-3@k` > `majority@k`.
* **Efficiency Gap:** The performance gap between the proposed methods (blue/cyan) and the baseline (red) widens significantly as compute increases from 10k to 40k tokens.
* **Diminishing Returns:** All curves show signs of diminishing returns (flattening slope) at higher compute levels, but the baseline (`majority@k`) flattens most noticeably.
### Interpretation
This chart presents a compelling case for the efficiency of the authors' proposed methods (`short-1@k` and `short-3@k`). The core message is that these methods deliver superior accuracy (a roughly 3-4 percentage point advantage at 40k-50k tokens) compared to the `majority@k` baseline while using the same or fewer computational resources (thinking tokens).
The near-overlap of the two "short" methods suggests that the specific variant (`-1` vs `-3`) has a minor impact compared to the fundamental advantage they both hold over the baseline. The data implies that the proposed techniques are more effective at converting computational "thinking" into accurate outcomes. The widening gap in the mid-range of compute (20k-40k tokens) is particularly notable, indicating this is where the new methods offer the greatest relative benefit. The chart effectively argues that investing thinking tokens yields a better return on accuracy with the authors' approach.
</details>
(b) R1-7B
Figure 23: Small models - thinking compute comparison over GPQA-D.
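The `short-m@k` methods compared in these figures follow the rule described in the paper: sample k chains in parallel, halt once the first m thinking processes finish, and majority-vote over those m answers. A minimal sketch of that selection rule, assuming each completion is modeled as a `(thinking_tokens, answer)` pair (the representation and function name are illustrative, not the authors' code):

```python
from collections import Counter

def short_m_at_k(completions, m):
    """Sketch of the short-m@k rule: keep the m chains that finish
    first (i.e., with the fewest thinking tokens) and majority-vote
    over their final answers."""
    first_to_finish = sorted(completions, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_to_finish)
    return votes.most_common(1)[0][0]

# Four sampled chains; the three shortest answer "B", "B", and "A",
# so the majority vote among them returns "B".
print(short_m_at_k([(1200, "A"), (800, "B"), (3000, "A"), (900, "B")], m=3))
```

With `m=1` this reduces to simply taking the answer of the first chain to finish, matching the `short-1@k` curves in the plots.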
<details>
<summary>x78.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different 'k' Values
### Overview
The image is a scatter plot comparing model accuracy against the time taken to generate an answer (measured in thousands of thinking steps). The plot displays data points for three distinct model configurations, differentiated by marker shape and color, across various 'k' values (likely a parameter such as the number of reasoning steps or candidates considered).
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from approximately 4.5 to 10.5, with major gridlines at intervals of 1 (5, 6, 7, 8, 9, 10).
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from 0.52 to 0.57, with major gridlines at intervals of 0.01.
* **Data Series & Legend (Inferred from markers):**
* **Series 1 (Light Blue Squares):** Represents one model configuration.
* **Series 2 (Cyan Diamonds):** Represents a second model configuration.
* **Series 3 (Dark Red Circles):** Represents a third model configuration.
* **Data Point Labels:** Each marker is annotated with a text label indicating its 'k' value (e.g., "k=1", "k=3", "k=5", "k=9").
### Detailed Analysis
**Data Points (Approximate Coordinates & Labels):**
* **Light Blue Squares:**
* Point 1: (x ≈ 4.7, y ≈ 0.556) - Label: **k=9**
* Point 2: (x ≈ 5.0, y ≈ 0.557) - Label: **k=5**
* Point 3: (x ≈ 5.5, y ≈ 0.552) - Label: **k=3**
* **Cyan Diamonds:**
* Point 4: (x ≈ 5.8, y ≈ 0.567) - Label: **k=9**
* Point 5: (x ≈ 7.0, y ≈ 0.564) - Label: **k=5**
* Point 6: (x ≈ 7.0, y ≈ 0.522) - Label: **k=1** (This is the lowest accuracy point on the chart).
* Point 7: (x ≈ 8.6, y ≈ 0.554) - Label: **k=3**
* **Dark Red Circles:**
* Point 8: (x ≈ 8.6, y ≈ 0.546) - Label: **k=3**
* Point 9: (x ≈ 9.3, y ≈ 0.562) - Label: **k=5**
* Point 10: (x ≈ 10.0, y ≈ 0.568) - Label: **k=9** (This is the highest accuracy and highest time-to-answer point).
**Trend Verification:**
* **Light Blue Squares:** The trend is not strictly linear. Accuracy is highest for k=5 and k=9 at lower time costs (~4.7-5.0), then dips for k=3 at a slightly higher time (~5.5).
* **Cyan Diamonds:** Shows a complex, non-monotonic relationship. The highest accuracy (k=9) occurs at a moderate time (~5.8). Accuracy drops for k=5 at higher time (~7.0), plummets for k=1 at the same time, and recovers partially for k=3 at an even higher time (~8.6).
* **Dark Red Circles:** Shows a clear positive trend. As time-to-answer increases from ~8.6 to 10.0, accuracy consistently increases from ~0.546 to ~0.568 for k=3, k=5, and k=9 respectively.
### Key Observations
1. **Performance Clusters:** The three marker types occupy distinct regions of the time-accuracy space. Light blue squares are clustered in the low-time, mid-accuracy region. Cyan diamonds are spread across the middle-time range with high variance in accuracy. Dark red circles are clustered in the high-time, high-accuracy region.
2. **The k=1 Outlier:** The cyan diamond labeled "k=1" is a significant outlier, showing the lowest accuracy (~0.522) despite a moderate time cost (~7.0), suggesting this configuration is highly inefficient.
3. **Highest Achiever:** The dark red circle (k=9) at (10.0, 0.568) achieves the peak accuracy but at the highest computational cost.
4. **Efficiency Frontier:** The most efficient points (high accuracy for low time) appear to be the light blue square (k=9) and the cyan diamond (k=9), both achieving accuracy >0.555 with time-to-answer <6.0.
### Interpretation
This chart visualizes the **trade-off between computational cost (time) and performance (accuracy)** for different model settings. The 'k' parameter significantly influences this trade-off.
* **Diminishing Returns:** For the dark red circle series, increasing 'k' from 3 to 9 yields a clear accuracy gain but requires a substantial increase in thinking time. This suggests a region of diminishing returns where more computation leads to better, but not proportionally better, results.
* **Configuration Matters More Than k:** The stark separation between the three marker series indicates that the underlying model configuration (represented by shape/color) is a more fundamental determinant of the time-accuracy profile than the 'k' value alone. A cyan diamond with k=9 is far more efficient than a dark red circle with k=3.
* **The Cost of Low k:** The poor performance of the k=1 point suggests that a minimum level of reasoning or search (a higher 'k') is necessary for competent performance. Setting k too low cripples accuracy without saving meaningful time compared to better configurations.
* **Strategic Choice:** The optimal choice depends on the priority. For speed-critical applications, a light blue square configuration with k=5 or k=9 is optimal. For maximum accuracy regardless of cost, the dark red circle with k=9 is the choice. The cyan diamond series offers a middle ground but with unpredictable performance at certain 'k' values.
In essence, the chart is a tool for **resource allocation**, helping to select the model configuration and 'k' parameter that best balances the need for quick answers against the need for correct answers.
</details>
(a) LN-Nano-8B
<details>
<summary>x79.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy vs. Time-to-Answer for Different Methods
### Overview
The image is a scatter plot comparing the performance of three different methods (`majority@k`, `short-1@k (Ours)`, and `short-3@k (Ours)`) across two metrics: **Accuracy** (y-axis) and **Time-to-Answer** (x-axis). The plot visualizes the trade-off between answer quality and computational cost (thinking time) for different values of a parameter `k`.
### Components/Axes
* **X-Axis:** Labeled **"Time-to-Answer (longest thinking in thousands)"**. The scale runs from approximately 4 to 12, with major gridlines at 4, 6, 8, 10, and 12. The unit is implied to be thousands of some time unit (e.g., tokens, steps, or milliseconds).
* **Y-Axis:** Labeled **"Accuracy"**. The scale runs from approximately 0.35 to 0.45, with major gridlines at 0.36, 0.38, 0.40, 0.42, and 0.44.
* **Legend:** Located in the bottom-right quadrant of the chart area. It defines three data series:
* **Red Circle:** `majority@k`
* **Blue Square:** `short-1@k (Ours)`
* **Cyan Diamond:** `short-3@k (Ours)`
* **Data Point Labels:** Each marker is annotated with its corresponding `k` value (e.g., `k=9`, `k=5`, `k=3`, `k=1`).
### Detailed Analysis
The plot contains ten distinct data points: three for each method at `k` values of 3, 5, and 9, plus an additional point for `short-3@k` at `k=1`.
**1. Data Series: `majority@k` (Red Circles)**
* **Trend:** This series shows a clear positive correlation. As Time-to-Answer increases, Accuracy also increases. The line connecting these points would slope upward from left to right.
* **Data Points (Approximate):**
* `k=3`: Time-to-Answer β 9.5, Accuracy β 0.378
* `k=5`: Time-to-Answer β 10.5, Accuracy β 0.408
* `k=9`: Time-to-Answer β 11.8, Accuracy β 0.432
**2. Data Series: `short-1@k (Ours)` (Blue Squares)**
* **Trend:** This series shows a negative correlation. As Time-to-Answer increases, Accuracy decreases. The line connecting these points would slope downward from left to right.
* **Data Points (Approximate):**
* `k=3`: Time-to-Answer β 5.0, Accuracy β 0.400
* `k=5`: Time-to-Answer β 4.5, Accuracy β 0.418
* `k=9`: Time-to-Answer β 4.0, Accuracy β 0.438
**3. Data Series: `short-3@k (Ours)` (Cyan Diamonds)**
* **Trend:** This series also shows a negative correlation, similar to `short-1@k`. As Time-to-Answer increases, Accuracy decreases. The line slopes downward from left to right.
* **Data Points (Approximate):**
* `k=1`: Time-to-Answer β 7.0, Accuracy β 0.355 (This is the lowest accuracy point on the chart).
* `k=3`: Time-to-Answer β 9.2, Accuracy β 0.402
* `k=5`: Time-to-Answer β 6.8, Accuracy β 0.424
* `k=9`: Time-to-Answer β 5.5, Accuracy β 0.448 (This is the highest accuracy point on the chart).
### Key Observations
1. **Performance Frontier:** The proposed methods (`short-1@k` and `short-3@k`) occupy the top-left region of the plot, indicating they achieve higher accuracy with lower time-to-answer compared to the `majority@k` baseline for most `k` values.
2. **Inverse Relationship for Proposed Methods:** Both "Ours" methods demonstrate an inverse relationship between time and accuracy. Increasing `k` for these methods leads to higher accuracy but *lower* time-to-answer, which is a highly desirable efficiency characteristic.
3. **Baseline Trade-off:** The `majority@k` method shows a direct, positive trade-off: higher accuracy requires significantly more time.
4. **Outlier Point:** The `short-3@k` point at `k=1` is an outlier. It has the lowest accuracy (~0.355) and a moderate time-to-answer (~7.0), breaking the smooth downward trend of its series. This suggests `k=1` may be a suboptimal configuration for this method.
5. **Peak Performance:** The single highest accuracy point (~0.448) is achieved by `short-3@k` at `k=9`, and it does so with a relatively low time-to-answer (~5.5).
### Interpretation
This chart presents a compelling case for the efficiency of the authors' proposed methods (`short-1@k` and `short-3@k`). The core finding is that these methods invert the typical accuracy-speed trade-off seen in the `majority@k` baseline.
* **What the data suggests:** The proposed methods are not just faster; they become *more efficient* (higher accuracy per unit of time) as the parameter `k` increases. This is evidenced by the data points moving up and to the left as `k` grows. In contrast, the baseline method requires a linear increase in time to gain accuracy.
* **How elements relate:** The `k` parameter acts as a control knob. For the authors' methods, turning it up (`k=9`) optimizes both metrics simultaneously. For the baseline, turning it up improves one metric (accuracy) at the direct expense of the other (time).
* **Notable Implications:** The results imply that the "short" strategies (likely involving some form of early exiting or conditional computation) are fundamentally more scalable. The `short-3@k` method at `k=9` represents the optimal point in this evaluation, offering the best accuracy at a competitive speed. The poor performance at `k=1` for `short-3@k` indicates a minimum threshold of "short" paths or votes is needed for the method to be effective. This visualization strongly supports the adoption of the proposed methods over the majority-vote baseline for tasks where both accuracy and response time are critical.
</details>
(b) R1-7B
Figure 24: Small models - time-to-answer comparison over GPQA-D.
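The x-axis of these scatter plots ("longest thinking") reflects that with all k chains decoded in parallel, the answer is ready once the m-th chain finishes: `majority@k` must wait for every chain, while `short-m@k` halts at the m-th completion. A minimal sketch of that wall-time proxy; the accounting is our assumption about how the axis is computed, and the function name is illustrative:

```python
def time_to_answer(thinking_lengths, m=None):
    """Wall-time proxy under parallel decoding: the answer is ready
    once the m-th chain finishes. majority@k corresponds to m = k
    (wait for every chain); short-m@k halts at the m-th completion."""
    if m is None:
        m = len(thinking_lengths)  # majority@k: wait for the longest chain
    return sorted(thinking_lengths)[m - 1]

lengths = [8_000, 12_000, 5_000, 20_000]  # hypothetical thinking-token counts
print(time_to_answer(lengths))       # majority@k waits for the 20k-token chain
print(time_to_answer(lengths, m=1))  # short-1@k halts at the 5k-token chain
```

This is why, in the plots, increasing `k` moves the `short-m@k` points *left* (more chains make an early finisher more likely) while it moves the `majority@k` points *right*.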
## Appendix E GPQA-D sequential results
The results for GPQA-D when accounting for sequential compute are presented in Figure 25.
<details>
<summary>x80.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting "Accuracy" against "Thinking Compute." It displays two distinct data series, showing how accuracy changes as the amount of thinking compute (measured in thousands of tokens) increases. Both series demonstrate a positive correlation, with accuracy rising sharply at lower compute levels before exhibiting diminishing returns.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear scale with major tick marks and grid lines at 10, 20, 30, 40, and 50. The axis starts at approximately 5.
* **Y-Axis (Vertical):**
* **Label:** "Accuracy"
* **Scale:** Linear scale with major tick marks and grid lines at 0.650, 0.655, 0.660, 0.665, 0.670, 0.675, 0.680, 0.685, and 0.690.
* **Data Series:**
1. **Cyan Line with Diamond Markers:** Positioned as the upper line for most of the chart.
2. **Brown Line with Circle Markers:** Positioned as the lower line for most of the chart.
* **Legend:** No explicit legend is present within the chart area. The series are distinguished solely by color and marker shape.
* **Grid:** A light gray grid is present, aligned with the major ticks on both axes.
### Detailed Analysis
**Data Series 1: Cyan Line (Diamond Markers)**
* **Trend:** The line shows a steep, concave-downward increase. It rises rapidly from low compute, begins to decelerate around 20-30k tokens, and plateaus near the top of the chart.
* **Approximate Data Points:**
* (~5, 0.651)
* (~10, 0.668)
* (~15, 0.675)
* (~20, 0.683)
* (~25, 0.688)
* (~30, 0.690)
* (~35, 0.691)
* (~40, 0.691)
* (~45, 0.691)
* (~50, 0.691)
**Data Series 2: Brown Line (Circle Markers)**
* **Trend:** The line also shows a concave-downward increase but with a consistently lower slope than the cyan line. It rises steadily and continues a gradual ascent even at higher compute levels, without a clear plateau within the plotted range.
* **Approximate Data Points:**
* (~5, 0.651) *[Starts at the same point as the cyan line]*
* (~15, 0.671)
* (~20, 0.675)
* (~25, 0.678)
* (~30, 0.680)
* (~35, 0.682)
* (~40, 0.683)
* (~45, 0.684)
* (~50, 0.684)
### Key Observations
1. **Performance Gap:** A clear performance gap emerges immediately after the first data point. The cyan series consistently achieves higher accuracy than the brown series for the same amount of thinking compute beyond ~5k tokens.
2. **Diminishing Returns:** Both series exhibit diminishing returns. The cyan series' gains become negligible after approximately 30k thinking tokens, while the brown series' gains slow but do not fully plateau.
3. **Convergence at Origin:** Both lines appear to originate from the same point at the lowest compute value (~5k tokens, ~0.651 accuracy), suggesting a common baseline performance.
4. **Slope Comparison:** The cyan line has a steeper initial slope, indicating it translates additional compute into accuracy gains more efficiently in the low-to-mid compute range (5k-25k tokens).
### Interpretation
This chart likely compares the scaling efficiency of two different models, algorithms, or prompting strategies (represented by the cyan and brown lines) as a function of allocated "thinking" compute. The data suggests that the method represented by the **cyan line is significantly more compute-efficient**, achieving higher accuracy with fewer resources and reaching a performance ceiling (around 0.691 accuracy) that the brown method does not approach within the tested range.
The brown line's continued, albeit slow, ascent suggests it may eventually reach similar accuracy levels but would require substantially more compute, making it less efficient. The shared starting point implies that for very simple tasks requiring minimal thinking (low compute), the methods perform identically. The divergence highlights how architectural or procedural differences become critical as task complexity (and required compute) increases. The plateau of the cyan line could indicate a fundamental limit of that approach or the task's inherent difficulty.
</details>
(a) LN-Super-49B
<details>
<summary>x81.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in "thinking tokens." It displays three distinct data series, each represented by a different colored line with markers, showing how accuracy changes as the thinking compute increases from approximately 5,000 to 60,000 tokens.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Thinking Compute (thinking tokens in thousands)". The scale runs from 0 to 60, with major tick marks at 10, 20, 30, 40, 50, and 60. The axis represents the amount of computational resources (in thousands of tokens) allocated to a model's reasoning process.
* **Y-Axis (Vertical):** Labeled "Accuracy". The scale runs from 0.620 to 0.650, with major tick marks at 0.620, 0.625, 0.630, 0.635, 0.640, 0.645, and 0.650. This represents a performance metric, likely a proportion or score.
* **Legend:** There is no explicit legend box within the chart area. The three data series are distinguished solely by color and marker shape.
* **Grid:** A light gray grid is present, with vertical lines at each major x-axis tick and horizontal lines at each major y-axis tick.
### Detailed Analysis
The chart contains three data series. Their trends and approximate data points are as follows:
**1. Cyan Line with Diamond Markers:**
* **Trend:** Shows the steepest initial increase in accuracy, followed by a continued but more gradual rise, achieving the highest final accuracy.
* **Data Points (Approximate):**
* (5, 0.620)
* (10, 0.633)
* (15, 0.636)
* (20, 0.640)
* (25, 0.644)
* (30, 0.647)
* (35, 0.649)
* (40, 0.650)
* (45, 0.651)
* (50, 0.651)
* (55, 0.651)
**2. Red Line with Circle Markers:**
* **Trend:** Shows a steady, consistent increase in accuracy across the entire range, with a slope that is less steep than the cyan line's initial phase but remains positive throughout.
* **Data Points (Approximate):**
* (5, 0.620)
* (15, 0.636)
* (20, 0.637)
* (25, 0.640)
* (30, 0.643)
* (35, 0.645)
* (40, 0.646)
* (45, 0.647)
* (50, 0.648)
* (55, 0.648)
* (60, 0.648)
**3. Blue Line with Square Markers:**
* **Trend:** Shows a rapid initial increase in accuracy, which then plateaus very early (around 15-20k tokens) and remains nearly flat for the remainder of the chart.
* **Data Points (Approximate):**
* (5, 0.620)
* (10, 0.633)
* (15, 0.636)
* (20, 0.637)
* (25, 0.637)
* (30, 0.636)
* (35, 0.636)
* (40, 0.637)
* (45, 0.637)
### Key Observations
1. **Common Starting Point:** All three models/series begin at the same accuracy point (~0.620) at the lowest compute level (5k tokens).
2. **Diverging Paths:** The performance diverges significantly after the initial point. The cyan line consistently outperforms the others from about 20k tokens onward.
3. **Plateau Behavior:** The blue line exhibits an early and severe performance plateau, showing negligible accuracy gains beyond ~15k thinking tokens. The cyan and red lines show diminishing returns but continue to improve, with the cyan line's gains becoming very marginal after ~45k tokens.
4. **Final Performance Hierarchy:** At the highest comparable compute levels (~55k tokens), the final accuracy order is: Cyan (~0.651) > Red (~0.648) > Blue (~0.637).
### Interpretation
This chart demonstrates the relationship between allocated reasoning compute ("thinking tokens") and task accuracy for three different models or model configurations. The data suggests:
* **The Efficiency-Performance Trade-off:** The blue line represents a model that is highly efficient at low compute but hits a hard performance ceiling quickly. It cannot leverage additional compute for better results.
* **The Value of Scalable Reasoning:** The cyan line represents a model architecture or method that effectively translates increased computational investment into higher accuracy, showing strong scalability. It is the most capable when sufficient resources are available.
* **A Middle Ground:** The red line shows a model that scales reliably but less efficiently than the cyan model. It requires more compute to reach the same accuracy levels the cyan model achieves earlier.
* **Practical Implication:** The choice between these models would depend on the computational budget. For low-latency or resource-constrained applications, the blue model might be sufficient. For tasks where maximum accuracy is critical and compute is available, the cyan model is superior. The chart underscores that "more compute" only improves performance if the underlying model is designed to utilize it effectively.
</details>
(b) R1-32B
<details>
<summary>x82.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute
### Overview
The image is a line chart plotting model accuracy against computational effort, measured in thinking tokens. It compares the performance of three distinct models or methods, represented by three colored lines with different markers. The chart demonstrates how accuracy scales with increased compute for each approach.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear, ranging from approximately 10 to 85 (thousands of tokens).
* **Major Tick Marks:** 20, 40, 60, 80.
* **Y-Axis:**
* **Label:** "Accuracy"
* **Scale:** Linear, ranging from 0.635 to 0.665.
* **Major Tick Marks:** 0.635, 0.640, 0.645, 0.650, 0.655, 0.660, 0.665.
* **Legend:** Located in the top-right corner of the plot area. It contains three entries:
1. A cyan line with diamond markers (◆).
2. A blue line with square markers (■).
3. A red line with circle markers (●).
* **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
**Data Series 1: Cyan Line with Diamond Markers (◆)**
* **Trend:** Shows a strong, consistent, and slightly decelerating upward trend across the entire compute range. It is the top-performing series for compute values above ~20k tokens.
* **Key Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.636
* At ~20k tokens: Accuracy ≈ 0.650
* At ~40k tokens: Accuracy ≈ 0.661
* At ~60k tokens: Accuracy ≈ 0.664
* At ~80k tokens: Accuracy ≈ 0.666 (highest point on the chart)
**Data Series 2: Blue Line with Square Markers (■)**
* **Trend:** Increases sharply initially, then plateaus and slightly declines after reaching a peak. It performs similarly to the cyan line at very low compute but is quickly surpassed.
* **Key Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.636 (similar starting point to cyan)
* At ~20k tokens: Accuracy ≈ 0.650
* At ~30k tokens: Accuracy ≈ 0.652
* At ~40k tokens: Accuracy ≈ 0.653
* At ~50k tokens: Accuracy ≈ 0.653 (peak)
* At ~65k tokens: Accuracy ≈ 0.652 (shows a slight decrease from peak)
**Data Series 3: Red Line with Circle Markers (●)**
* **Trend:** Shows a steady, approximately linear increase, but with a shallower slope than the cyan line. It consistently performs below the cyan line but eventually surpasses the plateauing blue line.
* **Key Data Points (Approximate):**
* At ~10k tokens: Accuracy ≈ 0.636
* At ~25k tokens: Accuracy ≈ 0.647
* At ~40k tokens: Accuracy ≈ 0.653
* At ~60k tokens: Accuracy ≈ 0.656
* At ~80k tokens: Accuracy ≈ 0.657
### Key Observations
1. **Common Origin:** All three models start at nearly the same accuracy point (~0.636) at the lowest compute level (~10k tokens).
2. **Divergence:** Performance diverges significantly as compute increases. The cyan model scales the best, the red model scales moderately, and the blue model stops scaling after ~50k tokens.
3. **Crossover Point:** The red line crosses above the blue line at approximately 45k thinking tokens, indicating that for higher compute budgets, the "red" method becomes more accurate than the "blue" method.
4. **Diminishing Returns:** All curves show signs of diminishing returns (the slope decreases as compute increases), but the degree varies greatly. The cyan curve maintains a positive slope throughout, while the blue curve's slope becomes zero or slightly negative.
### Interpretation
This chart illustrates a fundamental trade-off in AI model scaling: the relationship between computational investment (thinking tokens) and performance (accuracy). The data suggests:
* **Model Architecture/Method Matters Profoundly:** The three lines likely represent different model architectures, training techniques, or inference strategies. The "cyan" method is significantly more efficient at converting additional compute into accuracy gains.
* **The "Blue" Method Hits a Ceiling:** The plateau of the blue line indicates a performance bottleneck. This could be due to a model capacity limit, a flaw in the training objective, or an inherent limitation of that specific approach that cannot be overcome simply by throwing more compute at it.
* **Strategic Compute Allocation:** For low-compute applications (<20k tokens), the choice between methods may be less critical. However, for high-compute, high-performance scenarios, selecting the "cyan" method is crucial. The "red" method offers a middle ground, providing steady but less efficient scaling.
* **Investigative Question:** The stark difference in scaling behavior invites investigation into the underlying causes. What architectural feature allows the cyan model to keep improving? Why does the blue model saturate? This chart provides empirical evidence to guide such research and development decisions.
</details>
(c) QwQ-32B
<details>
<summary>x83.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Thinking Compute for Different Methods
### Overview
The image is a line chart comparing the performance of three different methods ("majority@k", "short-1@k (Ours)", and "short-3@k (Ours)") in terms of accuracy as a function of thinking compute, measured in thousands of thinking tokens. The chart demonstrates how accuracy scales with increased computational resources for each method.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Title:** "Thinking Compute (thinking tokens in thousands)"
* **Scale:** Linear, ranging from approximately 10 to 125.
* **Major Tick Marks:** 20, 40, 60, 80, 100, 120.
* **Y-Axis:**
* **Title:** "Accuracy"
* **Scale:** Linear, ranging from 0.74 to 0.81.
* **Major Tick Marks:** 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.81.
* **Legend:**
* **Position:** Bottom-right corner of the chart area.
* **Entries:**
1. `majority@k` - Represented by a red line with circular markers.
2. `short-1@k (Ours)` - Represented by a blue line with square markers.
3. `short-3@k (Ours)` - Represented by a cyan (light blue) line with diamond markers.
### Detailed Analysis
The chart plots three distinct data series. Below is an analysis of each, including approximate data points extracted from the grid.
**1. Data Series: `majority@k` (Red line, circle markers)**
* **Trend:** Shows a steady, near-linear upward slope across the entire range of thinking compute. It starts as the lowest-performing method at low compute but eventually surpasses the others.
* **Approximate Data Points:**
* ~10k tokens: Accuracy ≈ 0.740
* ~40k tokens: Accuracy ≈ 0.770
* ~60k tokens: Accuracy ≈ 0.790
* ~80k tokens: Accuracy ≈ 0.795
* ~100k tokens: Accuracy ≈ 0.802
* ~120k tokens: Accuracy ≈ 0.808
**2. Data Series: `short-1@k (Ours)` (Blue line, square markers)**
* **Trend:** Exhibits a rapid initial increase in accuracy, which then plateaus and slightly declines after approximately 60k thinking tokens. It shows diminishing returns.
* **Approximate Data Points:**
* ~10k tokens: Accuracy ≈ 0.740
* ~20k tokens: Accuracy ≈ 0.762
* ~30k tokens: Accuracy ≈ 0.769
* ~40k tokens: Accuracy ≈ 0.772
* ~60k tokens: Accuracy ≈ 0.774 (peak)
* ~80k tokens: Accuracy ≈ 0.774
* ~90k tokens: Accuracy ≈ 0.773
**3. Data Series: `short-3@k (Ours)` (Cyan line, diamond markers)**
* **Trend:** Shows a very steep initial increase, followed by a continued but more gradual rise. It maintains the highest accuracy for most of the middle range (approx. 30k to 80k tokens) before being overtaken by `majority@k`.
* **Approximate Data Points:**
* ~10k tokens: Accuracy ≈ 0.740
* ~20k tokens: Accuracy ≈ 0.763
* ~30k tokens: Accuracy ≈ 0.780
* ~40k tokens: Accuracy ≈ 0.789
* ~60k tokens: Accuracy ≈ 0.795
* ~80k tokens: Accuracy ≈ 0.797
* ~100k tokens: Accuracy ≈ 0.799
### Key Observations
1. **Convergence at Low Compute:** All three methods start at approximately the same accuracy (≈0.740) when thinking compute is very low (~10k tokens).
2. **Performance Crossover:** There is a notable crossover point between 80k and 100k tokens where the steadily rising `majority@k` line surpasses the `short-3@k` line.
3. **Plateau Behavior:** The `short-1@k` method clearly plateaus, suggesting a limit to its performance gain from additional compute. In contrast, `majority@k` shows no sign of plateauing within the charted range.
4. **Efficiency of "Ours" Methods:** Both methods labeled "(Ours)" achieve higher accuracy than the baseline (`majority@k`) at lower to medium compute budgets (e.g., at 40k tokens, `short-3@k` is ~0.789 vs. `majority@k`'s ~0.770).
### Interpretation
This chart likely comes from a research paper on efficient inference or reasoning in language models, where "thinking tokens" represent intermediate computation steps. The data suggests:
* **Trade-off Between Efficiency and Peak Performance:** The proposed methods (`short-1@k`, `short-3@k`) are more **compute-efficient**, reaching high accuracy levels with fewer thinking tokens. `short-3@k` is particularly effective in the mid-range. However, the baseline `majority@k` method, while less efficient, appears to have a higher **ultimate performance ceiling** if given sufficient compute resources.
* **Methodological Insight:** The "short" methods halt generation once the first one or three of the k sampled thinking chains finish, which yields quick gains at low budgets but eventual saturation. The `majority@k` baseline (majority voting over all k reasoning paths) scales more predictably with compute, implying it can leverage additional resources to refine answers further without an obvious limit in this range.
* **Practical Implication:** The choice of method depends on the operational constraint. For applications with a strict budget on inference compute (tokens), `short-3@k` is optimal. If maximum accuracy is the goal and compute is less constrained, `majority@k` becomes preferable at higher token counts. The chart provides a clear empirical basis for making this trade-off decision.
</details>
(d) R1-670B
Figure 25: Comparing different methods for the GPQA-D benchmark under sequential compute.
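The short-m@k selection rule compared in these plots can be written as a small sketch. This is illustrative, not the paper's code; `chains` is assumed to hold one `(thinking_tokens, answer)` pair per parallel generation:

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Keep the m chains whose thinking finished first (fewest thinking
    tokens) and majority-vote over their final answers.

    chains: list of (thinking_tokens, answer) pairs, one per the k
    parallel generations. Names here are illustrative.
    """
    finished_first = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in finished_first)
    # Ties are broken toward the answer of the shortest chain, since
    # Counter.most_common preserves first-insertion order on ties.
    return votes.most_common(1)[0][0]
```

With m=1 this reduces to taking the first chain to finish; larger m trades some of that speed for the robustness of voting.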
## Appendix F Backtracks under controlled length results
Below we present the backtrack counts under the controlled-length scenario (Section 5.2). Results over the math benchmarks are presented in Table 8, and for GPQA-D in Table 9.
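As an illustrative sketch of how such statistics can be gathered (the marker list below is an assumption, not the paper's exact backtrack detector), one can count backtrack phrases in each thinking chain and assign the chain to the length bins used in Tables 8 and 9:

```python
import re

# Assumed backtrack markers; the paper's detector may differ.
BACKTRACK_MARKERS = ("wait", "alternatively", "let me re-check")

# Thinking-length bin edges and labels matching Tables 8 and 9.
BIN_EDGES = [5_000, 10_000, 15_000, 20_000, 25_000, 30_000, 32_000]
BIN_LABELS = ["0-5k", "5-10k", "10-15k", "15-20k", "20-25k", "25-30k", "30-32k"]

def count_backtracks(thinking_text):
    """Count occurrences of any backtrack marker, case-insensitively."""
    text = thinking_text.lower()
    return sum(len(re.findall(re.escape(m), text)) for m in BACKTRACK_MARKERS)

def length_bin(thinking_tokens):
    """Map a thinking-token count to its table bin label."""
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if thinking_tokens < edge:
            return label
    return BIN_LABELS[-1]  # clip anything at or above 32k into the last bin
```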
Table 8: Average number of backtracks for correct (C), incorrect (IC) answers, binned by thinking length. Results are averaged across math benchmarks.
| Model \ Thinking Tokens | 0-5k | 5-10k | 10-15k | 15-20k | 20-25k | 25-30k | 30-32k |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LN-Super-49B | 35/64 | 100/133 | 185/236 | 261/299 | 307/320 | 263/323 | β/304 |
| R1-32B | 29/74 | 88/166 | 171/279 | 244/351 | 334/370 | 268/355 | 326/1006 |
| QwQ-32B | 50/148 | 120/174 | 194/247 | 285/353 | 354/424 | 390/476 | 551/469 |
| R1-670B | 58/27 | 100/86 | 143/184 | 222/203 | 264/254 | 309/289 | 352/337 |
Table 9: Average number of backtracks for correct (C), incorrect (IC) answers, binned by thinking length. Results are reported for GPQA-D.
| Model \ Thinking Tokens | 0-5k | 5-10k | 10-15k | 15-20k | 20-25k | 25-30k | 30-32k |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LN-Super-49B | 38/52 | 175/164 | 207/213 | β/β | β/β | β/β | β/β |
| R1-32B | 39/54 | 194/221 | 301/375 | 525/668 | β/β | β/β | β/β |
| QwQ-32B | 65/71 | 169/178 | 333/358 | 378/544 | 357/703 | β/β | β/β |
| R1-670B | 44/72 | 93/155 | 178/232 | 297/300 | 341/341 | 463/382 | 553/477 |
## Appendix G Further details and results for finetuning experiments
The generation process of all three variants of S1 uses the hyperparameters detailed in Section 3.1. Figure 26 shows the thinking token count histograms for the three variants of the S1 dataset (short/long/random) presented in Section 6.
For finetuning, we follow the S1 approach and finetune the Qwen-2.5-7B-Instruct and Qwen-2.5-32B-Instruct models on the three S1 variants. The finetuning hyperparameters are consistent with those used for the S1.1 model (Muennighoff et al., 2025), and training is conducted on 32 H100 GPUs. We match the number of gradient steps used for S1.1. The resulting models are evaluated using the benchmarks and experimental setup described in Section 3.1. Specifically, for each model we generate 20 answers per example and report average accuracy.
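Constructing the three S1 variants amounts to picking one reasoning chain per question according to a length policy. A minimal sketch, assuming `samples_per_question` maps each question id to a list of `(thinking_tokens, chain_text)` pairs (names are illustrative):

```python
import random

def build_s1_variant(samples_per_question, policy, seed=0):
    """Keep one reasoning chain per question according to `policy`:
    'short' = fewest thinking tokens, 'long' = most, 'random' = uniform.
    """
    rng = random.Random(seed)
    dataset = {}
    for qid, chains in samples_per_question.items():
        if policy == "short":
            dataset[qid] = min(chains, key=lambda c: c[0])
        elif policy == "long":
            dataset[qid] = max(chains, key=lambda c: c[0])
        elif policy == "random":
            dataset[qid] = rng.choice(chains)
        else:
            raise ValueError(f"unknown policy: {policy}")
    return dataset
```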
Results for the 7B model are presented in Table 10.
<details>
<summary>x84.png Details</summary>

### Visual Description
## Histogram: Distribution of Thinking Token Usage
### Overview
The image displays a histogram illustrating the frequency distribution of "Thinking Tokens" used, measured in thousands. The chart shows a right-skewed distribution, indicating that most instances involve a moderate number of thinking tokens, with a long tail extending toward higher token counts.
### Components/Axes
* **Chart Type:** Histogram (vertical bars).
* **X-Axis (Horizontal):**
* **Label:** "Number of Thinking Tokens (in thousands)"
* **Scale:** Linear scale from 0 to 30, with major tick marks and labels at intervals of 5 (0, 5, 10, 15, 20, 25, 30).
* **Bin Width:** Each bar represents a range of 2.5 thousand tokens (e.g., 0-2.5, 2.5-5, 5-7.5, etc.).
* **Y-Axis (Vertical):**
* **Label:** "Frequency"
* **Scale:** Linear scale from 0 to over 200, with major tick marks and labels at intervals of 50 (0, 50, 100, 150, 200).
* **Legend:** None present. The chart represents a single data series.
* **Visual Style:** Bars are filled with a light blue color and have a thin, dark outline. The background is white with a light gray grid aligned with the major y-axis ticks.
### Detailed Analysis
The following table reconstructs the approximate frequency for each token range (bin), derived from visual inspection of the bar heights against the y-axis scale. Values are approximate.
| Thinking Tokens (in thousands) | Approximate Frequency |
| :--- | :--- |
| 0 - 2.5 | 30 |
| 2.5 - 5 | 90 |
| 5 - 7.5 | 180 |
| **7.5 - 10** | **~215** (Peak) |
| 10 - 12.5 | 165 |
| 12.5 - 15 | 120 |
| 15 - 17.5 | 85 |
| 17.5 - 20 | 45 |
| 20 - 22.5 | 30 |
| 22.5 - 25 | 15 |
| 25 - 27.5 | 10 |
| 27.5 - 30 | 5 |
**Trend Verification:** The visual trend is a rapid rise to a peak, followed by a steady decline. The frequency increases sharply from the 0-2.5k bin to the 7.5-10k bin (the mode), then decreases progressively for each subsequent higher token range, forming a long right tail.
### Key Observations
1. **Modal Peak:** The most common usage range is between 7,500 and 10,000 thinking tokens.
2. **Right Skew:** The distribution is positively skewed. The tail on the right side (higher token counts) is longer and more gradual than the left side.
3. **Concentration:** The vast majority of instances (the bulk of the distribution) fall between 2,500 and 17,500 tokens.
4. **Rare High Usage:** Instances requiring more than 20,000 tokens are relatively infrequent, and those exceeding 25,000 are very rare.
5. **Lower Bound:** There is a non-zero frequency for the lowest bin (0-2.5k), indicating some processes complete with minimal thinking tokens.
### Interpretation
This histogram provides a quantitative profile of computational "effort" or complexity, as measured by thinking token consumption.
* **Typical Process Complexity:** The concentration of data between 5k and 15k tokens suggests that the standard or most common tasks processed by this system require a moderate amount of sequential reasoning or computation.
* **Efficiency and Outliers:** The sharp peak indicates a well-defined typical operating range. The long right tail is significant; it demonstrates that while rare, there exists a subset of tasks that are substantially more complex, requiring 2-3 times the token usage of a typical task. This could represent edge cases, highly intricate problems, or potential inefficiencies in certain processes.
* **System Design Implication:** The distribution informs resource allocation. System capacity should be designed to comfortably handle the 7.5k-10k token range for peak load, while also having the capability (e.g., timeout limits, memory allocation) to accommodate the less frequent but much more demanding tasks in the 20k+ range without failure.
* **Absence of a Lower Mode:** The lack of a significant peak near zero suggests that very few tasks are trivial; most require some substantive "thinking."
</details>
(a) S1-short
<details>
<summary>x85.png Details</summary>

### Visual Description
## Histogram: Distribution of Thinking Token Counts
### Overview
The image displays a histogram chart illustrating the frequency distribution of a dataset measuring the "Number of Thinking Tokens (in thousands)." The chart uses vertical bars to show how many instances (frequency) fall within specific ranges (bins) of token counts. The overall shape indicates a right-skewed distribution.
### Components/Axes
* **Chart Type:** Histogram (vertical bars).
* **X-Axis (Horizontal):**
* **Label:** "Number of Thinking Tokens (in thousands)"
* **Scale:** Linear scale from 0 to approximately 32.5.
* **Major Tick Marks:** Labeled at 0, 5, 10, 15, 20, 25, 30.
* **Bin Width:** Each bar appears to represent a range of 2.5 thousand tokens (e.g., 0-2.5, 2.5-5, 5-7.5, etc.).
* **Y-Axis (Vertical):**
* **Label:** "Frequency"
* **Scale:** Linear scale from 0 to over 200.
* **Major Tick Marks:** Labeled at 0, 50, 100, 150, 200.
* **Data Series:** A single data series represented by solid, medium-purple bars with thin black outlines.
* **Legend:** None present. The chart contains only one data series.
* **Grid:** A light gray grid is present in the background, aligned with the major tick marks on both axes.
* **Spatial Layout:** The chart area is centered within the frame. The x-axis label is centered below the axis. The y-axis label is rotated 90 degrees and centered to the left of the axis.
### Detailed Analysis
The histogram consists of 13 contiguous bars. The approximate frequency (height) for each bin, reading from left to right, is as follows. *Note: Values are visual estimates from the grid lines and carry inherent uncertainty.*
1. **Bin 0 - 2.5k:** Frequency β 15
2. **Bin 2.5k - 5k:** Frequency β 40
3. **Bin 5k - 7.5k:** Frequency β 95
4. **Bin 7.5k - 10k:** Frequency β 138 (This is the peak of the distribution)
5. **Bin 10k - 12.5k:** Frequency β 136
6. **Bin 12.5k - 15k:** Frequency β 120
7. **Bin 15k - 17.5k:** Frequency β 115
8. **Bin 17.5k - 20k:** Frequency β 95
9. **Bin 20k - 22.5k:** Frequency β 55
10. **Bin 22.5k - 25k:** Frequency β 54
11. **Bin 25k - 27.5k:** Frequency β 42
12. **Bin 27.5k - 30k:** Frequency β 33
13. **Bin 30k - 32.5k:** Frequency β 20 (Note: This final bar shows a slight increase from the previous bin).
**Trend Verification:** The visual trend shows a rapid increase in frequency from the first bin, peaking sharply at the 7.5k-10k bin. Following the peak, there is a steady, gradual decline in frequency as the number of thinking tokens increases, forming a long tail to the right. The final bin (30k-32.5k) presents a minor upward deviation from the declining trend.
### Key Observations
1. **Modal Peak:** The most common range for thinking tokens is between 7,500 and 10,000, with a frequency of approximately 138 instances.
2. **Right Skew:** The distribution is positively skewed (right-skewed). The tail on the right side (higher token counts) is longer and more gradual than the left side.
3. **Concentration:** The vast majority of the data (roughly bins 2 through 8, from 2.5k to 20k tokens) contains the highest frequencies. The frequency drops below 50 for all bins beyond 25k tokens.
4. **Potential Outlier/Anomaly:** The slight increase in frequency in the final bin (30k-32.5k) compared to the bin before it (27.5k-30k) is a minor anomaly in the otherwise smooth declining trend. This could indicate a small cluster of instances requiring very high token counts.
### Interpretation
This histogram visualizes the resource consumption (in "thinking tokens") for a set of tasks or processes. The data suggests that:
* **Typical Usage:** Most tasks require a moderate amount of "thinking," clustering between 5,000 and 15,000 tokens. The process is optimized or naturally tends toward this range.
* **Efficiency Tail:** A significant number of tasks are completed with relatively low token counts (under 5k), indicating efficiency for simpler cases.
* **Complexity Gradient:** The long right tail demonstrates that a subset of tasks is substantially more complex, requiring progressively more tokens. The smooth decline suggests a continuum of complexity rather than distinct categories.
* **High-End Cluster:** The small uptick at the far right (30k+ tokens) may represent a specific class of highly complex or outlier tasks that consistently demand maximum resources. This could be a target for optimization or further investigation to understand their unique characteristics.
The chart effectively communicates that while there is a typical "cost" for thinking, there is also significant variability, with a non-trivial number of cases demanding resources far beyond the average.
</details>
(b) S1-random
<details>
<summary>x86.png Details</summary>

### Visual Description
## Histogram: Distribution of Thinking Token Counts
### Overview
The image displays a histogram showing the frequency distribution of a metric called "Thinking Tokens" measured in thousands. The chart visualizes how often different ranges of token counts occur within a dataset. The overall shape suggests a right-skewed distribution with a notable outlier or separate category at the high end.
### Components/Axes
* **Chart Type:** Histogram (bar chart representing frequency distribution).
* **X-Axis (Horizontal):**
* **Label:** "Number of Thinking Tokens (in thousands)"
* **Scale:** Linear scale from 0 to 30, with major tick marks and labels at intervals of 5 (0, 5, 10, 15, 20, 25, 30).
* **Interpretation:** Each bar represents a bin (range) of token counts. The bins appear to be approximately 2.5 thousand tokens wide.
* **Y-Axis (Vertical):**
* **Label:** "Frequency"
* **Scale:** Linear scale from 0 to 200, with major tick marks and labels at intervals of 50 (0, 50, 100, 150, 200).
* **Visual Elements:**
* **Bars:** Colored in a muted red/salmon shade (`#c27d7d` approximate). Each bar's height represents the frequency (count) of observations within its corresponding token range.
* **Grid:** A light gray grid is present in the background, aligned with the major tick marks on both axes.
* **Legend:** No legend is present, as there is only one data series.
### Detailed Analysis
The histogram consists of 14 distinct bars. The following table reconstructs the approximate data, reading from left to right along the x-axis. Values are estimated based on bar height relative to the y-axis grid.
| Approx. Token Range (thousands) | Approx. Frequency (Count) | Visual Trend & Notes |
| :--- | :--- | :--- |
| 0 - 2.5 | ~5 | Very low frequency. |
| 2.5 - 5 | ~10 | Slight increase. |
| 5 - 7.5 | ~25 | Noticeable increase. |
| 7.5 - 10 | ~55 | Sharp increase. |
| 10 - 12.5 | ~85 | Continued increase. |
| 12.5 - 15 | ~100 | **Peak of the main distribution.** |
| 15 - 17.5 | ~90 | Slight decrease from peak. |
| 17.5 - 20 | ~80 | Further decrease. |
| 20 - 22.5 | ~85 | Slight rebound. |
| 22.5 - 25 | ~65 | Decrease. |
| 25 - 27.5 | ~50 | Plateau. |
| 27.5 - 30 | ~50 | Plateau continues. |
| 30 - 32.5 | ~35 | Decrease. |
| **> 32.5 (or 32.5-35)** | **~110** | **Significant, isolated high bar.** This bar is positioned to the right of the 30k mark, suggesting it represents a bin for tokens >30k or a final bin of 32.5-35k. Its height is the second-highest on the chart. |
**Trend Verification:** The main body of the data (from 0 to ~30k tokens) forms a unimodal, right-skewed distribution. It rises steadily to a peak at the 12.5-15k bin and then gradually declines. The final bar is a clear outlier, breaking the declining trend with a sharp, isolated spike.
### Key Observations
1. **Primary Mode:** The most common range for thinking tokens is between 12,500 and 15,000.
2. **Right Skew:** The tail of the distribution extends further to the right (higher token counts) than to the left, indicating a subset of instances requiring significantly more tokens.
3. **Significant Outlier Bin:** The final bar (representing >30k or 32.5-35k tokens) has a frequency (~110) nearly as high as the main peak (~100). This is not a gradual tail but a distinct, high-frequency cluster at the extreme end of the scale.
4. **Data Range:** The vast majority of observed token counts fall between 5,000 and 30,000.
### Interpretation
This histogram likely represents the distribution of computational effort (measured in "thinking tokens") required for a set of tasks or queries processed by an AI model. The data suggests:
* **Typical Performance:** Most tasks require a moderate amount of processing, clustering around 10,000 to 20,000 tokens, with a central tendency near 13,000-14,000 tokens.
* **Efficiency Tail:** The gradual decline from 15k to 30k shows that progressively fewer tasks require this higher level of effort.
* **Critical Anomaly:** The isolated spike at the far right is the most important feature. It indicates a **non-trivial subset of tasks (over 100 instances) that demand exceptionally high token counts (>30k)**. This could represent:
* A specific, complex category of problem.
* Potential inefficiencies or "hard" cases for the model.
* A separate class of data that was inadvertently included or requires different handling.
* A possible data artifact or binning error (e.g., all values >30k being lumped into one final bin).
**Conclusion:** The system's token usage is not uniformly distributed. While it operates efficiently for most tasks, there is a significant and distinct group of outlier cases that consume resources at a rate far exceeding the norm. Investigating the nature of these high-token tasks would be crucial for optimizing performance, managing costs, or understanding model limitations.
</details>
(c) S1-long
Figure 26: Thinking token count histograms for S1-short, S1-random and S1-long datasets.
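The histograms above use fixed 2.5k-token bins. A minimal sketch of that binning (the clipping of the final bar is an assumption based on the S1-long panel's description):

```python
def token_histogram(token_counts, bin_width=2_500, max_tokens=35_000):
    """Bin thinking-token counts into fixed-width bins, as in Figure 26.

    Returns a list of (bin_start, frequency) pairs; counts at or above
    max_tokens fall into the last bin, mirroring a clipped final bar.
    """
    n_bins = max_tokens // bin_width
    freqs = [0] * n_bins
    for t in token_counts:
        idx = min(t // bin_width, n_bins - 1)
        freqs[idx] += 1
    return [(i * bin_width, f) for i, f in enumerate(freqs)]
```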
Table 10: Results for our finetuned models over the S1 variants (S1-short/long/random) using Qwen-2.5-7B-Instruct.
| Model | GPQA-D Tokens $\downarrow$ | GPQA-D Acc. $\uparrow$ | AIME 2024 Tokens $\downarrow$ | AIME 2024 Acc. $\uparrow$ | AIME 2025 Tokens $\downarrow$ | AIME 2025 Acc. $\uparrow$ | HMMT Tokens $\downarrow$ | HMMT Acc. $\uparrow$ | Math Avg. Tokens $\downarrow$ | Math Avg. Acc. $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S1-random | 14095 | 39.1 | 25207 | 22.0 | 23822 | 22.0 | 25028 | 10.8 | 24686 | 18.2 |
| S1-long | 15528 (+10.2%) | 38.5 | 26210 | 21.7 | 24395 | 19.5 | 26153 | 9.2 | 25586 (+3.7%) | 16.8 |
| S1-short | 13093 (-7.1%) | 40.3 | 24495 | 22.0 | 21945 | 20.8 | 23329 | 11.2 | 23256 (-5.8%) | 18.0 |
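The parenthesized token deltas in Table 10 are relative changes against the S1-random row, which can be verified directly:

```python
def pct_change(new, base):
    """Relative change of `new` vs. `base`, in percent, one decimal."""
    return round(100 * (new - base) / base, 1)

# GPQA-D thinking tokens, relative to the S1-random baseline (14095):
# S1-long 15528 -> +10.2%, S1-short 13093 -> -7.1%.
# Math-average thinking tokens, baseline 24686:
# S1-short 23256 -> -5.8%.
```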