# Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
**Authors**: Benjamin Estermann, ETH Zürich, and Roger Wattenhofer, ETH Zürich
Abstract
Large Language Models (LLMs) have demonstrated remarkable text generation capabilities, and recent advances in training paradigms have led to breakthroughs in their reasoning performance. In this work, we investigate how the reasoning effort of such models scales with problem complexity. We use the infinitely scalable Tents puzzle, which has a known linear-time solution, to analyze this scaling behavior. Our results show that reasoning effort scales with problem size, but only up to a critical problem complexity. Beyond this threshold, the reasoning effort does not continue to increase, and may even decrease. This observation highlights a critical limitation in the logical coherence of current LLMs as problem complexity increases, and underscores the need for strategies to improve reasoning scalability. Furthermore, our results reveal significant performance differences between current state-of-the-art reasoning models when faced with increasingly complex logical puzzles.
1 Introduction
Large language models (LLMs) have demonstrated remarkable abilities in a wide range of natural language tasks, from text generation to complex problem-solving. Recent advances, particularly with models trained for enhanced reasoning, have pushed the boundaries of what machines can achieve in tasks requiring logical inference and deduction.
![A partially solved 6x6 Tents puzzle grid with trees, placed tents, and row/column count labels.](extracted/6290299/Figures/tents.png)
Figure 1: An example instance of a partially solved 6 by 6 tents puzzle. Tents need to be placed next to trees, away from other tents and fulfilling the row and column constraints.
A critical factor in the success of these advanced models is the ability to leverage increased computational resources at test time, allowing them to explore more intricate solution spaces. This capability raises a fundamental question: how does the "reasoning effort" of these models scale as the complexity of the problem increases?
Understanding this scaling relationship is crucial for several reasons. First, it sheds light on the fundamental nature of reasoning within LLMs, moving beyond simply measuring accuracy on isolated tasks. By examining how the computational demands, reflected in token usage, evolve with problem difficulty, we can gain insights into the efficiency and potential bottlenecks of current LLM architectures. Second, characterizing this scaling behavior is essential for designing more effective and resource-efficient reasoning models in the future.
In this work, we address this question by investigating the scaling of reasoning effort in LLMs using a specific, infinitely scalable logic puzzle: the Tents puzzle (playable in the browser at https://www.chiark.greenend.org.uk/~sgtatham/puzzles/js/tents.html; see Figure 1). This puzzle offers a controlled environment for studying algorithmic reasoning, as its problem size can be systematically increased, and it possesses a known linear-time solution. Our analysis focuses on how the number of tokens used by state-of-the-art reasoning LLMs changes as the puzzle grid size grows. In addition to reasoning effort, we also evaluate the success rate across different puzzle sizes to provide a comprehensive view of model performance.
2 Related Work
The exploration of reasoning abilities in large language models (LLMs) is a rapidly evolving field with significant implications for artificial intelligence. Several benchmarks have been developed to evaluate the reasoning capabilities of LLMs across various domains, providing standardized tasks and evaluation metrics to assess and compare different models. Notable benchmarks include GSM8K (Cobbe et al., 2021), ARC-AGI (Chollet, 2019), GPQA (Rein et al., 2023), MMLU (Hendrycks et al., 2020), SWE-bench (Jimenez et al., 2023), and NPHardEval (Fan et al., 2023), covering topics from mathematics to commonsense reasoning and coding. More recently, math competitions such as AIME 2024 (Mathematical Association of America, 2024) have also been used to evaluate the newest models. Estermann et al. (2024) introduced PUZZLES, a benchmark focusing on algorithmic and logical reasoning for reinforcement learning. While PUZZLES does not focus on LLMs, apart from a short ablation in its appendix, we argue that the scalability of its underlying puzzles makes it an ideal testbed for evaluating extrapolative reasoning capabilities in LLMs.
The reasoning capabilities of traditional LLMs without specific prompting strategies are quite limited (Huang & Chang, 2022). Prompting techniques such as chain-of-thought (Wei et al., 2022), least-to-most (Zhou et al., 2022), and tree-of-thought (Yao et al., 2023) can greatly improve these capabilities. Lee et al. (2024) introduced the Language of Thought Hypothesis, based on the idea that human reasoning is rooted in language, and propose viewing reasoning capabilities through three properties: logical coherence, compositionality, and productivity. In this work we focus primarily on algorithmic reasoning, which falls under logical coherence; specifically, we analyze the limits of logical coherence.
With the release of OpenAI's o1 model, it became apparent that new training strategies based on reinforcement learning can boost reasoning performance even further. Since o1, a number of models capable of enhanced reasoning have emerged (Guo et al., 2025; DeepMind, 2025; Qwen, 2024; OpenAI, 2025). Key to the success of these models is the scaling of test-time compute: instead of directly producing an answer, or thinking for a few steps using chain-of-thought, the models are now trained to think for several thousand tokens before arriving at an answer.
3 Methods
3.1 The Tents Puzzle Problem
In this work, we employ the Tents puzzle, a logic problem that is both infinitely scalable and solvable in linear time (a description of the solver's algorithm is available as part of the PUZZLES benchmark: https://github.com/ETH-DISCO/rlp/blob/main/puzzles/tents.c#L206C3-L206C67), making it an ideal testbed for studying algorithmic reasoning in LLMs. The Tents puzzle, popularized by Simon Tatham's Portable Puzzle Collection (Tatham), requires deductive reasoning to solve. The puzzle is played on a rectangular grid, where some cells are pre-filled with trees. The objective is to place tents in the remaining empty cells according to the following rules:
- no two tents are adjacent, even diagonally
- there are exactly as many tents as trees
- the number of tents in each row and column matches the numbers around the edge of the grid
- it is possible to match all tents to trees so that each tent is orthogonally adjacent to its own tree (a tree may also be adjacent to other tents).
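Taken together, these rules can be checked mechanically. Below is a minimal sketch of a solution checker, using a hypothetical cell encoding (`'T'` tree, `'A'` tent, `'.'` empty) rather than the exact interface used in our experiments; the tent-tree rule is a perfect-matching condition on the bipartite tent-tree adjacency graph, checked here with a standard augmenting-path routine:

```python
from itertools import product

def check_tents(grid, row_counts, col_counts):
    """Validate a Tents solution against the rules above.
    Encoding (hypothetical, for illustration): 'T' = tree, 'A' = tent, '.' = empty."""
    R, C = len(grid), len(grid[0])
    trees = [(r, c) for r, c in product(range(R), range(C)) if grid[r][c] == 'T']
    tents = [(r, c) for r, c in product(range(R), range(C)) if grid[r][c] == 'A']
    # Exactly as many tents as trees.
    if len(tents) != len(trees):
        return False
    # Row and column tent counts must match the edge constraints.
    if any(sum(grid[r][c] == 'A' for c in range(C)) != row_counts[r] for r in range(R)):
        return False
    if any(sum(grid[r][c] == 'A' for r in range(R)) != col_counts[c] for c in range(C)):
        return False
    # No two tents adjacent, even diagonally.
    tent_set = set(tents)
    for r, c in tents:
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) != (0, 0) and (r + dr, c + dc) in tent_set:
                return False
    # Each tent must be matchable to its own orthogonally adjacent tree:
    # a perfect matching in the bipartite tent-tree graph (augmenting paths).
    adj = {t: [i for i, (tr, tc) in enumerate(trees)
               if abs(tr - t[0]) + abs(tc - t[1]) == 1] for t in tents}
    match = {}  # tree index -> tent currently matched to it

    def augment(tent, seen):
        for tree in adj[tent]:
            if tree not in seen:
                seen.add(tree)
                if tree not in match or augment(match[tree], seen):
                    match[tree] = tent
                    return True
        return False

    return all(augment(tent, set()) for tent in tents)
```

Such a checker is what makes the binary success metric in Section 3.2 cheap to evaluate at any grid size.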
An example instance of the Tents puzzle is visualized in Figure 1 in the Introduction. The scalability of the puzzle is achieved by varying the grid dimensions, allowing for systematic exploration of problem complexity. Where not otherwise specified, we used the "easy" difficulty preset available in the Tents puzzle generator, with "tricky" being evaluated in Section A.2.1. Crucially, the Tents puzzle is designed to test extrapolative reasoning, as puzzle instances, especially larger ones, are unlikely to be present in the training data of LLMs. We utilized a text-based interface for the Tents puzzle, extending the code base provided by Estermann et al. (2024) to generate puzzle instances and represent the puzzle state in a format suitable for LLMs.
Models were presented with the same prompt (detailed in Appendix A.1) for all puzzle sizes and models tested. The prompt included the puzzle rules and the initial puzzle state in a textual format. Models were tasked with directly outputting the solved puzzle grid in JSON format. This one-shot approach contrasts with interactive or cursor-based approaches previously used in (Estermann et al., 2024), isolating the reasoning process from potential planning or action selection complexities.
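Since models return the solved grid embedded in free-form text, the JSON array of arrays has to be extracted from the response. The paper does not specify its exact post-processing; a permissive sketch might scan for the last parseable array of arrays:

```python
import json

def extract_grid(response):
    """Return the last JSON array-of-arrays found in a model response, or None.
    A permissive sketch; the actual parsing used in the experiments may differ."""
    decoder = json.JSONDecoder()
    grid = None
    i = response.find('[')
    while i != -1:
        try:
            # raw_decode parses one JSON value starting at index i.
            obj, end = decoder.raw_decode(response, i)
            if isinstance(obj, list) and obj and all(isinstance(row, list) for row in obj):
                grid = obj
                i = end  # skip past this candidate before searching again
        except ValueError:
            pass
        i = response.find('[', i + 1)
    return grid
```

Taking the last valid candidate is a deliberate choice here, since reasoning traces often contain intermediate grids before the final answer.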
3.2 Evaluation Criteria
Our evaluation focuses on two key metrics: success rate and reasoning effort. Success is assessed as a binary measure: whether the LLM successfully outputs a valid solution to the Tents puzzle instance, adhering to all puzzle rules and constraints. We quantify problem complexity by the problem size, defined as the product of the grid dimensions (rows $\times$ columns). To analyze the scaling of reasoning effort, we measure the total number of tokens generated by the LLMs to produce the final answer, including all thinking tokens. We hypothesize a linear scaling relationship between problem size and reasoning effort, and evaluate this hypothesis by fitting a linear model to the observed token counts as a function of problem size. The goodness of fit is quantified using the $R^{2}$ metric, where scores closer to 1 indicate that a larger proportion of the variance in reasoning effort is explained by a linear relationship with problem size. Furthermore, we analyze the percentage of correctly solved puzzles across different problem sizes to assess the performance limits of each model.
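The fitting procedure is standard least squares. As a sketch (the numbers in any example below would be synthetic, not our measurements), the slope, intercept, and $R^{2}$ can be computed as:

```python
import numpy as np

def fit_reasoning_effort(problem_sizes, token_counts):
    """Least-squares fit tokens ~ a*size + b and the R^2 of that fit."""
    x = np.asarray(problem_sizes, dtype=float)
    y = np.asarray(token_counts, dtype=float)
    a, b = np.polyfit(x, y, 1)                    # slope, intercept
    residuals = y - (a * x + b)
    ss_res = float(residuals @ residuals)         # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total sum of squares
    return a, b, 1.0 - ss_res / ss_tot
```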
3.3 Considered Models
We evaluated the reasoning performance of the following large language models known for their enhanced reasoning capabilities: Gemini 2.0 Flash Thinking (DeepMind, 2025), OpenAI o3-mini (OpenAI, 2025), DeepSeek R1 (Guo et al., 2025), and Qwen/QwQ-32B-Preview (Qwen, 2024).
4 Results
![Reasoning tokens versus problem size with linear fits: DeepSeek R1 (R² = 0.667), o3-mini (R² = 0.833), QwQ-32B-Preview (R² = 0.087).](x1.png)
(a)
![Success rate (%) versus problem size for all four models.](x2.png)
(b)
Figure 2: (a) Reasoning effort in number of reasoning tokens versus problem size for DeepSeek R1, o3-mini, and Qwen/QwQ-32B-Preview. Successful attempts only. Linear fits are added for each model. Gemini 2.0 Flash Thinking is excluded due to unknown number of thinking tokens. (b) Solved percentage versus problem size for all models. No model solved problems larger than size 100. o3-mini achieves the highest success rate, followed by DeepSeek R1 and Gemini 2.0 Flash Thinking. Qwen/QwQ-32B-Preview struggles with problem instances larger than size 20.
The relationship between reasoning effort and problem size reveals interesting scaling behaviors across the evaluated models. Figure 2(a) illustrates the scaling of reasoning effort, measured by the number of reasoning tokens, as the problem size increases for successfully solved puzzles. For DeepSeek R1 and o3-mini, we observe a roughly linear increase in reasoning effort with problem size. Notably, the slopes of the linear fits for R1 and o3-mini are very similar, suggesting comparable scaling behavior in reasoning effort for these models, although DeepSeek R1 consistently uses more tokens than o3-mini across problem sizes. Qwen/QwQ-32B-Preview shows a weaker linear correlation, likely due to the limited number of larger puzzles it could solve successfully.
The problem-solving capability of the models, shown in Figure 2(b), reveals performance limits as problem size increases. None of the models solved puzzles with a problem size exceeding 100. o3-mini demonstrates the highest overall solvability, managing to solve the largest problem instances, followed by DeepSeek R1 and Gemini 2.0 Flash Thinking. Qwen/QwQ-32B-Preview’s performance significantly degrades with increasing problem size, struggling to solve instances larger than 25.
![Reasoning tokens versus problem size for o3-mini, split into successful and failed attempts.](x3.png)
(a)
![Reasoning tokens versus problem size for o3-mini at low (R² = 0.489), medium (R² = 0.833), and high (R² = 0.813) reasoning effort.](x4.png)
(b)
Figure 3: (a) Reasoning effort in number of reasoning tokens versus problem size for o3-mini. A peak in reasoning effort is observed around problem size 100, followed by a decline for larger problem sizes. (b) Reasoning effort in number of reasoning tokens versus problem size for o3-mini, categorized by low, medium, and high reasoning effort strategies. Steeper slopes are observed for higher reasoning effort strategies. High reasoning effort enables solving larger instances but also increases token usage for smaller, already solvable problems.
A more detailed analysis of o3-mini's reasoning effort (Figure 3(a)) reveals a non-monotonic trend. While reasoning effort initially increases with problem size, it peaks around a problem size of 100. Beyond this point, the reasoning effort decreases, suggesting a potential "frustration" effect where increased complexity no longer leads to proportionally increased reasoning in the model. The same behavior was not observed for the other models (see Section A.2.2). It would be interesting to see what effect recent work on optimizing reasoning length (Luo et al., 2025) would have on these results.
Figure 3(b) further explores o3-mini’s behavior by categorizing reasoning effort into low, medium, and high strategies. The steepness of the scaling slope increases with reasoning effort, indicating that higher effort strategies lead to a more pronounced increase in token usage as problem size grows. While high reasoning effort enables solving larger puzzles (up to 10x10), it also results in a higher token count even for smaller problems that were already solvable with lower effort strategies. This suggests a trade-off where increased reasoning effort can extend the solvable problem range but may also introduce inefficiencies for simpler instances.
5 Conclusion
This study examined how reasoning effort scales in LLMs using the Tents puzzle. We found that reasoning effort generally scales linearly with problem size for solvable instances. Model performance varied, with o3-mini and DeepSeek R1 outperforming Qwen/QwQ-32B-Preview and Gemini 2.0 Flash Thinking. These results suggest that while LLMs can adapt reasoning effort to problem complexity, their logical coherence has limits, especially for larger problems. Future work should extend this analysis to a wider variety of puzzles from the PUZZLES benchmark, covering different algorithmic complexities. These insights could pave the way toward strategies that improve reasoning scalability and efficiency, for example by optimizing reasoning length or refining prompting techniques. Understanding these limitations is crucial for advancing LLMs in complex problem-solving.
References
- Chollet (2019) François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- DeepMind (2025) DeepMind. Gemini flash thinking. https://deepmind.google/technologies/gemini/flash-thinking/, 2025. Accessed: February 6, 2025.
- Estermann et al. (2024) Benjamin Estermann, Luca A Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. Puzzles: A benchmark for neural algorithmic reasoning. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
- Fan et al. (2023) Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv preprint arXiv:2312.14890, 2023.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Huang & Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
- Jimenez et al. (2023) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- Lee et al. (2024) Seungpil Lee, Woochang Sim, Donghyeon Shin, Wongyu Seo, Jiwon Park, Seokki Lee, Sanha Hwang, Sejin Kim, and Sundong Kim. Reasoning abilities of large language models: In-depth analysis on the abstraction and reasoning corpus. ACM Transactions on Intelligent Systems and Technology, 2024.
- Luo et al. (2025) Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025.
- Mathematical Association of America (2024) Mathematical Association of America. 2024 AIME I problems. https://artofproblemsolving.com/wiki/index.php/2024_AIME_I, 2024. Accessed: February 6, 2025.
- OpenAI (2025) OpenAI. Openai o3 mini. https://openai.com/index/openai-o3-mini/, 2025. Accessed: February 6, 2025.
- Qwen (2024) Qwen. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
- Tatham Simon Tatham. Simon Tatham's portable puzzle collection. https://www.chiark.greenend.org.uk/~sgtatham/puzzles/. Accessed: February 6, 2025.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, may 2023. arXiv preprint arXiv:2305.10601, 14, 2023.
- Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
Appendix A
A.1 Full Prompt
The full prompt used in the experiments is shown below, illustrated for a 4x4 puzzle:
```
You are a logic puzzle expert. You will be given a logic puzzle to solve. Here is a description of the puzzle:
You have a grid of squares, some of which contain trees. Your aim is to place tents in some of the remaining squares, in such a way that the following conditions are met:
There are exactly as many tents as trees.
The tents and trees can be matched up in such a way that each tent is directly adjacent (horizontally or vertically, but not diagonally) to its own tree. However, a tent may be adjacent to other trees as well as its own.
No two tents are adjacent horizontally, vertically or diagonally.
The number of tents in each row, and in each column, matches the numbers given in the row or column constraints.
Grass indicates that there cannot be a tent in that position.
You receive an array representation of the puzzle state as a grid. Your task is to solve the puzzle by filling out the grid with the correct values. You need to solve the puzzle on your own, you cannot use any external resources or run any code. Once you have solved the puzzle, tell me the final answer without explanation. Return the final answer as a JSON array of arrays.
Here is the current state of the puzzle as a string of the internal state representation:
A 0 represents an empty cell, a 1 represents a tree, a 2 represents a tent, and a 3 represents a grass patch.
Tents puzzle state:
Current grid:
[[0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 0 0]]
The column constraints are the following:
[1 1 0 1]
The row constraints are the following:
[2 0 0 1]
```
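The rules and the integer encoding above (0 empty, 1 tree, 2 tent, 3 grass) fully determine whether a returned grid is a valid solution. The following is an illustrative sketch of such a checker, not the authors' evaluation code; the function name `is_valid_solution` is our own, it assumes a square grid, and it checks only the final grid (not, for instance, that tents avoid cells that were grass in the input state):

```python
from itertools import product

def is_valid_solution(grid, col_constraints, row_constraints):
    """Check a candidate Tents grid against the puzzle rules (sketch)."""
    n = len(grid)  # assumes a square n x n grid
    trees = [(r, c) for r, c in product(range(n), repeat=2) if grid[r][c] == 1]
    tents = [(r, c) for r, c in product(range(n), repeat=2) if grid[r][c] == 2]

    # Rule: exactly as many tents as trees.
    if len(tents) != len(trees):
        return False

    # Rule: tent counts per row and per column match the constraints.
    for r in range(n):
        if sum(1 for c in range(n) if grid[r][c] == 2) != row_constraints[r]:
            return False
    for c in range(n):
        if sum(1 for r in range(n) if grid[r][c] == 2) != col_constraints[c]:
            return False

    # Rule: no two tents adjacent horizontally, vertically, or diagonally.
    for (r1, c1), (r2, c2) in product(tents, repeat=2):
        if (r1, c1) != (r2, c2) and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return False

    # Rule: tents and trees admit a perfect matching where each tent is
    # orthogonally adjacent to its own tree (Kuhn's augmenting-path matching).
    adj = {t: [tr for tr in trees
               if abs(t[0] - tr[0]) + abs(t[1] - tr[1]) == 1] for t in tents}
    match = {}  # tree -> tent

    def try_assign(tent, seen):
        for tree in adj[tent]:
            if tree in seen:
                continue
            seen.add(tree)
            if tree not in match or try_assign(match[tree], seen):
                match[tree] = tent
                return True
        return False

    return all(try_assign(t, set()) for t in tents)
```

For the 4x4 example above, a grid placing tents at (0,1), (0,3), and (3,0) satisfies all four rules, while the unsolved input grid fails the tent-count check.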
A.2 Additional Figures
A.2.1 Easy vs. Tricky Puzzles
<details>
<summary>x5.png Details</summary>

### Visual Description
## Scatter Plot: Relationship Between Problem Size and Reasoning Tokens
### Overview
The image is a scatter plot comparing "Reasoning Tokens" (y-axis) against "Problem Size" (x-axis) for two difficulty levels: "easy" and "tricky." Two trend lines (solid blue for "easy" and dashed orange for "tricky") are overlaid, with R² values indicating model fit. The legend is positioned in the bottom-right corner.
### Components/Axes
- **X-axis (Problem Size)**: Ranges from 15 to 40, with increments of 5.
- **Y-axis (Reasoning Tokens)**: Ranges from 0 to 5000, with increments of 1000.
- **Legend**:
- **Blue circles**: "easy" data points.
- **Orange squares**: "tricky" data points.
- **Solid blue line**: "easy" trend line (R²: 0.468).
- **Dashed orange line**: "tricky" trend line (R²: 0.502).
### Detailed Analysis
- **Data Points**:
- **Easy (blue circles)**:
- Clustered around the trend line, with some spread.
- Example values:
- Problem Size 15: ~1500 tokens.
- Problem Size 20: ~1800 tokens.
- Problem Size 30: ~2500 tokens.
- Problem Size 40: ~4000 tokens.
- **Tricky (orange squares)**:
- More dispersed, with higher variability.
- Example values:
- Problem Size 20: ~1200 tokens.
- Problem Size 25: ~2800 tokens.
- Problem Size 30: ~4500 tokens.
- Problem Size 35: ~5000 tokens.
- **Trend Lines**:
- **Easy**: Slightly upward slope, starting near 1500 tokens at Problem Size 15 and reaching ~3500 at 40.
- **Tricky**: Steeper upward slope, starting near 1200 tokens at Problem Size 15 and reaching ~5000 at 35.
### Key Observations
1. **Trend Correlation**:
- Both "easy" and "tricky" show positive correlations between Problem Size and Reasoning Tokens.
- "Tricky" has a higher R² (0.502 vs. 0.468), suggesting a stronger linear relationship.
2. **Outliers/Anomalies**:
- A single "tricky" data point at Problem Size 35 (5000 tokens) appears as an outlier, far above the trend line.
- "Easy" data points at Problem Size 40 (~4000 tokens) deviate slightly from the trend line.
3. **Variability**:
- "Tricky" data points exhibit greater spread, indicating higher inconsistency in reasoning token usage.
- "Easy" data points are more tightly clustered around the trend line.
### Interpretation
The data suggests that **Problem Size directly influences Reasoning Tokens**, with "tricky" problems requiring significantly more tokens than "easy" ones. The higher R² for "tricky" implies the model fits this data better, but the spread in "easy" data points may reflect variability in how "easy" problems are processed. The outlier at Problem Size 35 for "tricky" could indicate an exceptional case or a data entry error. The steeper slope for "tricky" highlights that complexity amplifies token usage, while the "easy" trend line’s gentler slope suggests more predictable resource allocation.
</details>
(a)
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Graph: Success Rate vs. Problem Size
### Overview
The graph illustrates the relationship between problem size (x-axis) and success rate (y-axis) for two difficulty levels: "easy" (blue solid line) and "tricky" (orange dashed line). Both lines show a sharp decline in success rate as problem size increases, with distinct thresholds where success rates drop to near 0%.
### Components/Axes
- **Title**: "Success Rate (%)"
- **X-axis**: "Problem Size" (ranges from 20 to 120 in increments of 20)
- **Y-axis**: "Success Rate (%)" (ranges from 0% to 100% in increments of 20)
- **Legend**: Located in the top-right corner, with:
- Blue solid line labeled "easy"
- Orange dashed line labeled "tricky"
### Detailed Analysis
1. **Easy (Blue Line)**:
- Starts at **100%** success rate for problem sizes 20, 25, and 30.
- Drops sharply to **~65%** at problem size 40.
- Plummets to **~0%** by problem size 50, remaining flat thereafter.
2. **Tricky (Orange Line)**:
- Starts at **100%** success rate for problem sizes 20 and 25.
- Declines to **~35%** at problem size 30.
- Further drops to **~25%** at problem size 35.
- Reaches **~0%** by problem size 40, remaining flat thereafter.
### Key Observations
- Both difficulty levels exhibit a **threshold effect**: success rates collapse abruptly after specific problem sizes (50 for "easy," 40 for "tricky").
- The "tricky" line shows a **steeper initial decline** compared to "easy," suggesting higher sensitivity to problem size increases.
- Success rates for both lines are **identical (0%)** for problem sizes ≥50.
### Interpretation
The data demonstrates that problem size critically impacts success rates, with "tricky" problems being more vulnerable to increases in size. The abrupt drops suggest a **non-linear relationship** where small increases in problem size beyond certain thresholds render success nearly impossible. This could imply that "tricky" problems require significantly more resources or cognitive effort, making them less scalable than "easy" problems. The identical 0% success rate for large problem sizes may indicate a **hard failure ceiling** for both difficulty levels under the tested conditions.
</details>
(b)
Figure 4: (a) Reasoning effort, in number of reasoning tokens, versus problem size for o3-mini with reasoning effort low (successful tries only); linear fits are shown for each difficulty level. (b) Percentage of puzzles solved versus problem size for o3-mini with reasoning effort low.
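The linear fits and R² values reported in Figures 4–6 can be reproduced from (problem size, reasoning tokens) pairs with a standard least-squares fit. The sketch below is illustrative: the helper `linear_fit_r2` and the sample data are our own, not the authors' analysis code or measurements.

```python
import numpy as np

def linear_fit_r2(sizes, tokens):
    """Fit tokens ~ a * size + b by least squares and return (a, b, R^2)."""
    sizes = np.asarray(sizes, dtype=float)
    tokens = np.asarray(tokens, dtype=float)
    a, b = np.polyfit(sizes, tokens, deg=1)
    pred = a * sizes + b
    ss_res = np.sum((tokens - pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((tokens - tokens.mean()) ** 2)  # total sum of squares
    return a, b, 1.0 - ss_res / ss_tot

# Illustrative data only, roughly in the range shown in Figure 4(a).
sizes = [15, 20, 25, 30, 35, 40]
tokens = [1500, 1800, 2100, 2500, 3200, 4000]
slope, intercept, r2 = linear_fit_r2(sizes, tokens)
```

A positive slope corresponds to reasoning effort growing with problem size; R² quantifies how well that growth is captured by a straight line, as in the per-difficulty fits shown in the figures.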
<details>
<summary>x7.png Details</summary>

### Visual Description
## Scatter Plot: Relationship Between Problem Size and Reasoning Tokens
### Overview
The image is a scatter plot comparing the relationship between problem size (x-axis) and reasoning tokens (y-axis) for two difficulty levels: "easy" and "tricky." Two trend lines are overlaid: a solid blue line for "easy" problems (R² = 0.829) and a dashed orange line for "tricky" problems (R² = 0.903). Data points are represented as blue dots (easy) and orange squares (tricky).
### Components/Axes
- **X-axis (Problem Size)**: Ranges from 20 to 80 in increments of 10.
- **Y-axis (Reasoning Tokens)**: Ranges from 0 to 30,000 in increments of 5,000.
- **Legend**: Located in the top-left corner, with:
- Blue dots labeled "easy" (solid line, R² = 0.829).
- Orange squares labeled "tricky" (dashed line, R² = 0.903).
### Detailed Analysis
1. **Trend Lines**:
- **Easy (Blue)**: The solid blue line slopes upward, indicating a positive correlation between problem size and reasoning tokens. R² = 0.829 suggests a strong linear relationship.
- **Tricky (Orange)**: The dashed orange line also slopes upward but is steeper than the blue line, with a higher R² (0.903), indicating a stronger linear fit for tricky problems.
2. **Data Points**:
- **Easy (Blue Dots)**: Scattered around the blue line, with values ranging from ~2,000 to ~20,000 tokens as problem size increases from 20 to 80.
- **Tricky (Orange Squares)**: Positioned higher than easy data points, with values ranging from ~4,000 to ~30,000 tokens. The orange squares align closely with the dashed line, especially at larger problem sizes.
3. **Spatial Grounding**:
- Legend: Top-left corner.
- Blue line: Bottom-left to top-right, slightly below the orange line.
- Orange line: Bottom-left to top-right, steeper and higher than the blue line.
### Key Observations
- Both difficulty levels show increasing reasoning tokens with larger problem sizes.
- Tricky problems consistently require more tokens than easy problems at equivalent problem sizes.
- The orange line (tricky) has a higher R², suggesting better predictability for tricky problems.
- Data points for tricky problems are more tightly clustered around their trend line compared to easy problems.
### Interpretation
The data demonstrates that problem size directly influences the number of reasoning tokens required, with tricky problems demanding significantly more resources. The higher R² for tricky problems (0.903 vs. 0.829) implies that the model’s predictions for tricky problems are more reliable, possibly due to clearer patterns in their complexity. The steeper slope for tricky problems suggests that their difficulty scales more rapidly with problem size. Outliers are minimal, but the spread in easy problem data points may reflect variability in problem-solving strategies or external factors affecting token usage.
</details>
(a)
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Graph: Success Rate vs. Problem Size
### Overview
The graph compares success rates for "easy" and "tricky" problem categories across varying problem sizes (20–120). Two lines represent success rates: a solid blue line for "easy" and a dashed orange line for "tricky". Both lines start at 100% success rate but diverge significantly after problem size 40.
### Components/Axes
- **X-axis (Problem Size)**: Labeled from 20 to 120 in increments of 20.
- **Y-axis (Success Rate %)**: Labeled from 0 to 100 in increments of 20.
- **Legend**: Located in the top-right corner, with:
- Solid blue line: "easy"
- Dashed orange line: "tricky"
### Detailed Analysis
1. **Blue Line ("easy")**:
- **Trend**: Maintains 100% success rate from problem size 20 to 40.
- **Drop**: Sharp decline from 100% at 40 to 0% at 100, forming a straight diagonal line.
- **Post-100**: Remains at 0% until 120.
2. **Orange Line ("tricky")**:
- **Initial Stability**: Stays at 100% from 20 to 40.
- **Dip**: Falls to ~35% at 40, then rises sharply to 100% at 50.
- **Plateau**: Remains at 100% from 50 to 120.
### Key Observations
- **Threshold Behavior**: Both categories maintain perfect success rates until problem size 40.
- **Divergence at 40**: "Easy" problems collapse to 0% success rate by 100, while "tricky" problems recover to 100% by 50.
- **Asymmetry**: "Tricky" problems outperform "easy" problems at larger sizes (50–120), despite initial parity.
### Interpretation
The data suggests that "easy" problems are only solvable up to a critical problem size (~40), after which success rates plummet. In contrast, "tricky" problems exhibit resilience: their success rate recovers fully by problem size 50 and remains stable. This implies that "tricky" problems may involve adaptive strategies or resource allocation that scale effectively with complexity, whereas "easy" problems lack such flexibility. The abrupt drop in the "easy" line could indicate a phase transition or computational limitation at larger scales.
</details>
(b)
Figure 5: (a) Reasoning effort, in number of reasoning tokens, versus problem size for o3-mini with reasoning effort medium (successful tries only); linear fits are shown for each difficulty level. (b) Percentage of puzzles solved versus problem size for o3-mini with reasoning effort medium.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Scatter Plot: Reasoning Tokens vs Problem Size
### Overview
The image is a scatter plot comparing reasoning tokens required for "easy" and "tricky" problem types across varying problem sizes (20-100). Two trend lines with R² values are overlaid to show correlation strength.
### Components/Axes
- **X-axis**: Problem Size (20-100, linear scale)
- **Y-axis**: Reasoning Tokens (0-60,000, linear scale)
- **Legend**:
- Top-left corner
- Blue circles: "easy" (solid line, R²=0.811)
- Orange squares: "tricky" (dashed line, R²=0.607)
### Detailed Analysis
1. **Easy Data Series**:
- Blue circles show a strong positive linear trend (R²=0.811)
- At problem size 20: ~5,000 tokens
- At problem size 100: ~55,000 tokens
- Consistent upward trajectory with minimal scatter
2. **Tricky Data Series**:
- Orange squares show weaker positive trend (R²=0.607)
- At problem size 20: ~8,000 tokens
- At problem size 100: ~52,000 tokens
- Greater vertical dispersion, especially at mid-problem sizes (40-80)
3. **Trend Lines**:
- Solid blue line (easy) has steeper slope than dashed orange line (tricky)
- Both lines pass through origin but diverge at higher problem sizes
### Key Observations
- **Correlation Strength**: Easy problems show significantly stronger linear relationship (R²=0.811 vs 0.607)
- **Token Scaling**: Both problem types scale similarly at extremes (20 and 100), but diverge in mid-range
- **Outliers**: Tricky problems show 3-4 data points exceeding trend line predictions at problem sizes 60-80
- **Data Density**: Higher concentration of data points in 40-60 problem size range for both series
### Interpretation
The data demonstrates that while both easy and tricky problems require increasing tokens with problem size, easy problems exhibit more predictable scaling. The higher R² value for easy problems suggests better model generalizability for these cases. The convergence at problem size 100 implies both types reach similar complexity thresholds at maximum size, despite different difficulty classifications. The scattered nature of tricky problems indicates potential confounding variables affecting token requirements beyond problem size alone. This pattern could inform resource allocation strategies for AI systems handling mixed difficulty tasks.
</details>
(a)
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Graph: Success Rate vs Problem Size
### Overview
The image depicts a line graph comparing success rates for two difficulty levels ("easy" and "tricky") across varying problem sizes (20–120). The graph shows two distinct trends: one for easy problems (solid blue line) and one for tricky problems (dashed orange line).
### Components/Axes
- **X-axis (Problem Size)**: Labeled "Problem Size" with markers at 20, 40, 60, 80, 100, and 120.
- **Y-axis (Success Rate %)**: Labeled "Success Rate (%)" with markers at 0, 20, 40, 60, 80, and 100.
- **Legend**: Located in the bottom-left corner, with:
- Solid blue line: "easy"
- Dashed orange line: "tricky"
### Detailed Analysis
1. **Easy (Blue Line)**:
- Maintains 100% success rate for problem sizes 20–100.
- Drops sharply from 100% to 0% between problem sizes 100 and 120.
- Key data points:
- 20: 100%
- 40: 100%
- 60: 100%
- 80: 100%
- 100: 100%
- 120: 0%
2. **Tricky (Orange Line)**:
- Maintains 100% success rate for problem sizes 20–60.
- Drops to ~65% at problem size 80, remains flat until 100.
- Drops sharply to 0% between problem sizes 100 and 120.
- Key data points:
- 20: 100%
- 40: 100%
- 60: 100%
- 80: ~65%
- 100: ~65%
- 120: 0%
### Key Observations
- **Threshold Behavior**: Both difficulty levels maintain perfect success until a critical problem size (100 for "easy," 60 for "tricky"), after which performance collapses.
- **Catastrophic Failure**: Both lines exhibit abrupt drops to 0% at the maximum problem size (120), suggesting a binary success/failure outcome.
- **Divergence at Mid-Sizes**: The "tricky" line shows a significant drop (~35%) at problem size 80, while the "easy" line remains stable.
### Interpretation
The graph demonstrates that:
1. **Problem Size Thresholds**: There exists a critical problem size beyond which success rates collapse entirely, regardless of difficulty. For "easy" problems, this occurs at size 100; for "tricky" problems, it occurs earlier at size 60.
2. **Difficulty Impact**: While "easy" problems maintain perfect performance up to size 100, "tricky" problems begin failing at half that size (60), though not catastrophically until size 100.
3. **Binary Outcomes**: The abrupt drops to 0% suggest a system with no partial success—problems are either fully solved or completely failed, with no intermediate states.
This pattern could reflect a system where:
- Problem complexity increases non-linearly with size
- Users/algorithms have a "breaking point" beyond which performance degrades rapidly
- Difficulty scaling introduces compounding errors that become insurmountable at certain thresholds
</details>
(b)
Figure 6: (a) Reasoning effort, in number of reasoning tokens, versus problem size for o3-mini with reasoning effort high (successful tries only); linear fits are shown for each difficulty level. (b) Percentage of puzzles solved versus problem size for o3-mini with reasoning effort high.
A.2.2 Reasoning Effort for All Models
<details>
<summary>x11.png Details</summary>

### Visual Description
## Scatter Plot: Reasoning Tokens vs Problem Size for qwen/qwq-32b-preview
### Overview
The image shows a scatter plot comparing reasoning token usage against problem size for two categories of outcomes: successful and failed runs of the qwen/qwq-32b-preview model. The plot uses distinct markers (blue circles for successful, orange squares for failed) to differentiate outcomes across problem sizes from 0 to 400.
### Components/Axes
- **X-axis (Problem Size)**:
- Range: 0 to 400
- Increment: 100
- Label: "Problem Size"
- **Y-axis (Reasoning Tokens)**:
- Range: 0 to 20,000
- Increment: 5,000
- Label: "Reasoning Tokens"
- **Legend**:
- Position: Top-right corner
- Entries:
- Blue circles: "qwen/qwq-32b-preview (Successful)"
- Orange squares: "qwen/qwq-32b-preview (Failed)"
### Detailed Analysis
- **Successful Runs (Blue Circles)**:
- Concentrated in the lower-left quadrant (Problem Size: 0–100, Reasoning Tokens: 3,000–7,000).
- No data points observed beyond Problem Size 100.
- Clustered tightly, suggesting consistent token usage for smaller problems.
- **Failed Runs (Orange Squares)**:
- Distributed across the entire Problem Size range (0–400).
- Token usage spans 3,000–20,000, with a notable outlier at (400, 20,000).
- Higher density of points in the 100–300 Problem Size range (Tokens: 8,000–15,000).
### Key Observations
1. **Problem Size Correlation**:
- Successful runs cluster at smaller problem sizes (≤100), while failed runs span all sizes.
- Token usage increases with problem size for failed cases, peaking at 20,000 tokens for Problem Size 400.
2. **Outlier**:
- A single failed run at (400, 20,000) represents the maximum token usage observed.
3. **Efficiency Gap**:
- Successful runs use ≤7,000 tokens, while failed runs often exceed this threshold, especially at larger problem sizes.
### Interpretation
The data suggests that problem size significantly impacts both success rates and token efficiency. Successful runs are confined to smaller problem sizes and exhibit lower token consumption, indicating the model's capacity limits. Failed runs show a clear trend of increasing token usage with problem size, culminating in the outlier at Problem Size 400. This outlier may represent an edge case where the model expended maximum resources but still failed, highlighting potential scalability challenges. The stark contrast between successful and failed runs implies that token efficiency is a critical factor in determining task success, with larger problem sizes correlating with higher computational demands and lower success rates.
</details>
Figure 7: Reasoning effort in tokens for Qwen QwQ.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Scatter Plot: Reasoning Tokens vs. Problem Size for deepseek/deepseek-r1
### Overview
This scatter plot compares the number of reasoning tokens used by the deepseek/deepseek-r1 model across varying problem sizes, distinguishing between successful and failed outcomes. The x-axis represents problem size (0–400), and the y-axis represents reasoning tokens (4,000–16,000). Two data series are plotted: blue circles for successful runs and orange squares for failed runs.
### Components/Axes
- **X-axis (Problem Size)**:
- Range: 0 to 400 (linear scale).
- Labels: Incremented by 100 (0, 100, 200, 300, 400).
- **Y-axis (Reasoning Tokens)**:
- Range: 4,000 to 16,000 (linear scale).
- Labels: Incremented by 2,000 (4,000, 6,000, 8,000, 10,000, 12,000, 14,000, 16,000).
- **Legend**:
- Position: Bottom-right corner.
- Labels:
- Blue circles: "deepseek/deepseek-r1 (Successful)".
- Orange squares: "deepseek/deepseek-r1 (Failed)".
### Detailed Analysis
- **Successful Runs (Blue Circles)**:
- **Problem Size**: Concentrated between 0 and 100.
- **Reasoning Tokens**: Clustered between ~4,000 and ~14,000, with a peak density near 8,000 tokens.
- **Distribution**: Approximately 30–40 data points, tightly grouped in the lower-left quadrant.
- **Failed Runs (Orange Squares)**:
- **Problem Size**: Spread across 50–400, with higher density above 150.
- **Reasoning Tokens**: Range from ~8,000 to ~16,000, with a notable outlier at ~16,000 tokens for problem size 400.
- **Distribution**: Approximately 30–40 data points, dispersed diagonally from lower-left to upper-right.
### Key Observations
1. **Inverse Relationship**: Successful runs dominate at smaller problem sizes (0–100), while failed runs increase as problem size grows.
2. **Token Usage**:
- Successful runs use fewer tokens on average (~6,000–10,000).
- Failed runs use more tokens (~10,000–16,000), suggesting higher computational effort for unresolved problems.
3. **Outliers**:
- A single failed run at problem size 400 uses ~16,000 tokens, the maximum observed.
- Successful runs at problem size 100 use ~12,000–14,000 tokens, indicating edge-case complexity.
### Interpretation
The data suggests that deepseek/deepseek-r1 performs well on smaller problems but struggles with larger ones. Successful runs cluster at lower problem sizes and token usage, while failed runs correlate with higher problem sizes and token consumption. This implies that the model’s reasoning capacity is limited by problem complexity, requiring more resources (tokens) for larger inputs but failing more frequently. The outlier at problem size 400 highlights a potential failure mode where extreme token usage does not guarantee success. The trend underscores the need for optimization in handling larger-scale reasoning tasks.
</details>
Figure 8: Reasoning effort in tokens for Deepseek R1.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Scatter Plot: Response Time vs Problem Size for Gemini-2.0-Flash-Thinking Experiments
### Overview
The image is a scatter plot comparing response times (in seconds) to problem sizes for two experimental conditions: "Successful" and "Failed" outcomes from the Gemini-2.0-flash-thinking-exp-01-21 experiment. The plot uses distinct markers (blue circles for successful, orange squares for failed) to differentiate outcomes.
### Components/Axes
- **X-axis (Problem Size)**: Ranges from 0 to 400 in increments of 100.
- **Y-axis (Response Time (s))**: Ranges from 0 to 150 in increments of 25.
- **Legend**: Located in the bottom-left corner, with:
- **Blue circles**: "gemini-2.0-flash-thinking-exp-01-21 (Successful)"
- **Orange squares**: "gemini-2.0-flash-thinking-exp-01-21 (Failed)"
- **Gridlines**: Light gray horizontal and vertical lines for reference.
### Detailed Analysis
1. **Successful Cases (Blue Circles)**:
- **Distribution**: Clustered tightly in the lower-left quadrant.
- **Response Time**: Approximately 10–30 seconds.
- **Problem Size**: Mostly ≤50, with a few outliers up to ~75.
- **Trend**: Response time increases slightly with problem size but remains low overall.
2. **Failed Cases (Orange Squares)**:
- **Distribution**: Spread across the entire plot, with higher density in the upper-right quadrant.
- **Response Time**: Ranges from ~50 to 150 seconds.
- **Problem Size**: Extends up to 400, with a notable concentration between 200–400.
- **Trend**: Response time increases significantly with problem size, especially beyond 200.
3. **Outliers**:
- A single successful case (blue circle) at problem size ~100 and response time ~25 seconds.
- A failed case (orange square) at problem size ~350 and response time ~150 seconds (highest observed).
### Key Observations
- **Problem Size vs. Response Time**: Both successful and failed cases show a positive correlation between problem size and response time, but the relationship is much stronger for failed cases.
- **Success Threshold**: Successful outcomes are predominantly associated with problem sizes ≤75, while failures dominate at larger sizes.
- **Response Time Variability**: Failed cases exhibit greater variability in response times, with some instances exceeding 125 seconds.
### Interpretation
The data suggests that problem size is a critical factor in determining the success of the Gemini-2.0-flash-thinking-exp-01-21 experiment. Successful outcomes are consistently achieved for smaller problem sizes (≤75), with response times remaining efficient (10–30 seconds). As problem size increases beyond 75, the likelihood of failure rises sharply, accompanied by a proportional increase in response time. This implies potential limitations in the model's ability to handle larger inputs efficiently, possibly due to computational constraints or algorithmic complexity. The failed cases at the highest problem sizes (300–400) with response times near 150 seconds may indicate timeouts or resource exhaustion, highlighting a need for optimization or scaling strategies for larger-scale applications.
</details>
Figure 9: Reasoning effort quantified by response time for Gemini-2.0-flash-thinking.
A.3 Cost
The total cost of these experiments was approximately 80 USD in API credits.