# Interpretable Contrastive Monte Carlo Tree Search Reasoning
## Abstract
We propose (S)peculative (C)ontrastive MCTS∗ (SC-MCTS∗): a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs) that significantly improves both reasoning accuracy and speed. Our motivation comes from three observations: 1. previous MCTS-based LLM reasoning works often overlooked MCTS's biggest drawback, its slower speed compared to CoT; 2. previous research mainly used MCTS as a tool for LLM reasoning on various tasks, with limited quantitative analysis or ablation studies of its components from a reasoning-interpretability perspective; 3. the reward model is the most crucial component of MCTS, yet previous work rarely conducted in-depth study or improvement of MCTS's reward models. We therefore conducted extensive ablation studies and quantitative analysis on the components of MCTS, revealing the impact of each component on the MCTS reasoning performance of LLMs. Building on this, (i) we designed a highly interpretable reward model based on the principle of contrastive decoding and (ii) achieved an average speed improvement of 51.9% per node using speculative decoding. Additionally, (iii) we improved the UCT node selection strategy and backpropagation used in previous works, resulting in significant performance gains. Using Llama-3.1-70B with SC-MCTS∗, we outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step reasoning dataset. Our code is available at https://github.com/zitian-gao/SC-MCTS.
## 1 Introduction
With the remarkable development of Large Language Models (LLMs), models such as o1 (OpenAI, 2024a) have gained a strong ability for multi-step reasoning across complex tasks and can solve scientific, coding, and mathematical problems more difficult than those previously tractable. Reasoning tasks have long been considered challenging for LLMs. These tasks require converting a problem into a series of reasoning steps and then executing those steps to arrive at the correct answer. Recently, LLMs have shown great potential in addressing such problems. A key approach is Chain of Thought (CoT) (Wei et al., 2024), where LLMs break down the solution into a series of reasoning steps before arriving at the final answer. Despite the impressive capabilities of CoT-based LLMs, they still face challenges when solving problems with an increasing number of reasoning steps due to the curse of autoregressive decoding (Sprague et al., 2024). Previous work has explored reasoning through heuristic reasoning algorithms. For example, Yao et al. (2024) applied heuristic search such as Depth-First Search (DFS) to derive better reasoning paths. Similarly, Hao et al. (2023) employed MCTS to iteratively enhance reasoning step by step toward the goal.
The tremendous success of AlphaGo (Silver et al., 2016) has demonstrated the effectiveness of the heuristic MCTS algorithm, showcasing its exceptional performance across various domains (Jumper et al., 2021; Silver et al., 2017). Building on this, MCTS has also made notable progress in the field of LLMs through multi-step heuristic reasoning. Previous work has highlighted the potential of heuristic MCTS to significantly enhance LLM reasoning capabilities. Despite these advancements, substantial challenges remain in fully realizing the benefits of heuristic MCTS in LLM reasoning.
Figure 1: An overview of SC-MCTS ∗. We employ a novel reward model based on the principle of contrastive decoding to guide MCTS Reasoning on Blocksworld multi-step reasoning dataset.
The first key challenge is that MCTS's general reasoning ability is almost entirely dependent on the reward model's performance (as demonstrated by our ablation experiments in Section 5.5), making it highly challenging to design dense, general, yet efficient rewards to guide MCTS reasoning. Previous works either require two or more LLMs (Tian et al., 2024) or additional training (Zhang et al., 2024a), escalating VRAM and computational demands, or they rely on domain-specific tools (Xin et al., 2024a; b) or datasets (Qi et al., 2024), making them difficult to generalize to other tasks or datasets.
The second key challenge is that MCTS is significantly slower than Chain of Thought (CoT). CoT only requires designing a multi-turn chat prompt (Wei et al., 2024). In contrast, MCTS builds a reasoning tree with 2–10 layers depending on the difficulty of the task, where each node in the tree represents a chat round with the LLM and may need to be visited one or more times. Moreover, to obtain better performance, we typically perform 2–10 MCTS iterations, which greatly increases the number of nodes, leading to much higher computational costs and slower reasoning speed.
To address these challenges, we went beyond prior works that treated MCTS merely as a tool and focused on analyzing and improving its components, especially the reward model. Using contrastive decoding, we redesigned the reward model by integrating interpretable reward signals, clustering their prior distributions, and normalizing the rewards with our proposed prior-statistics method. To prevent distribution shift, we also incorporated an online incremental update algorithm. We found that the commonly used Upper Confidence Bound on Trees (UCT) strategy often underperformed due to its sensitivity to the exploration constant, so we refined it and improved backpropagation to favor steadily improving paths. To address speed, we integrated speculative decoding as a "free lunch." All experiments were conducted on the Blocksworld dataset detailed in Section 5.1.
Our goal is to: (i) design novel and high-performance reward models and maximize the performance of reward model combinations, (ii) analyze and optimize the performance of various MCTS components, (iii) enhance the interpretability of MCTS reasoning, (iv) and accelerate MCTS reasoning. Our contributions are summarized as follows:
1. We went beyond previous works, which primarily treated MCTS as a tool rather than analyzing and improving its components. Specifically, our experiments show that the UCT strategy used in most previous works may fail to function. We also refined the backpropagation of MCTS to prefer more steadily improving paths, boosting performance.
1. To fully study the interpretability of MCTS multi-step reasoning, we conducted extensive quantitative analysis and ablation studies on every component. We ran numerous experiments on both the numerical values and the distributions of the reward models, as well as their intrinsic interpretability, providing better interpretability for MCTS multi-step reasoning.
1. We designed a novel, general action-level reward model based on the principle of contrastive decoding, which requires no external tools, training, or datasets. Additionally, we found that previous works often failed to effectively harness multiple reward models, thus we proposed a statistical linear combination method. At the same time, we introduced speculative decoding to speed up MCTS reasoning by an average of 52% as a "free lunch."
We demonstrated the effectiveness of our approach by outperforming OpenAI’s flagship o1-mini model by an average of 17.4% using Llama-3.1-70B on the Blocksworld multi-step reasoning dataset.
## 2 Related Work
#### Large Language Models Multi-Step Reasoning
One of the key focus areas for LLMs is understanding and enhancing their reasoning capabilities. Recent advancements in this area have focused on developing methods that improve LLMs' ability to handle complex tasks in domains like code generation and mathematical problem-solving. Chain-of-Thought (CoT) (Wei et al., 2024) reasoning has been instrumental in helping LLMs break down intricate problems into a sequence of manageable steps, making them more adept at handling tasks that require logical reasoning. Building upon this, Tree-of-Thought (ToT) (Yao et al., 2024) reasoning extends CoT by allowing models to explore multiple reasoning paths concurrently, thereby enhancing their ability to evaluate different solutions simultaneously. Complementing these approaches, Monte Carlo Tree Search (MCTS) has emerged as a powerful reasoning method for decision-making in LLMs. Originally central to AlphaGo's victory (Silver et al., 2016), MCTS has been adapted to guide model-based planning by balancing exploration and exploitation through tree-based search and random sampling, and later to large language model reasoning (Hao et al., 2023), showing strong results. This adaptation has proven particularly effective in areas requiring strategic planning. Notable implementations like ReST-MCTS∗ (Zhang et al., 2024a), rStar (Qi et al., 2024), MCTSr (Zhang et al., 2024b) and Xie et al. (2024) have shown that integrating MCTS with reinforced self-training, self-play mutual reasoning, or Direct Preference Optimization (Rafailov et al., 2023) can significantly improve reasoning capabilities in LLMs. Furthermore, recent advancements such as DeepSeek-Prover (Xin et al., 2024a; b) demonstrate the potential of these models to follow complex instructions such as formal mathematical proofs.
#### Decoding Strategies
Contrastive decoding and speculative decoding both require Smaller Language Models (SLMs), yet few have realized that these two clever decoding methods can be seamlessly combined without any additional cost. The only work that noticed this was Yuan et al. (2024a), but their proposed speculative contrastive decoding operates at the token level. In contrast, we designed a new action-level contrastive decoding to guide MCTS reasoning; the distinction is discussed further in Section 4.1. For more detailed related work, please refer to Appendix B.
## 3 Preliminaries
### 3.1 Multi-Step Reasoning
A multi-step reasoning problem can be modeled as a Markov Decision Process (Bellman, 1957) $\mathcal{M}=(S,A,P,r,\gamma)$ . $S$ is the state space containing all possible states, $A$ the action space, $P(s^{\prime}|s,a)$ the state transition function, $r(s,a)$ the reward function, and $\gamma$ the discount factor. The goal is to learn and to use a policy $\pi$ to maximize the discounted cumulative reward $\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T}\gamma^{t}r_{t}\right]$ . For reasoning with LLMs, we are more focused on using an existing LLM to achieve the best reasoning.
### 3.2 Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a decision-making algorithm involving a search tree to simulate and evaluate actions. The algorithm operates in the following four phases:
Node Selection: The selection process begins at the root, selecting nodes hierarchically using strategies like UCT as the criterion to favor a child node based on its quality and novelty.
Expansion: New child nodes are added to the selected leaf node by sampling $d$ possible actions, predicting the next state. If the leaf node is fully explored or terminal, expansion is skipped.
Simulation: During simulation or “rollout”, the algorithm plays out the “game” randomly from that node to a terminal state using a default policy.
Backpropagation: Once a terminal state is reached, the reward is propagated up the tree, and each node visited during the selection phase updates its value based on the simulation result.
Through iterative application of its four phases, MCTS efficiently improves reasoning through trials and heuristics, converging on the optimal solution.
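The four phases above can be sketched as a generic loop. This is a minimal illustration of standard MCTS, not the exact SC-MCTS∗ implementation; `expand`, `simulate`, and `is_terminal` are placeholder callbacks of our own naming:

```python
import math
import random

class Node:
    """A search-tree node holding visit counts and a running average reward."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=1.0):
    # Pick the child maximizing the UCT criterion (quality + novelty bonus).
    return max(node.children,
               key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root, expand, simulate, is_terminal, iterations=100):
    """Generic four-phase MCTS loop: select, expand, simulate, backpropagate."""
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while every child has been visited at least once.
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = uct_select(node)
        # 2. Expansion: add children unless the node is terminal.
        if not is_terminal(node.state) and not node.children:
            node.children = [Node(s, parent=node) for s in expand(node.state)]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = random.choice(unvisited or node.children)
        # 3. Simulation: roll out from this node with a default policy.
        reward = simulate(node.state)
        # 4. Backpropagation: update running averages up to the root.
        while node is not None:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)
```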
### 3.3 Contrastive Decoding
We discuss vanilla Contrastive Decoding (CD) from Li et al. (2023), which improves text generation in LLMs by reducing errors like repetition and self-contradiction. CD uses the differences between an expert model and an amateur model, enhancing the expert’s strengths and suppressing the amateur’s weaknesses. Consider a prompt of length $n$ , the CD objective is defined as:
$$
{\mathcal{L}}_{\text{CD}}(x_{\text{cont}},x_{\text{pre}})=\log p_{\text{EXP}}(
x_{\text{cont}}|x_{\text{pre}})-\log p_{\text{AMA}}(x_{\text{cont}}|x_{\text{
pre}})
$$
where $x_{\text{pre}}$ is the sequence of tokens $x_{1},\dots,x_{n}$ , the model generates continuations of length $m$ , $x_{\text{cont}}$ is the sequence of tokens $x_{n+1},\dots,x_{n+m}$ , and $p_{\text{EXP}}$ and $p_{\text{AMA}}$ are the expert and amateur probability distributions. To avoid penalizing correct behavior of the amateur or promoting implausible tokens, CD applies an adaptive plausibility constraint using an $\alpha$ -mask, which filters tokens by their logits against a threshold, the filtered vocabulary $V_{\text{valid}}$ is defined as:
$$
V_{\text{valid}}=\{i\mid s^{(i)}_{\text{EXP}}\geq\log\alpha+\max_{k}s^{(k)}_{
\text{EXP}}\}
$$
where $s^{(i)}_{\text{EXP}}$ and $s^{(i)}_{\text{AMA}}$ are the unnormalized logits assigned to token $i$ by the expert and amateur models. Final logits are adjusted with a coefficient $(1+\beta)$, modifying the contrastive effect on output scores (Liu et al., 2021):
$$
s^{(i)}_{\text{CD}}=(1+\beta)s^{(i)}_{\text{EXP}}-s^{(i)}_{\text{AMA}}
$$
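Concretely, the $\alpha$-mask and contrastive adjustment above can be sketched as follows (a minimal NumPy sketch; the function name and toy logits are ours):

```python
import numpy as np

def contrastive_logits(s_exp, s_ama, alpha=0.1, beta=0.5):
    """Token-level contrastive decoding scores.

    Tokens whose expert logit falls below log(alpha) + max expert logit are
    masked out (adaptive plausibility constraint); surviving tokens get
    (1 + beta) * s_EXP - s_AMA.
    """
    s_exp = np.asarray(s_exp, dtype=float)
    s_ama = np.asarray(s_ama, dtype=float)
    valid = s_exp >= np.log(alpha) + s_exp.max()   # V_valid
    s_cd = (1.0 + beta) * s_exp - s_ama
    s_cd[~valid] = -np.inf  # implausible tokens can never be selected
    return s_cd
```

With the toy expert logits `[8, 1, 1]` and amateur logits `[6, 2, 2]`, only the first token survives the mask and its score becomes $1.5\times 8-6=6$.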
However, our proposed CD is at action level, averaging over the whole action, instead of token level in vanilla CD. Our novel action-level CD reward more robustly captures the differences in confidence between the expert and amateur models in the generated answers compared to vanilla CD. The distinction will be illustrated in Section 4.1 and explained further in Appendix A.
### 3.4 Speculative Decoding as "free lunch"
Based on Speculative Decoding (Leviathan et al., 2023), the process can be summarized as follows: Let $M_{p}$ be the target model with the conditional distribution $p(x_{t}|x_{<t})$ , and $M_{q}$ be a smaller approximation model with $q(x_{t}|x_{<t})$ . The key idea is to generate $\gamma$ tokens using $M_{q}$ and filter them against $M_{p}$ ’s distribution, accepting tokens consistent with $M_{p}$ . Speculative decoding samples $\gamma$ tokens autoregressively from $M_{q}$ , keeping those where $q(x)\leq p(x)$ . If $q(x)>p(x)$ , the sample is rejected with probability $1-\frac{p(x)}{q(x)}$ , and a new sample is drawn from the adjusted distribution:
$$
p^{\prime}(x)=\text{norm}(\max(0,p(x)-q(x))).
$$
Since both contrastive and speculative decoding rely on the same smaller models, we can achieve the acceleration effect of speculative decoding as a "free lunch" (Yuan et al., 2024a).
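The acceptance step above can be sketched for a single draft token (a simplified illustration; function and variable names are ours):

```python
import numpy as np

def speculative_accept(x_draft, p, q, rng):
    """One acceptance step of speculative decoding.

    x_draft: token proposed by the small model M_q.
    p, q: target and draft probability vectors over the vocabulary.
    Accept x_draft with probability min(1, p[x]/q[x]); otherwise resample
    from the residual distribution norm(max(0, p - q)).
    """
    if rng.random() < min(1.0, p[x_draft] / q[x_draft]):
        return x_draft
    residual = np.maximum(0.0, p - q)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

This rule guarantees the accepted tokens are distributed exactly as under the target model $M_p$, which is why the speedup comes at no cost in output quality.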
## 4 Method
### 4.1 Multi-Reward Design
Our primary goal is to design novel and high-performance reward models for MCTS reasoning and to maximize the performance of reward model combinations, as our ablation experiments in Section 5.5 demonstrate that MCTS performance is almost entirely determined by the reward model.
SC-MCTS∗ is guided by three highly interpretable reward models: contrastive JS divergence, loglikelihood, and self evaluation. Previous work such as Hao et al. (2023) often directly adds reward functions with mismatched numerical magnitudes, without any prior statistical analysis or linear combination. As a result, their combined reward models may fail to demonstrate full performance. Moreover, combining multiple rewards online presents numerous challenges, such as distributional shifts in the values. Thus, we propose a statistically informed reward combination method, the Multi-RM method. Each reward model is normalized contextually by the fine-grained prior statistics of its empirical distribution. The pseudocode for reward model construction is shown in Algorithm 1. Please refer to Appendix D for a complete version of SC-MCTS∗ that includes other improvements, such as dealing with distribution shift when combining reward functions online.
Algorithm 1 SC-MCTS ∗, reward model construction
1: Expert LLM $\pi_{e}$ , Amateur SLM $\pi_{a}$ , Problem set $D$ ; $M$ selected problems for prior statistics, $N$ pre-generated solutions per problem, $K$ clusters
2: $\tilde{A}\leftarrow\text{Sample-solutions}(\pi_{e},D,M,N)$ $\triangleright$ Pre-generate $M\times N$ solutions
3: $p_{e},p_{a}\leftarrow\text{Evaluate}(\pi_{e},\pi_{a},\tilde{A})$ $\triangleright$ Get policy distributions
4: for $r\in\{\text{JSD},\text{LL},\text{SE}\}$ do
5: $\bm{\mu}_{r},\bm{\sigma}_{r},\bm{b}_{r}\leftarrow\text{Cluster-stats}(r(\tilde {A}),K)$ $\triangleright$ Prior statistics (Equation 1)
6: $R_{r}\leftarrow x\mapsto(r(x)-\mu_{r}^{k^{*}})/\sigma_{r}^{k^{*}}$ $\triangleright$ Reward normalization (Equation 2)
7: end for
8: $R\leftarrow\sum_{r\in\{\text{JSD},\text{LL},\text{SE}\}}w_{r}R_{r}$ $\triangleright$ Composite reward
9: $A_{D}\leftarrow\text{MCTS-Reasoning}(\pi_{e},R,D,\pi_{a})$ $\triangleright$ Search solutions guided by $R$
10: return $A_{D}$
#### Jensen-Shannon Divergence
The Jensen-Shannon divergence (JSD) is a symmetric and bounded measure of similarity between two probability distributions $P$ and $Q$ . It is defined as:
$$
\mathrm{JSD}(P\,\|\,Q)=\frac{1}{2}\mathrm{KL}(P\,\|\,M)+\frac{1}{2}\mathrm{KL}
(Q\,\|\,M),\quad M=\frac{1}{2}(P+Q),
$$
where $\mathrm{KL}(P\,\|\,Q)$ is the Kullback-Leibler Divergence (KLD), and $M$ represents the midpoint distribution. The JSD is bounded between 0 and 1 for discrete distributions (when base-2 logarithms are used), making it better suited than KLD for online normalization in reward modeling.
Inspired by contrastive decoding, we propose our novel reward model: JSD between the expert model’s logits and the amateur model’s logits. Unlike vanilla token-level contrastive decoding (Li et al., 2023), our reward is computed at action-level, treating a sequence of action tokens as a whole:
$$
R_{\text{JSD}}=\frac{1}{n}\sum_{i=T_{\text{prefix}}+1}^{n}\mathrm{JSD}\left(p_{\text{e}}(x_{i}|x_{<i})\,\|\,p_{\text{a}}(x_{i}|x_{<i})\right)
$$
where $n$ is the length of tokens, $T_{\text{prefix}}$ is the index of the last prefix token, $p_{\text{e}}$ and $p_{\text{a}}$ represent the softmax probabilities of the expert and amateur models, respectively. This approach ensures that the reward captures model behavior at the action level as the entire sequence of tokens is taken into account at once. This contrasts with vanilla token-level methods where each token is treated serially.
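A minimal sketch of this action-level JSD reward, assuming per-position logits from both models are available (function names are ours; here the mean is taken over the answer tokens, and natural logarithms bound the per-token JSD by $\ln 2$):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def jsd(p, q, eps=1e-12):
    """Per-position Jensen-Shannon divergence between probability rows."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_reward(expert_logits, amateur_logits, t_prefix):
    """Action-level reward: average per-token JSD over the answer span."""
    p_e = softmax(np.asarray(expert_logits, dtype=float)[t_prefix:])
    p_a = softmax(np.asarray(amateur_logits, dtype=float)[t_prefix:])
    return float(np.mean(jsd(p_e, p_a)))
```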
#### Loglikelihood
Inspired by Hao et al. (2023), we use a loglikelihood reward model to evaluate the quality of generated answers based on a given question prefix. The model computes logits for the full sequence (prefix + answer) and accumulates the log-probabilities over the answer part tokens.
Let the full sequence $x=(x_{1},x_{2},\dots,x_{T_{\text{total}}})$ consist of a prefix and a generated answer. The loglikelihood reward $R_{\text{LL}}$ is calculated over the answer portion:
$$
R_{\text{LL}}=\sum_{i=T_{\text{prefix}}+1}^{T_{\text{total}}}\log\left(\frac{
\exp(z_{\theta}(x_{i}))}{\sum_{x^{\prime}\in V}\exp(z_{\theta}(x^{\prime}))}\right)
$$
where $z_{\theta}(x_{i})$ represents the unnormalized logit for token $x_{i}$ . After calculating logits for the entire sequence, we discard the prefix and focus on the answer tokens to form the loglikelihood reward.
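This computation can be sketched as follows (a NumPy illustration with hypothetical names; `logits[i]` is assumed to be the logit row scoring `token_ids[i]`):

```python
import numpy as np

def loglikelihood_reward(logits, token_ids, t_prefix):
    """R_LL: sum of answer-token log-probabilities under the model.

    The prefix positions are discarded; only answer tokens (indices
    t_prefix .. T_total - 1) contribute to the reward.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=-1))
    log_probs = shifted - log_z[:, None]
    return float(sum(log_probs[i, token_ids[i]]
                     for i in range(t_prefix, len(token_ids))))
```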
#### Self Evaluation
Large language models’ token-level self evaluation can effectively quantify the model’s uncertainty, thereby improving the quality of selective generation (Ren et al., 2023). We instruct the LLM to perform self evaluation on its answers using an action-level evaluation method, with a self evaluation prompt that explicitly elicits the model’s uncertainty.
After generating the answer, we prompt the model to self-evaluate its response by asking "Is this answer correct/good?" This serves to capture the model’s confidence in its own output leading to more informed decision-making. The self evaluation prompt’s logits are then used to calculate a reward function. Similar to the loglikelihood reward model, we calculate the self evaluation reward $R_{\text{SE}}$ by summing the log-probabilities over the self-evaluation tokens.
#### Harnessing Multiple Reward Models
We collected prior distributions for the reward models and found some of them span multiple regions. Therefore, we compute the fine-grained prior statistics as mean and standard deviation of modes of the prior distribution ${\mathcal{R}}\in\{{\mathcal{R}}_{\text{JSD}},{\mathcal{R}}_{\text{LL}},{ \mathcal{R}}_{\text{SE}}\}$ :
$$
\mu^{(k)}=\frac{1}{c_{k}}\sum_{R_{i}\in[b_{k},\,b_{k+1})}R_{i}\quad\text{and}\quad\sigma^{(k)}=\sqrt{\frac{1}{c_{k}}\sum_{R_{i}\in[b_{k},\,b_{k+1})}(R_{i}-\mu^{(k)})^{2}} \tag{1}
$$
where $b_{1}<b_{2}<\dots<b_{K+1}$ are the region boundaries in ${\mathcal{R}}$, $R_{i}\in{\mathcal{R}}$, and $c_{k}$ is the number of $R_{i}$ in $[b_{k},b_{k+1})$. The region boundaries were defined during the prior statistical data collection phase.
After we computed the fine-grained prior statistics, the reward factors are normalized separately for each region (which degenerates to standard normalization if only a single region is found):
$$
R_{\text{norm}}(x)=(R(x)-\mu^{(k^{*})})/\sigma^{(k^{*})},~{}\text{where}~{}k^{
*}=\operatorname*{arg\,max}\{k:b_{k}\leq R(x)\} \tag{2}
$$
This reward design, which we call Multi-RM method, has some caveats: first, to prevent distribution shift during reasoning, we update the mean and standard deviation of the reward functions online for each mode (see Appendix D for pseudocode); second, we focus only on cases with clearly distinct reward modes, leaving general cases for future work. For the correlation heatmap, see Appendix C.
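Equations 1 and 2 can be sketched as follows, assuming the region boundaries are already known from the prior-statistics phase and every region contains samples (names are ours; the online incremental update of Appendix D is omitted):

```python
import numpy as np

def region_stats(samples, boundaries):
    """Per-region mean/std of a reward's prior samples (Eq. 1).

    boundaries = [b_1, ..., b_{K+1}] partitions the reward range into K regions.
    """
    stats = []
    for k in range(len(boundaries) - 1):
        lo, hi = boundaries[k], boundaries[k + 1]
        in_region = [r for r in samples if lo <= r < hi]
        stats.append((float(np.mean(in_region)), float(np.std(in_region))))
    return stats

def normalize(r, boundaries, stats):
    """Normalize a reward by the stats of the region it falls into (Eq. 2)."""
    k_star = max(k for k in range(len(boundaries) - 1) if boundaries[k] <= r)
    mu, sigma = stats[k_star]
    return (r - mu) / sigma
```

The composite reward is then the weighted sum $R=\sum_r w_r R_r$ over the normalized JSD, loglikelihood, and self-evaluation factors.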
### 4.2 Node Selection Strategy
Upper Confidence Bound applied on Trees Algorithm (UCT) (Coquelin & Munos, 2007) is crucial for the selection phase, balancing exploration and exploitation by choosing actions that maximize:
$$
UCT_{j}=\bar{X}_{j}+C\sqrt{\frac{\ln N}{N_{j}}}
$$
where $\bar{X}_{j}$ is the average reward of taking action $j$ , $N$ is the number of times the parent has been visited, and $N_{j}$ is the number of times node $j$ has been visited for simulation, $C$ is a constant to balance exploitation and exploration.
However, $C$ is a crucial part of UCT. Previous work (Hao et al., 2023; Zhang et al., 2024b) rarely investigated it thoroughly, leading to potential failures of the UCT strategy: they often used the default value of 1 from the originally proposed UCT (Coquelin & Munos, 2007) without conducting sufficient quantitative experiments to find the optimal $C$. This is discussed in detail in Section 5.4.
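The sensitivity to $C$ is easy to see on a toy two-child example (hypothetical visit counts and rewards, chosen only to illustrate how the selected child flips with $C$):

```python
import math

def uct(x_bar, n_parent, n_child, c):
    """UCT score for a child: average reward plus exploration bonus."""
    return x_bar + c * math.sqrt(math.log(n_parent) / n_child)

# Two children of a parent visited 100 times:
# "a" is well explored with a higher average reward, "b" is barely visited.
PARENT_VISITS = 100
A = dict(x_bar=0.6, visits=90)
B = dict(x_bar=0.4, visits=10)

def pick(c):
    score_a = uct(A["x_bar"], PARENT_VISITS, A["visits"], c)
    score_b = uct(B["x_bar"], PARENT_VISITS, B["visits"], c)
    return "a" if score_a >= score_b else "b"
```

With these numbers the choice flips around $C\approx 0.44$: a small $C$ exploits the higher-reward child, while the default $C=1$ explores the rarely visited one, so an untuned constant can dominate the selection phase.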
### 4.3 Backpropagation
After each MCTS iteration, multiple paths from the root to terminal nodes are generated. By backpropagating along these paths, we update the value of each state-action pair. Previous MCTS approaches often use simple averaging during backpropagation, but this can overlook paths where the goal-achieved metric $G(p)$ progresses smoothly (e.g., $G(p_{1})=0\rightarrow 0.25\rightarrow 0.5\rightarrow 0.75$). Such paths, only a few steps away from the final goal $G(p)=1$, are often more valuable than less stable ones.
To improve value propagation, we propose an algorithm that better captures value progression along a path. Given a path $\mathbf{P}=\{p_{1},p_{2},\dots,p_{n}\}$ with $n$ nodes, where each $p_{i}$ represents the value at node $i$ , the total value is calculated by summing the increments between consecutive nodes with a length penalty. The increment between nodes $p_{i}$ and $p_{i-1}$ is $\Delta_{i}=p_{i}-p_{i-1}$ . Negative increments are clipped at $-0.1$ and downweighted by 0.5. The final path value $V_{\text{final}}$ is:
$$
V_{\text{final}}=\sum_{i=2}^{n}\left\{\begin{array}[]{ll}\Delta_{i},&\text{if
}\Delta_{i}\geq 0\\
0.5\times\max(\Delta_{i},-0.1),&\text{if }\Delta_{i}<0\end{array}\right\}-
\lambda\times n \tag{3}
$$
where $n$ is the number of nodes in the path and $\lambda=0.1$ is the penalty factor to discourage long paths.
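Equation 3 can be implemented directly (a minimal sketch; `values` holds the node values $p_1,\dots,p_n$ along a path):

```python
def path_value(values, lam=0.1):
    """Path value from Eq. 3: sum the per-step increments, clipping and
    downweighting negative ones, then subtract the length penalty lam * n."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if delta >= 0:
            total += delta
        else:
            total += 0.5 * max(delta, -0.1)  # clip at -0.1, downweight by 0.5
    return total - lam * len(values)
```

For the smooth path $0\rightarrow 0.25\rightarrow 0.5\rightarrow 0.75$ this gives $0.75-0.1\times 4=0.35$, while a path with a dip is penalized through the clipped negative increment.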
## 5 Experiments
### 5.1 Dataset
Blocksworld (Valmeekam et al., 2024; 2023) is a classic domain in AI research for reasoning and planning, where the goal is to rearrange blocks into a specified configuration using actions like 'pick-up', 'put-down', 'stack', and 'unstack'. A block can be moved only if no block is on top of it, and only one block may be moved at a time. The reasoning process in Blocksworld is an MDP. At time step $t$, the LLM agent selects an action $a_{t}\sim p(a\mid s_{t},c)$, where $s_{t}$ is the current block configuration and $c$ is the prompt template. The state transition $s_{t+1}=P(s_{t},a_{t})$ is deterministic and computed by rules. This forms a trajectory of interleaved states and actions $(s_{0},a_{0},s_{1},a_{1},\dots,s_{T})$ toward the goal state.
One key feature of Blocksworld is its built-in verifier, which tracks progress toward the goal at each step. This makes Blocksworld ideal for studying heuristic LLM multi-step reasoning. However, we deliberately avoid using the verifier as part of the reward model as it is task-specific. More details of Blocksworld can be found in Appendix F.
### 5.2 Main Results
To evaluate the SC-MCTS ∗ algorithm in LLM multi-step reasoning, we implemented CoT, RAP-MCTS, and SC-MCTS ∗ using Llama-3-70B and Llama-3.1-70B. For comparison, we used Llama-3.1-405B and GPT-4o for CoT, and applied 0-shot and 4-shot single-turn prompting for o1-mini, as OpenAI (2024b) suggests avoiding CoT prompting. The experiments were conducted on the Blocksworld dataset across all steps and difficulties. For LLM settings and GPU and OpenAI API usage data, see Appendix E and H.
| Mode | Models | Method | Step 2 | Step 4 | Step 6 | Step 8 | Step 10 | Step 12 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Easy | Llama-3-70B w/ Llama-3.2-1B | 4-shot CoT | 0.2973 | 0.4405 | 0.3882 | 0.2517 | 0.1696 | 0.1087 | 0.2929 |
| | | RAP-MCTS | 0.9459 | 0.9474 | 0.8138 | 0.4196 | 0.2136 | 0.1389 | 0.5778 |
| | | SC-MCTS* (Ours) | 0.9730 | 0.9737 | 0.8224 | 0.4336 | 0.2136 | 0.2222 | 0.5949 |
| | Llama-3.1-70B w/ Llama-3.2-1B | 4-shot CoT | 0.5405 | 0.4868 | 0.4069 | 0.2238 | 0.2913 | 0.2174 | 0.3441 |
| | | RAP-MCTS | 1.0000 | 0.9605 | 0.8000 | 0.4336 | 0.2039 | 0.1111 | 0.5796 |
| | | SC-MCTS* (Ours) | 1.0000 | 0.9737 | 0.7724 | 0.4503 | 0.3010 | 0.1944 | 0.6026 |
| | Llama-3.1-405B | 0-shot CoT | 0.8108 | 0.6579 | 0.5931 | 0.5105 | 0.4272 | 0.3611 | 0.5482 |
| | | 4-shot CoT | 0.7838 | 0.8553 | 0.6483 | 0.4266 | 0.5049 | 0.4167 | 0.5852 |
| | o1-mini | 0-shot | 0.9730 | 0.7368 | 0.5103 | 0.3846 | 0.3883 | 0.1944 | 0.4463 |
| | | 4-shot | 0.9459 | 0.8026 | 0.6276 | 0.3497 | 0.3301 | 0.2222 | 0.5167 |
| | GPT-4o | 0-shot CoT | 0.5405 | 0.4868 | 0.3241 | 0.1818 | 0.1165 | 0.0556 | 0.2666 |
| | | 4-shot CoT | 0.5135 | 0.6579 | 0.6000 | 0.2797 | 0.3010 | 0.3611 | 0.4444 |
| Hard | Llama-3-70B w/ Llama-3.2-1B | 4-shot CoT | 0.5556 | 0.4405 | 0.3882 | 0.2517 | 0.1696 | 0.1087 | 0.3102 |
| | | RAP-MCTS | 1.0000 | 0.8929 | 0.7368 | 0.4503 | 0.1696 | 0.1087 | 0.5491 |
| | | SC-MCTS* (Ours) | 0.9778 | 0.8929 | 0.7566 | 0.5298 | 0.2232 | 0.1304 | 0.5848 |
| | Llama-3.1-70B w/ Llama-3.2-1B | 4-shot CoT | 0.6222 | 0.2857 | 0.3421 | 0.1722 | 0.1875 | 0.2174 | 0.2729 |
| | | RAP-MCTS | 0.9778 | 0.9048 | 0.7829 | 0.4702 | 0.1875 | 0.1087 | 0.5695 |
| | | SC-MCTS* (Ours) | 0.9778 | 0.9405 | 0.8092 | 0.4702 | 0.1696 | 0.2174 | 0.5864 |
| | Llama-3.1-405B | 0-shot CoT | 0.7838 | 0.6667 | 0.6053 | 0.3684 | 0.2679 | 0.2609 | 0.4761 |
| | | 4-shot CoT | 0.8889 | 0.6667 | 0.6579 | 0.4238 | 0.5804 | 0.5217 | 0.5915 |
| | o1-mini | 0-shot | 0.6889 | 0.4286 | 0.1776 | 0.0993 | 0.0982 | 0.0000 | 0.2034 |
| | | 4-shot | 0.9556 | 0.8452 | 0.5263 | 0.3907 | 0.2857 | 0.1739 | 0.4966 |
| | GPT-4o | 0-shot CoT | 0.6222 | 0.3929 | 0.3026 | 0.1523 | 0.0714 | 0.0000 | 0.2339 |
| | | 4-shot CoT | 0.6222 | 0.4167 | 0.5197 | 0.3642 | 0.3304 | 0.1739 | 0.4102 |
Table 1: Accuracy of various reasoning methods and models across steps and difficulty modes on the Blocksworld multi-step reasoning dataset.
From Table 1, it can be observed that SC-MCTS ∗ significantly outperforms RAP-MCTS and 4-shot CoT in both easy and hard modes; moreover, in easy mode, the Llama-3.1-70B model using SC-MCTS ∗ outperforms the 4-shot CoT Llama-3.1-405B model.
Figure 2: Accuracy comparison of various models and reasoning methods on the Blocksworld multi-step reasoning dataset across increasing reasoning steps.
From Figure 2, we observe that as the reasoning path lengthens, the accuracy advantage of the two MCTS reasoning algorithms over the explicit multi-turn CoT chats of GPT-4o and Llama-3.1-405B and the implicit multi-turn chats of o1-mini (OpenAI, 2024b) diminishes, becoming particularly evident after Step 6. The accuracy decline for CoT is more gradual as the reasoning path extends, whereas models employing MCTS reasoning exhibit a steeper decline. This trend could be due to the fixed iteration limit of 10 across different reasoning path lengths, which might be unfair to longer paths; future work could explore dynamically adjusting the iteration limit based on reasoning path length. It may also be attributed to our use of a custom EOS token to ensure output format stability in the MCTS reasoning process, which operates in completion mode. As the number of steps and prompt prefix lengths increases, the limitations of completion mode may become more pronounced compared to the chat mode used in multi-turn chats. Additionally, we observe that Llama-3.1-405B benefits significantly from its huge parameter count: although it underperforms at fewer steps, it experiences the slowest accuracy decline as the reasoning path grows longer.
### 5.3 Reasoning Speed
Figure 3: Speedup comparison of different model combinations. For speculative decoding, we use Llama-3.2-1B and Llama-3.1-8B as amateur models with Llama-3.1-70B and Llama-3.1-405B as expert models, based on average node-level reasoning speed in MCTS on the Blocksworld multi-step reasoning dataset.
As shown in Figure 3, the combination of Llama-3.1-405B with Llama-3.1-8B achieves the highest speedup, improving inference speed by approximately 100% compared to vanilla decoding. Similarly, pairing Llama-3.1-70B with Llama-3.2-1B results in a 51.9% increase in reasoning speed. These two combinations provide the most significant gains, demonstrating that speculative decoding with SLMs can substantially enhance node-level reasoning speed. However, the combination of Llama-3.1-405B with Llama-3.2-1B shows that the draft SLM in speculative decoding should not be too small. Since the threshold for accepting draft tokens remains fixed to prevent speculative decoding from affecting performance (Leviathan et al., 2023), an overly small draft model can negatively impact decoding speed, consistent with the findings of Zhao et al. (2024) and Chen et al. (2023).
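The acceptance rule of Leviathan et al. (2023) explains why the threshold is fixed: a drafted token is kept with probability $\min(1, p_{\text{target}}/p_{\text{draft}})$, which keeps the output distribution identical to the target model's. A minimal sketch (function name and probabilities are illustrative):

```python
import random

def accept_draft(p_target: float, p_draft: float, rng=random) -> bool:
    """Keep a drafted token with probability min(1, p_target / p_draft)."""
    return rng.random() < min(1.0, p_target / p_draft)

# A token the target model likes at least as much as the draft is always kept:
print(accept_draft(0.9, 0.3))  # True
# A token the draft over-proposes is often rejected, forcing a costly re-decode;
# a too-small draft model produces many such tokens, eroding the speedup.
```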
### 5.4 Parameters
Figure 4: Accuracy comparison of different constant $C$ of UCT on Blocksworld multi-step reasoning dataset.
Figure 5: Accuracy comparison of different numbers of iteration on Blocksworld multi-step reasoning dataset.
As discussed in Section 4.2, the constant $C$ is a crucial part of the UCT strategy, as it completely determines whether the exploration term takes effect. We therefore conducted quantitative experiments on the constant $C$ . To eliminate interference from other factors, we use only the MCTS base with the common reward model $R_{\text{LL}}$ for both RAP-MCTS and SC-MCTS ∗. From Figure 4 we can observe that the constant $C$ of RAP-MCTS is too small to function effectively, while the constant $C$ of SC-MCTS ∗ is best matched to the reward-model values derived from extensive experimental data. When new datasets are introduced, this hyperparameter may need to be re-tuned.
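The scale sensitivity is visible directly in the UCT formula: when $Q$-values have magnitudes in the hundreds (e.g., summed log-likelihoods), an exploration constant near 1 contributes a negligible bonus. A toy illustration (the numbers are hypothetical, not from our experiments):

```python
import math

def uct_score(q: float, n_child: int, n_parent: int, c: float) -> float:
    """UCT: exploitation value q plus a c-scaled exploration bonus."""
    if n_child == 0:
        return float("inf")  # unvisited children are always preferred
    return q + c * math.sqrt(math.log(n_parent) / n_child)

# Well-visited child with slightly better value vs. a rarely visited child:
well = dict(q=-595.0, n_child=50, n_parent=100)
rare = dict(q=-600.0, n_child=2, n_parent=100)
print(uct_score(c=1.0, **well) > uct_score(c=1.0, **rare))      # True: tiny C never explores
print(uct_score(c=200.0, **well) > uct_score(c=200.0, **rare))  # False: larger C explores
```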
From Figure 5, it can be observed that the accuracy of SC-MCTS ∗ on multi-step reasoning increases steadily with the number of iterations: during the first 7 iterations the accuracy rises consistently, and after the 7th iteration the improvement becomes smaller, indicating that under the depth-limited experimental setting, the exponentially growing number of explored nodes in later iterations brings diminishing returns in accuracy.
### 5.5 Ablation Study
| Parts of SC-MCTS ∗ | Accuracy (%) | Improvement (%) |
| --- | --- | --- |
| MCTS base | 55.92 | — |
| + $R_{\text{JSD}}$ | 62.50 | +6.58 |
| + $R_{\text{LL}}$ | 67.76 | +5.26 |
| + $R_{\text{SE}}$ | 70.39 | +2.63 |
| + Multi-RM Method | 73.68 | +3.29 |
| + Improved $C$ of UCT | 78.95 | +5.27 |
| + BP Refinement | 80.92 | +1.97 |
| SC-MCTS ∗ | 80.92 | Overall +25.00 |
Table 2: Ablation study on the Blocksworld dataset at Step 6 under hard mode. For a more thorough ablation study, the reward model for the MCTS base was set to pseudo-random numbers.
As shown in Table 2, the results of the ablation study demonstrate that each component of SC-MCTS ∗ contributes significantly to performance improvements. Starting from a base MCTS accuracy of 55.92%, adding $R_{\text{JSD}}$ , $R_{\text{LL}}$ , and $R_{\text{SE}}$ yields a combined improvement of 14.47%. Multi-RM method further boosts performance by 3.29%, while optimizing the $C$ parameter in UCT adds 5.27%, and the backpropagation refinement increases accuracy by 1.97%. Overall, SC-MCTS ∗ achieves an accuracy of 80.92%, a 25% improvement over the base, demonstrating the effectiveness of these enhancements for complex reasoning tasks.
### 5.6 Interpretability Study
In the Blocksworld multi-step reasoning dataset, we utilize a built-in ground truth verifier to measure the percentage of progress toward achieving the goal at a given step, denoted as $P$ . The value of $P$ ranges between $[0,1]$ . For any arbitrary non-root node $N_{i}$ , the progress is defined as:
$$
P(N_{i})=\text{Verifier}(N_{i}).
$$
For instance, in a 10-step Blocksworld reasoning task, the initial node $A$ has $P(A)=0$ . After executing one correct action and transitioning to the next node $B$ , the progress becomes $P(B)=0.1$ .
Given a non-root node $N_{i}$ , reached from its parent node $\text{Parent}(N_{i})$ through a specific action $a$ , the contribution of $a$ toward the final goal state is defined as:
$$
\Delta_{a}=P(N_{i})-P(\text{Parent}(N_{i})).
$$
Next, by analyzing the relationship between $\Delta_{a}$ and the reward value $R_{a}$ assigned by the reward model for action $a$ , we aim to reveal how our designed reward model provides highly interpretable reward signals for the selection of each node in MCTS. We also compare our reward model against a baseline reward model. Specifically, the alignment between $\Delta_{a}$ and $R_{a}$ demonstrates the interpretability of the reward model in guiding the reasoning process toward the goal state. Since Section 5.5 has already demonstrated that MCTS reasoning performance is almost entirely determined by the reward model, using interpretable reward models greatly enhances the interpretability of our algorithm SC-MCTS ∗.
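Concretely, the alignment can be quantified as a correlation between $\Delta_{a}$ and $R_{a}$ over many actions. The sketch below uses hypothetical values and a stdlib-only Pearson helper as a stand-in for `scipy.stats.pearsonr`:

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical progress increments and the rewards a model assigned to them:
deltas = [0.1, 0.1, -0.1, 0.1, 0.0, -0.1]
rewards = [1.2, 0.9, -0.8, 1.1, 0.1, -1.0]
print(round(pearson(deltas, rewards), 2))  # close to 1: reward tracks true progress
```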
<details>
<summary>extracted/6087579/fig/reward.png Details</summary>

Two side-by-side reward histograms (frequency vs. reward), with bars colored by the proportion of positive $\Delta_{a}$ (shared color bar, 0.0 to 0.6; darker means lower). Left, baseline (RAP-MCTS): rewards centered around −600 (range roughly −650 to −550); Spearman 0.01, Pearson 0.01, p-value 0.2624; bars stay uniformly dark across the reward range, showing no relationship between reward and $\Delta_{a}$. Right, SC-MCTS*: rewards centered near 0 (range roughly −5 to 5); Spearman 0.32, Pearson 0.32, p-value < 0.0001; bar color brightens steadily as reward increases, so higher rewards coincide with a higher proportion of positive $\Delta_{a}$.
</details>
Figure 6: Reward distribution and interpretability analysis. The left histogram shows the baseline reward model (RAP-MCTS), while the right represents SC-MCTS ∗. Bin colors indicate the proportion of positive $\Delta_{a}$ (lighter colors mean higher proportions). Spearman and Pearson correlations along with p-values are shown in the top right of each histogram.
Figure 6 shows that SC-MCTS* reward values correlate significantly with $\Delta_{a}$, as indicated by the high Spearman and Pearson coefficients. Additionally, the mapping between reward value bins and the proportion of positive $\Delta_{a}$ (indicated by the color gradient) is highly consistent and intuitive. This strong alignment suggests that our reward model effectively captures progress toward the goal state, providing interpretable signals for action selection during reasoning.
These results highlight the exceptional interpretability of our designed reward model, which ensures that SC-MCTS* not only achieves superior reasoning performance but is also highly interpretable. This interpretability is crucial for understanding and improving the decision-making process in multi-step reasoning tasks, further validating the transparency of our proposed algorithm.
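The alignment analysis behind Figure 6 can be reproduced in spirit with a few lines of NumPy; the reward and $\Delta_{a}$ arrays below are synthetic stand-ins for illustration, not our experimental data:

```python
import numpy as np

rng = np.random.default_rng(0)
delta_a = rng.choice([-0.1, 0.0, 0.1], size=1000)         # progress change per action
reward = 2.0 * delta_a + rng.normal(0.0, 0.3, size=1000)  # rewards loosely tracking progress

pearson = np.corrcoef(reward, delta_a)[0, 1]

def ranks(x):
    # rank transform (ties broken by index order; fine for illustration)
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x))
    return r

# Spearman correlation = Pearson correlation of the rank-transformed data
spearman = np.corrcoef(ranks(reward), ranks(delta_a))[0, 1]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```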
## 6 Conclusion
In this paper, we present SC-MCTS ∗, a novel and effective algorithm for enhancing the reasoning capabilities of LLMs. With extensive improvements in reward modeling, node selection strategy, and backpropagation, SC-MCTS ∗ boosts both accuracy and speed, outperforming OpenAI's o1-mini by 17.4% on average on the Blocksworld dataset using Llama-3.1-70B. Experiments demonstrate its strong performance, making it a promising approach for multi-step reasoning tasks. For future work, please refer to Appendix J. The synthesis of interpretability, efficiency, and generalizability positions SC-MCTS ∗ as a valuable contribution to advancing multi-step reasoning in LLMs.
## References
- Bellman (1957) Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957. ISSN 00959057, 19435274. URL http://www.jstor.org/stable/24900506.
- Chen et al. (2023) Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023. URL https://arxiv.org/abs/2302.01318.
- Chen et al. (2024) Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, and Wanxiang Che. Unlocking the boundaries of thought: A reasoning granularity framework to quantify and optimize chain-of-thought, 2024. URL https://arxiv.org/abs/2410.05695.
- Coquelin & Munos (2007) Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI’07, pp. 67–74, Arlington, Virginia, USA, 2007. AUAI Press. ISBN 0974903930.
- Frantar et al. (2022) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2022.
- Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.507. URL https://aclanthology.org/2023.emnlp-main.507.
- Hao et al. (2024) Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, Zhen Wang, and Zhiting Hu. LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/forum?id=h1mvwbQiXR.
- Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, and Trevor Back. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, Jul 2021. doi: https://doi.org/10.1038/s41586-021-03819-2. URL https://www.nature.com/articles/s41586-021-03819-2.
- Leviathan et al. (2023) Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Li et al. (2023) Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12286–12312, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URL https://aclanthology.org/2023.acl-long.687.
- Liu et al. (2021) Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6691–6706, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522.
- McAleese et al. (2024) Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. Llm critics help catch llm bugs, 2024.
- O’Brien & Lewis (2023) Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models, 2023. URL https://arxiv.org/abs/2309.09117.
- OpenAI (2024a) OpenAI. Introducing openai o1. https://openai.com/o1/, 2024a. Accessed: 2024-10-02.
- OpenAI (2024b) OpenAI. How reasoning works. https://platform.openai.com/docs/guides/reasoning/how-reasoning-works, 2024b. Accessed: 2024-10-02.
- Qi et al. (2024) Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller llms stronger problem-solvers, 2024. URL https://arxiv.org/abs/2408.06195.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 53728–53741. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.
- Ren et al. (2023) Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. Self-evaluation improves selective generation in large language models. In Javier Antorán, Arno Blaas, Kelly Buchanan, Fan Feng, Vincent Fortuin, Sahra Ghalebikesabi, Andreas Kriegler, Ian Mason, David Rohde, Francisco J. R. Ruiz, Tobias Uelwer, Yubin Xie, and Rui Yang (eds.), Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, volume 239 of Proceedings of Machine Learning Research, pp. 49–64. PMLR, 16 Dec 2023. URL https://proceedings.mlr.press/v239/ren23a.html.
- Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016. doi: https://doi.org/10.1038/nature16961.
- Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. URL https://arxiv.org/abs/1712.01815.
- Sprague et al. (2024) Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning, 2024. URL https://arxiv.org/abs/2409.12183.
- Tian et al. (2024) Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. ArXiv, abs/2404.12253, 2024. URL https://api.semanticscholar.org/CorpusID:269214525.
- Valmeekam et al. (2023) Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - a critical investigation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X6dEqXIsEW.
- Valmeekam et al. (2024) Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2024. Curran Associates Inc.
- Wei et al. (2024) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9781713871088.
- Xie et al. (2024) Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024. URL https://arxiv.org/abs/2405.00451.
- Xin et al. (2024a) Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu (Benjamin Liu), Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data. ArXiv, abs/2405.14333, 2024a. URL https://api.semanticscholar.org/CorpusID:269983755.
- Xin et al. (2024b) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024b. URL https://arxiv.org/abs/2408.08152.
- Xu (2023) Haotian Xu. No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, 2023. URL https://arxiv.org/abs/2309.03224.
- Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2024. Curran Associates Inc.
- Yuan et al. (2024a) Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, and Chang Zhou. Speculative contrastive decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 56–64, Bangkok, Thailand, August 2024a. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-short.5.
- Yuan et al. (2024b) Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing llm reasoning generalists with preference trees, 2024b.
- Zhang et al. (2024a) Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search, 2024a. URL https://arxiv.org/abs/2406.03816.
- Zhang et al. (2024b) Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, and Wanli Ouyang. Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, 2024b. URL https://arxiv.org/abs/2406.07394.
- Zhao et al. (2024) Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, and Maosong Sun. Ouroboros: Generating longer drafts phrase by phrase for faster speculative decoding, 2024. URL https://arxiv.org/abs/2402.13720.
## Appendix A Action-Level Contrastive Reward
We draw a distinction between action-level and token-level variables: action-level (or step-level) variables aggregate over all tokens in a reasoning step and are typically consumed directly by the reasoning algorithm; token-level variables, by contrast, operate at a more microscopic, low-level scale, as in speculative decoding.
We found that traditional contrastive decoding based on the difference in logits, when aggregated over the sequence, gives an unstable reward signal compared to JS divergence. We suspect this is due to the unbounded nature of the logit difference and its associated failure modes, which require extra care and more hyperparameter tuning.
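As a sketch of why the bounded JS divergence is preferable, the following computes an action-level JSD reward by averaging over the token distributions of one step (the toy distributions are assumptions for illustration):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two categorical distributions (nats)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# one reasoning step = a sequence of next-token distributions from each model
expert_step = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
amateur_step = [np.array([0.4, 0.4, 0.2]), np.array([0.3, 0.5, 0.2])]

# action-level reward: mean per-token JSD, bounded by ln 2, unlike a raw logit difference
r_jsd = np.mean([js_divergence(p, q) for p, q in zip(expert_step, amateur_step)])
print(f"action-level JSD reward: {r_jsd:.4f}")
```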
## Appendix B More Related Work
#### Large Language Models Multi-Step Reasoning
Deepseek Prover (Xin et al., 2024a; b) relied on Lean4 as an external verification tool to provide dense reward signals in the RL stage. ReST-MCTS ∗ (Zhang et al., 2024a) employed self-training to collect high-quality reasoning trajectories for iteratively improving the value model. AlphaLLM (Tian et al., 2024) used critic models initialized from the policy model as the MCTS reward model. rStar (Qi et al., 2024) utilized mutual consistency of SLMs and an additional math-specific action space. Xu (2023) proposed reconstructing fine-tuned LLMs into residual-based energy models to guide MCTS.
#### Speculative Decoding
Speculative decoding was first introduced in Leviathan et al. (2023), as a method to accelerate sampling from large autoregressive models by computing multiple tokens in parallel without retraining or changing the model structure. It enhances computational efficiency, especially in large-scale generation tasks, by recognizing that hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models. Similarly, DeepMind introduced speculative sampling (Chen et al., 2023), which expands on this idea by generating a short draft sequence using a faster draft model and then scoring this draft with a larger target model.
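A toy sketch of this draft-then-verify loop, using stand-in distributions over a tiny vocabulary rather than real models, might look as follows; the acceptance rule min(1, p/q) and residual resampling follow Leviathan et al. (2023):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5  # toy vocabulary size

def draft_dist(ctx):
    # stand-in for the fast draft model's next-token distribution q
    return np.array([0.4, 0.3, 0.1, 0.1, 0.1])

def target_dist(ctx):
    # stand-in for the large target model's next-token distribution p
    return np.array([0.5, 0.2, 0.1, 0.1, 0.1])

def speculative_step(ctx, gamma=4):
    """Draft up to gamma tokens, accepting/rejecting each under the target model."""
    out = list(ctx)
    for _ in range(gamma):
        q = draft_dist(out)
        x = rng.choice(V, p=q)                    # draft proposes a token
        p = target_dist(out)
        if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
            out.append(int(x))
        else:                                     # reject: resample from residual
            resid = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(V, p=resid / resid.sum())))
            break
    return out

print(speculative_step([0]))
```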
#### Contrastive Decoding
Contrastive decoding, as proposed by Li et al. (2023), is a simple, computationally light, and training-free text generation method that improves quality by identifying strings that highlight the differences between strong and weak models. In this context, the weak model is typically a smaller or less capable language model, while the strong model is a well-trained large language model. This approach has demonstrated notable performance improvements on various inference tasks, including arithmetic reasoning and multiple-choice ranking, thereby increasing the accuracy of language models. Experiments by O'Brien & Lewis (2023) further show that applying contrastive decoding across various tasks is effective in enhancing the reasoning capabilities of LLMs.
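A minimal sketch of the contrastive score (expert log-probability minus amateur log-probability, restricted to tokens the expert deems plausible), with illustrative toy distributions:

```python
import numpy as np

def contrastive_scores(expert_logp, amateur_logp, alpha=0.1):
    """Score = expert - amateur log-prob on the expert's alpha-plausible tokens."""
    expert_p = np.exp(expert_logp)
    plausible = expert_p >= alpha * expert_p.max()  # plausibility constraint
    return np.where(plausible, expert_logp - amateur_logp, -np.inf)

expert_logp = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
amateur_logp = np.log(np.array([0.4, 0.4, 0.1, 0.1]))

# highest score goes to the token the expert prefers most relative to the amateur
print(contrastive_scores(expert_logp, amateur_logp).argmax())
```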
## Appendix C Reward Functions Correlation
<details>
<summary>extracted/6087579/fig/heatmap.png Details</summary>

A symmetric 3×3 Pearson correlation heatmap over the reward functions $R_{LL}$, $R_{SE}$, and $R_{JSD}$, on a diverging blue-to-red scale from −1.00 to 1.00. Diagonal entries are 1.00; off-diagonal values are $R_{LL}$ vs. $R_{SE}$ = 0.13, $R_{LL}$ vs. $R_{JSD}$ = 0.02, and $R_{SE}$ vs. $R_{JSD}$ = −0.10. All cross-correlations have absolute value at most 0.13, indicating the three reward functions capture largely independent signals.
</details>
Figure 7: Reward Functions Correlation Heatmap.
It can be seen from Figure 7 that the correlations between the three reward functions are relatively low, with absolute values all below 0.15. These low correlations make the reward functions well suited to the multi-RM method.
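A simple way to exploit these weakly correlated factors is to z-normalize each with its own statistics and take a weighted sum; the weights and statistics below are illustrative assumptions, not our tuned values:

```python
def multi_rm(factors, stats, weights):
    """Combine reward factors: z-score each with its own (mu, sigma), then weight."""
    total = 0.0
    for name, value in factors.items():
        mu, sigma = stats[name]
        total += weights[name] * (value - mu) / sigma
    return total

# illustrative per-factor statistics and mixing weights
stats = {"JSD": (0.20, 0.05), "LL": (-600.0, 20.0), "SE": (0.50, 0.10)}
weights = {"JSD": 0.5, "LL": 0.3, "SE": 0.2}

print(multi_rm({"JSD": 0.25, "LL": -580.0, "SE": 0.55}, stats, weights))
```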
## Appendix D Algorithm Details of SC-MCTS ∗
The pseudocode for MCTS reasoning in SC-MCTS ∗ is shown in Algorithm 2, based on Zhang et al. (2024a). The complete SC-MCTS ∗ procedure is as follows: first, sample a subset of problems to obtain prior data for the reward values (Algorithm 1); then use this prior together with two SLMs, one providing contrastive reward signals and the other speculative decoding speedup, to perform MCTS reasoning. Changes in SC-MCTS ∗ relative to previous works are highlighted in teal.
Algorithm 2 SC-MCTS ∗, reasoning
1: expert LLM $\pi_{\text{e}}$ , amateur SLM $\pi_{\text{a}}$ , speculative SLM $\pi_{\text{s}}$ , problem $q$ , reward model $R$ , reward factor statistics ${\mathcal{S}}$ , max iterations $T$ , threshold $l$ , branch $b$ , rollout steps $m$ , roll branch $d$ , weight parameter $\alpha$ , exploration constant $C$
2: $T_{q}\leftarrow$ Initialize-tree $(q)$
3: for $i=1\ldots T$ do
4: $n\leftarrow$ Root $(T_{q})$
5: while $n$ is not leaf node do $\triangleright$ Node selection
6: $n\leftarrow$ $\operatorname*{arg\,max}_{n^{\prime}\in\text{children}(n)}(v_{n^{\prime}}+C \sqrt{\frac{\ln{N_{n}}}{N_{n^{\prime}}}})$ $\triangleright$ Select child node based on UCT
7: end while
8: if $v_{n}\geq l$ then break $\triangleright$ Output solution
9: end if
10: if $n$ is not End of Inference then
11: for $j=1\ldots b$ do $\triangleright$ Thought expansion
12: $n_{j}\leftarrow$ Get-new-child $(A_{n},q,\pi_{\text{e}})$ $\triangleright$ Expand based on previous steps
13: $v_{n_{j}},{\mathcal{S}}\leftarrow$ $R(A_{n_{j}},q,\pi_{\text{e}},\pi_{\text{a}},{\mathcal{S}})$ $\triangleright$ Evaluate contrastive reward and update reward factor statistics
14: end for
15: $n^{\prime}\leftarrow$ $\operatorname*{arg\,max}_{n^{\prime}\in\text{children}(n)}(v_{n^{\prime}})$
16: $v_{\max}\leftarrow$ 0
17: for $k=1\ldots m$ do $\triangleright$ Greedy MC rollout
18: $A,v_{\max}\leftarrow$ Get-next-step-with-best-value $(A,q,\pi_{\text{e}},\pi_{\text{s}},d)$ $\triangleright$ Sample new children using speculative decoding and record the best observed value
19: end for
20: $v_{n^{\prime}}\leftarrow$ $\alpha v_{n^{\prime}}+(1-\alpha)v_{\max}$
21: $N_{n^{\prime}}\leftarrow$ $N_{n^{\prime}}+1$ $\triangleright$ Update value and visit count of the rollout node
22: end if
23: Back-propagate $(n)$ $\triangleright$ Update value of parent nodes (Equation 3)
24: end for
25: $n\leftarrow$ Get-best-node $(T_{q})$ $\triangleright$ Fetch the node with the highest value in the search tree
26: return $A_{n}$
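The UCT selection rule on line 6 of Algorithm 2 can be sketched in Python as follows (the node fields are assumptions for illustration):

```python
import math

def uct_select(node, C=1.0):
    """Pick the child maximizing v + C * sqrt(ln(N_parent) / N_child)."""
    return max(
        node["children"],
        key=lambda c: c["v"] + C * math.sqrt(math.log(node["N"]) / c["N"]),
    )

root = {"N": 10, "children": [
    {"v": 0.5, "N": 8},  # well-explored child with decent value
    {"v": 0.4, "N": 1},  # barely explored child: large exploration bonus
]}
print(uct_select(root))
```

With these counts the exploration term dominates, so the less-visited child is selected despite its lower value, which is exactly the exploration/exploitation trade-off UCT encodes.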
Although we sampled a small portion of the dataset as prior data for reward values, distribution shift may still occur when normalizing reward values during reasoning. Therefore, we use the following algorithm to incrementally update the mean and standard deviation of the online reward distribution:
Algorithm 3 Online incremental update of reward factor statistics
1: new sample $x$ , reward factors $\mathcal{R}(=\{\text{JSD},\text{LL},\text{SE}\})$ , statistics $\{\mu_{r}^{(k)},\sigma_{r}^{(k)},n_{r}^{(k)}\}_{r\in\mathcal{R},k\in\{1,\ldots ,K\}}$ , cluster assignment function $f$
2: for $r\in\mathcal{R}$ do
3: $k^{*}\leftarrow f(x)$ $\triangleright$ Assign sample to cluster
4: $v_{r}\leftarrow r(x)$ $\triangleright$ Compute reward factor value
5: $n_{r}^{(k^{*})}\leftarrow n_{r}^{(k^{*})}+1$ $\triangleright$ Update sample count
6: $\delta\leftarrow v_{r}-\mu_{r}^{(k^{*})}$ $\triangleright$ Compute difference from mean
7: $\mu_{r}^{(k^{*})}\leftarrow\mu_{r}^{(k^{*})}+\delta/n_{r}^{(k^{*})}$ $\triangleright$ Update mean
8: $M_{2}\leftarrow(n_{r}^{(k^{*})}-1)(\sigma_{r}^{(k^{*})})^{2}+\delta(v_{r}-\mu_ {r}^{(k^{*})})$
9: $\sigma_{r}^{(k^{*})}\leftarrow\sqrt{M_{2}/n_{r}^{(k^{*})}}$ $\triangleright$ Update standard deviation
10: end for
11: return updated statistics $\{\mu_{r}^{(k)},\sigma_{r}^{(k)},n_{r}^{(k)}\}_{r\in\mathcal{R},k\in\{1,\ldots ,K\}}$
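Algorithm 3's update for a single reward factor is a standard Welford-style incremental computation; a direct Python rendering:

```python
import math

def update_stats(mu, sigma, n, v):
    """Fold a new reward value v into the running (mu, sigma, n)."""
    n += 1
    delta = v - mu                              # difference from the old mean
    mu += delta / n                             # update mean
    m2 = (n - 1) * sigma**2 + delta * (v - mu)  # recover and extend sum of squares
    return mu, math.sqrt(m2 / n), n             # update standard deviation

mu, sigma, n = 0.0, 0.0, 0
for v in [1.0, 2.0, 3.0, 4.0]:
    mu, sigma, n = update_stats(mu, sigma, n, v)
print(mu, sigma)
```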
## Appendix E Experimental Settings
For reproducibility, the checkpoints can be downloaded from the Hugging Face repositories listed below and used with the hyperparameters below. We utilized 4-bit quantized checkpoints in all experiments, as they incur only around a 2% performance loss while reducing memory usage several-fold and significantly improving inference speed (Frantar et al., 2022). To cleanly capture a single step and convert it into an MCTS node, we used the LLMs' completion mode with greedy sampling; no additional system prompt is required, and the prompts in Appendix F can be applied directly. All experiments were conducted on the exllamav2 inference framework.
### E.1 Checkpoints
| Usage | Models | Links |
| --- | --- | --- |
| Expert | Llama-3.1-405B | https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 |
| Expert | Llama-3.1-70B | https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 |
| Expert | Llama-3-70B | https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ |
| Amateur | Llama-3.1-8B | https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 |
| Amateur | Llama-3-8B | https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit |
| Amateur | Llama-3.2-1B | https://huggingface.co/meta-llama/Llama-3.2-1B |
| OpenAI | GPT-4o | https://platform.openai.com/docs/models/gpt-4o |
| OpenAI | o1-mini | https://platform.openai.com/docs/models/o1 |
Table 3: Checkpoints used in experiments and their links.
### E.2 Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| temperature | 1.0 |
| top-k | 1.0 |
| top-p | 1.0 |
| repetition_penalty | 1.0 |
| max_new_tokens | 200 |
| max_seq_len | 32768 |
| MCTS EOS: Llama-3 family | "\n[" |
| CoT EOS: Llama-3 family | "\n", "<|eot_id|>" |
Table 4: LLM Hyperparameters and EOS tokens used in experiments.
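The settings in Table 4 amount to greedy decoding: with top-k fixed at 1, the highest-probability token is always chosen, so temperature and top-p have no effect. A minimal sketch of these settings as a configuration dict (the key names mirror Table 4 and are illustrative, not the exact ExLlamaV2 API):

```python
# Greedy-decoding settings mirroring Table 4. With top_k = 1 the sampler
# always picks the argmax token, so temperature and top_p are effectively inert.
GENERATION_CONFIG = {
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "max_new_tokens": 200,
    "max_seq_len": 32768,
}

# Stop strings delimiting a single reasoning step (Llama-3 family, per Table 4).
MCTS_STOP = ["\n["]
COT_STOP = ["\n", "<|eot_id|>"]
```

Stopping generation at the MCTS stop string is what carves the completion into one step per node, matching the completion-mode setup described above.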
## Appendix F Blocksworld Dataset
The Blocksworld dataset comprises 600 instances with varying block numbers and plan lengths. Simpler instances have 3-5 blocks, while more complex cases involve up to 25 blocks, introducing additional goals and obstacles. This setup covers a range of problem difficulties for evaluating planning algorithms.
### F.1 Difficulty Settings
Following the settings of LLM Reasoners (Hao et al., 2024), we divide the original 600 instances of Blocksworld (Valmeekam et al., 2024) into two parts: the Easy and Hard settings.
In the Easy Blocksworld setting, we use friendlier demonstration cases: if a problem requires a specific minimum number of steps to solve, we select other problems requiring the same number of steps as in-context demonstration cases. For example, if a problem requires at least 4 steps to solve, we use other 4-step problems as demonstration examples. For each group of problems, we randomly select 10 cases to create a pool of demonstration cases, while the remaining cases form the test set (540 cases in total). During inference, we randomly sample 4-shot demonstration cases from this pool to construct the prompts.
In the Hard Blocksworld setting, we randomly select 10 cases from the entire dataset to create the demonstration pool. These selected cases are then excluded from the test set, leaving a total of 590 cases for testing. During inference, we randomly sample 4-shot demonstration cases from this global pool, without considering the minimum number of actions required for the test case. For example, if a problem requires at least 4 steps to solve, we may still use demonstration cases that require a different number of steps, such as 2 or 12, as there is no restriction based on the number of actions.
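The Easy/Hard pool construction described above can be sketched as follows; the instance schema (a `min_steps` field) is an assumption for illustration, not the actual dataset format:

```python
import random

def build_pools(instances, easy=True, pool_size=10, seed=0):
    """Split instances into demonstration pool(s) and a test set.

    Easy: one pool per minimum-step group, so demos match the test
    problem's difficulty. Hard: a single global pool ignoring step counts.
    `instances` is assumed to be a list of dicts with a 'min_steps' key.
    """
    rng = random.Random(seed)
    if easy:
        groups = {}
        for inst in instances:
            groups.setdefault(inst["min_steps"], []).append(inst)
        pools, test = {}, []
        for steps, group in groups.items():
            demos = rng.sample(group, min(pool_size, len(group)))
            pools[steps] = demos
            test += [i for i in group if i not in demos]
        return pools, test
    # Hard: 10 global demos, excluded from the test set.
    demos = rng.sample(instances, pool_size)
    return demos, [i for i in instances if i not in demos]
```

At inference time, 4-shot prompts would then be drawn with `rng.sample(pool, 4)` from the matching pool (Easy) or the global pool (Hard).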
| domain_intro: |
| --- |
| I am playing with a set of objects. Here are the actions I can do: |
| pick up a block |
| unstack a block from on top of another block |
| put down a block |
| stack a block on top of another block |
| I have the following restrictions on my actions: To perform the Pick Up action, the block must be clear, on the table, and my hand must be empty. Once the Pick Up action is performed, I am holding the block, and my hand is no longer empty. |
| To perform the Unstack action, the block must be clear, on top of another block, and my hand must be empty. Once the Unstack action is performed, I am holding the block, and my hand is no longer empty. |
| To perform the Put Down action, I must be holding a block. Once the Put Down action is performed, the block is on the table, my hand is empty, and the block becomes clear. |
| To perform the Stack action, I must be holding a block, and the block I want to stack it on must be clear. Once the Stack action is performed, the block is on top of another block, my hand is empty, and the block on top is no longer clear. |
Table 5: Normal Blocksworld Task Setting
### F.2 Prompts Settings of Easy Blocksworld
Input Instructions: I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do:
1. Pick up a block
2. Unstack a block from on top of another block
3. Put down a block
4. Stack a block on top of another block
I have the following restrictions on my actions:
1. I can only pick up or unstack one block at a time.
2. I can only pick up or unstack a block if my hand is empty.
3. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.
4. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.
5. I can only unstack a block from on top of another block if the block I am unstacking is clear.
Once I pick up or unstack a block, I am holding the block.
1. I can only put down a block that I am holding.
2. I can only stack a block on top of another block if I am holding the block being stacked.
3. I can only stack a block on top of another block if the block onto which I am stacking the block is clear.
Once I put down or stack a block, my hand becomes empty.
[STATEMENT]
As initial conditions I have that, the red block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table.
My goal is to have that the orange block is on top of the blue block. My plan is as follows:
[End Of STATEMENT]
[PLAN]
unstack the blue block from on top of the orange block
put down the blue block
pick up the orange block
stack the orange block on top of the blue block
[PLAN END]
[STATEMENT]
As initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table.
My goal is to have that the orange block is on top of the red block. My plan is as follows:
[End Of STATEMENT]
Output format:
[PLAN]
[LLM Completion]
[PLAN_END]
Table 6: The Prompt Settings for Easy Blocksworld
### F.3 Prompts Settings of Hard Blocksworld
| Input Instructions: |
| --- |
| I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do: 1. Pick up a block 2. Unstack a block from on top of another block 3. Put down a block 4. Stack a block on top of another block |
| I have the following restrictions on my actions: 1. I can only pick up or unstack one block at a time. 2. I can only pick up or unstack a block if my hand is empty. 3. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. 4. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. 5. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. |
| 1. I can only put down a block that I am holding. 2. I can only stack a block on top of another block if I am holding the block being stacked. 3. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. |
| [STATEMENT] |
| As initial conditions I have that, the blue block is clear, the hand is empty, the blue block is on top of the red block, the red block is on the table, the orange block is on the table and the yellow block is on the table. |
| My goal is to have that the blue block is on top of the orange block. My plan is as follows: |
| [End Of STATEMENT] |
| [PLAN] |
| unstack the blue block from on top of the red block |
| stack the blue block on top of the orange block |
| [PLAN END] |
| [STATEMENT] |
| As initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table. |
| My goal is to have that the orange block is on top of the red block. My plan is as follows: |
| [End Of STATEMENT] |
| Output format: |
| [PLAN] |
| [LLM Completion] |
| [PLAN_END] |
Table 7: The Prompt Settings for Hard Blocksworld
## Appendix G Example Trees for Different Values of the UCT Parameter $c$
Figure 8: Monte Carlo tree with the original UCT parameter $c$
Figure 9: Monte Carlo tree with our optimized UCT parameter $c$
From Figures 8 and 9 we can observe that with our optimized UCT parameter $c$, the MCTS algorithm tends to prioritize exploring new nodes during node selection rather than repeatedly following old paths, which often lead to dead ends.
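The role of $c$ is visible directly in the standard UCT score; a minimal sketch (the function signature is illustrative, not the paper's implementation):

```python
import math

def uct_score(child_value, child_visits, parent_visits, c):
    """Standard UCT: exploitation term plus a c-weighted exploration bonus.

    A larger c inflates the exploration bonus for rarely visited children,
    pushing selection toward new nodes instead of well-trodden paths.
    """
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```

During selection, the child maximizing this score is descended into; tuning $c$ therefore directly trades off how wide versus how deep the resulting tree grows.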
## Appendix H OpenAI API Data
| Difficulty | Model | USD per instance | Total Experiment Cost (USD) |
| --- | --- | --- | --- |
| Easy (0-shot) | GPT-4o | $0.0032 | $1.73 |
| | o1-mini | $0.0136 | $7.34 |
| Easy (4-shot) | GPT-4o | $0.0062 | $3.35 |
| | o1-mini | $0.0171 | $9.23 |
| Hard (0-shot) | GPT-4o | $0.0032 | $1.89 |
| | o1-mini | $0.0177 | $10.44 |
| Hard (4-shot) | GPT-4o | $0.0063 | $3.70 |
| | o1-mini | $0.0172 | $10.15 |
Table 8: OpenAI API cost of experiments on the Blocksworld dataset.
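The totals in Table 8 are consistent with the test-set sizes from Appendix F (540 Easy and 590 Hard instances); for example:

```python
# Cross-check Table 8 totals: per-instance cost times test-set size.
easy_gpt4o_0shot = 0.0032 * 540   # Easy has 540 test instances -> ~$1.73
hard_gpt4o_0shot = 0.0032 * 590   # Hard has 590 test instances -> ~$1.89
easy_o1mini_0shot = 0.0136 * 540  # -> ~$7.34
```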
Figure 10: o1-mini Step Length vs Reasoning Tokens for Zero Shot in Easy Blocksworld
Figure 11: o1-mini Step Length vs Reasoning Tokens for Four Shot in Easy Blocksworld
<details>
<summary>extracted/6087579/fig/Step_Length_vs_Reasoning_Tokens_for_Zero_Shot_Hard_Blocksworld.png Details</summary>

* Range (Shaded Area): ~915 to ~1040 tokens
* **Step Length 10:**
* Average (Purple Line): ~1185 tokens
* Range (Shaded Area): ~1090 to ~1275 tokens
* **Step Length 12:**
* Average (Purple Line): ~1180 tokens (appears to plateau or slightly decrease from step 10)
* Range (Shaded Area): ~980 to ~1375 tokens (widest range)
### Key Observations
1. **Positive Correlation:** There is a strong positive correlation between step length and the average number of reasoning tokens required. Longer step lengths demand more reasoning.
2. **Accelerating Increase:** The rate of increase in reasoning tokens accelerates. The slope is relatively shallow between steps 2-6, steepens significantly between steps 6-10, and then plateaus.
3. **Increasing Variability:** The width of the shaded blue region (variability) broadly increases with step length: it stays within roughly 105-145 tokens for step lengths 2-6, then widens sharply to ~395 tokens at step length 12. This indicates that predictions become less certain or outcomes more variable as the problem complexity (step length) grows.
4. **Potential Saturation/Plateau:** The average reasoning tokens peak at step length 10 (~1185) and show a very slight decrease or plateau at step length 12 (~1180). However, the massive increase in variability at step 12 makes this point less reliable.
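As a quick arithmetic check on the variability claim above, the band widths can be computed directly from the approximate ranges read off the chart (all numbers are rough visual estimates, not exact data):

```python
# Approximate shaded-band ranges (low, high) per step length, as read off
# the chart above; these are visual estimates only.
ranges = {
    2: (690, 835),
    4: (725, 845),
    6: (800, 905),
    8: (915, 1040),
    10: (1090, 1275),
    12: (980, 1375),
}

# Band width (in tokens) at each step length.
widths = {step: hi - lo for step, (lo, hi) in ranges.items()}
print(widths)  # widths: 145, 120, 105, 125, 185, 395
```

The widths hover around 105-145 tokens for step lengths 2-6 and roughly triple by step length 12, which is the widening the observation describes.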
### Interpretation
This chart suggests that for the "Zero Shot Hard Blocksworld" task, the computational or cognitive effort (measured in reasoning tokens) required to solve a problem increases non-linearly with the problem's inherent complexity (step length). The initial increase is modest, but beyond a certain point (around step length 6), the effort required grows rapidly.
The most critical insight is the **dramatic increase in uncertainty** (wider shaded band) for longer step lengths. This implies that while the *average* effort plateaus, the *actual* effort for any given instance of a long-step problem becomes highly unpredictable. Some long problems might be solved with near-average effort, while others could require vastly more reasoning tokens (up to ~1375 at step 12), or occasionally fewer. This high variance could be due to the specific configuration of the blocksworld problem at longer step lengths, where some are deceptively simple and others are exceptionally tricky, even within the "hard" category.
In summary, the data demonstrates that scaling problem length in this domain leads to both higher average cost and, more importantly, a significant loss of predictability in the reasoning process required.
</details>
Figure 12: o1-mini Step Length vs Reasoning Tokens for Zero Shot in Hard Blocksworld
<details>
<summary>extracted/6087579/fig/Step_Length_vs_Reasoning_Tokens_for_Four_Shot_Hard_Blocksworld.png Details</summary>

### Visual Description
## Line Chart with Confidence Interval: Step Length vs Reasoning Tokens for Four Shot Hard Blocksworld
### Overview
The image displays a line chart illustrating the relationship between "Step length" and "Average Reasoning Tokens" for a task or model referred to as "Four Shot Hard Blocksworld." The chart features a central trend line (purple) surrounded by a shaded light blue region, which likely represents a confidence interval or standard deviation around the mean. The overall trend shows a positive, non-linear correlation: as the step length increases, the average number of reasoning tokens required also increases, with the rate of increase accelerating after a step length of 6.
### Components/Axes
* **Chart Title:** "Step Length vs Reasoning Tokens for Four Shot Hard Blocksworld" (centered at the top).
* **X-Axis (Horizontal):**
* **Label:** "Step length" (centered below the axis).
* **Scale:** Linear scale with major tick marks and labels at 2, 4, 6, 8, 10, and 12.
* **Y-Axis (Vertical):**
* **Label:** "Average Reasoning Tokens" (centered to the left of the axis, rotated 90 degrees).
* **Scale:** Linear scale with major tick marks and labels at 800, 1000, 1200, 1400, and 1600.
* **Data Series:**
* A single purple line representing the mean or average value.
* A light blue shaded area surrounding the purple line, representing the variability (e.g., confidence interval, standard error, or standard deviation).
* **Grid:** A light gray grid is present, with vertical lines at each x-axis tick and horizontal lines at each y-axis tick.
### Detailed Analysis
**Trend Verification:** The purple line exhibits a clear upward slope. The slope is relatively gentle from step length 2 to 6 and becomes noticeably steeper from step length 6 to 12.
**Data Point Extraction (Approximate Values):**
* **Step Length 2:** Mean ≈ 720 tokens. Shaded interval ≈ [650, 790].
* **Step Length 4:** Mean ≈ 800 tokens. Shaded interval ≈ [740, 870].
* **Step Length 6:** Mean ≈ 900 tokens. Shaded interval ≈ [840, 950].
* **Step Length 8:** Mean ≈ 1150 tokens. Shaded interval ≈ [1080, 1220].
* **Step Length 10:** Mean ≈ 1360 tokens. Shaded interval ≈ [1240, 1480].
* **Step Length 12:** Mean ≈ 1470 tokens. Shaded interval ≈ [1280, 1660].
**Spatial Grounding & Uncertainty:** The shaded blue region (uncertainty band) is narrowest at the lower step lengths (2-6) and widens significantly as the step length increases, particularly from 8 to 12. This indicates greater variability or less certainty in the average reasoning token count for longer step lengths. The band is roughly symmetric around the central purple line.
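The "gentle, then steeper" shape of the mean curve can be sketched numerically from the approximate means extracted above (again, visual estimates only):

```python
# Approximate mean reasoning tokens per step length, as read off the chart
# above; these are visual estimates, not exact data.
means = {2: 720, 4: 800, 6: 900, 8: 1150, 10: 1360, 12: 1470}

# Per-unit slope of the mean curve between consecutive step lengths.
steps = sorted(means)
slopes = {
    (a, b): (means[b] - means[a]) / (b - a)
    for a, b in zip(steps, steps[1:])
}
for seg, slope in slopes.items():
    print(seg, slope)  # tokens per unit of step length
```

The slopes come out to roughly 40 and 50 tokens per unit on segments 2-4 and 4-6, versus 125, 105, and 55 on segments 6-8, 8-10, and 10-12, consistent with the steepening after step length 6 noted in the trend verification.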
### Key Observations
1. **Positive Correlation:** There is a direct, positive relationship between step length and the average reasoning tokens required.
2. **Non-Linear Increase:** The relationship is not perfectly linear. The increase in reasoning tokens per unit of step length is greater after step length 6 than before it.
3. **Increasing Variability:** The spread of the data (as indicated by the shaded interval) increases substantially with step length. The uncertainty at step length 12 is more than double the uncertainty at step length 2.
4. **No Plateau:** Within the observed range (2 to 12), the curve does not show signs of plateauing; the average continues to rise.
### Interpretation
The data suggests that for the "Four Shot Hard Blocksworld" task, the computational or cognitive effort (proxied by "reasoning tokens") scales more than proportionally with the problem's step length. The initial, gentler slope (steps 2-6) might represent a baseline reasoning overhead, while the steeper slope (steps 6-12) indicates that each additional step beyond a certain complexity threshold requires a disproportionately larger amount of reasoning.
The widening confidence interval is a critical finding. It implies that for longer, more complex problems (higher step length), the model's performance becomes less predictable. Some long problems may still be solved with relatively efficient reasoning, while others may require a vastly greater number of tokens, leading to high variance. This could be due to the model encountering more diverse or difficult sub-problems within longer solution paths.
Beyond depicting the direct relationship between step length and reasoning cost, the chart points to an underlying causal one: increased problem complexity (step length) drives increased resource consumption (tokens). The most plausible hypothesis is that the Blocksworld planning problem exhibits combinatorial or branching complexity that becomes significantly harder to navigate as the solution path lengthens. The chart does not explain the mechanism, but it strongly signals where model limitations or inefficiencies are most pronounced.
</details>
Figure 13: o1-mini Step Length vs Reasoning Tokens for Four Shot in Hard Blocksworld
## Appendix I GPU Usage
In the main experiments, the total GPU usage (measured in GPU hours) for different models on NVIDIA H800 SXM5 80GB GPUs shows a clear progression with model size. For RAP-MCTS, Llama-3 70B requires approximately 420 GPU hours across all steps and difficulty modes, and Llama-3.1 70B requires approximately 450 GPU hours. For SC-MCTS ∗, Llama-3 70B requires approximately 280 GPU hours across all steps and difficulty modes, and Llama-3.1 70B requires approximately 300 GPU hours. For CoT, Llama-3-70B and Llama-3.1-70B each take approximately 7 GPU hours across all steps and difficulty modes, while the Llama-3.1 405B model exhibits significantly higher GPU usage, amounting to approximately 75 GPU hours. In the parameter research and algorithm development phase before the main experiments, we consumed a total of around 800 GPU hours on NVIDIA A100 SXM4 80GB GPUs.
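As a quick arithmetic sketch of the relative cost, using the approximate GPU-hour totals quoted above:

```python
# Approximate GPU-hour totals quoted in this section (H800 SXM5 80GB).
gpu_hours = {
    ("RAP-MCTS", "Llama-3-70B"): 420,
    ("RAP-MCTS", "Llama-3.1-70B"): 450,
    ("SC-MCTS*", "Llama-3-70B"): 280,
    ("SC-MCTS*", "Llama-3.1-70B"): 300,
}

for model in ("Llama-3-70B", "Llama-3.1-70B"):
    rap = gpu_hours[("RAP-MCTS", model)]
    sc = gpu_hours[("SC-MCTS*", model)]
    saving = 100 * (rap - sc) / rap
    print(f"{model}: SC-MCTS* uses {saving:.1f}% fewer GPU hours than RAP-MCTS")
```

For both 70B models this works out to roughly a third fewer GPU hours for SC-MCTS ∗ than for RAP-MCTS.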
## Appendix J Future Work
In future work, we can explore combining metrics-based reward models (such as the three reward models discussed in this paper) with LM-based reward models (such as Critic LLM (McAleese et al., 2024) and Eurus (Yuan et al., 2024b)). Additionally, there is potential to design more general methods for splitting steps in other tasks and datasets, since step-splitting is the most challenging part of generalizing MCTS multi-step reasoning. Although we conducted extensive experiments on the Blocksworld multi-step reasoning dataset, which to our knowledge is the most suitable dataset for studying MCTS multi-step reasoning, some previous works have attempted to use datasets like GSM8K and MATH through extensive adaptation of the datasets themselves; we instead aim to design a more general method from the perspective of step-splitting. We hope that MCTS multi-step reasoning will achieve the same level of generalization as CoT, which remains a fundamental area for future research. Future work can also attempt to combine this approach with the fine-grained compositional reasoning framework (Chen et al., 2024) to further explore the boundaries of MCTS multi-step reasoning capabilities.