# Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
> Correspondence to: Yuxi Xie (xieyuxi@u.nus.edu) and Anirudh Goyal (anirudhgoyal9119@gmail.com).
Abstract
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improvement. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and ARC-C, with substantial increases in accuracy to $81.8\%$ (+ $5.9\%$ ), $34.7\%$ (+ $5.8\%$ ), and $76.4\%$ (+ $15.8\%$ ), respectively. Additionally, our research delves into the training and inference compute tradeoff, providing insights into how our method effectively maximizes performance gains. Our code is publicly available at https://github.com/YuxiXie/MCTS-DPO.
1 Introduction
The development of Large Language Models (LLMs) has seen a pivotal shift towards aligning these models more closely with human values and preferences (Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a). A critical aspect of this process involves the utilization of preference data. There are two prevailing methodologies for incorporating this data: the first entails the construction of a reward model based on preferences, which is then integrated into a Reinforcement Learning (RL) framework to update the policy (Christiano et al., 2017; Bai et al., 2022b); the second, more stable and scalable method, directly applies preferences to update the model’s policy (Rafailov et al., 2023).
In this context, the concept of “iterative” development is key, especially when contrasted with the conventional Reinforcement Learning from Human Feedback (RLHF) paradigm (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a), where the reward model is often trained offline and remains static. An iterative approach proposes a dynamic and continuous refinement process (Zelikman et al., 2022; Gülçehre et al., 2023; Huang et al., 2023; Yuan et al., 2024). It involves a cycle that begins with the current policy, progresses through the collection and analysis of data to generate new preference data, and uses this data to update the policy. This approach underlines the importance of ongoing adaptation in LLMs, highlighting the potential for these models to become more attuned to the complexities of human decision-making and reasoning.
<details>
<summary>x1.png Details</summary>

Diagram of policy tuning at iteration $i$: a prompt data pool feeds Monte Carlo Tree Search, whose nodes are colored by $Q$ value (purple for low, green for high). The search tree and the step-level preferences $\mathcal{D}_i$ extracted from it, together with the policy from the last iteration $\pi_{\theta^{(i-1)}}$, feed the preference learning stage, which outputs the updated policy $\pi_{\theta^{(i)}}$.
</details>
<details>
<summary>x2.png Details</summary>

Line chart of accuracy on ARC-C versus the percentage of training data used. The Online method (green) climbs quickly to roughly 76% and stays there; the Offline method (blue) peaks near 71% at 20% of the data, then degrades into the low 60s before a slight recovery; the SFT Baseline (purple) stays flat at about 60%.
</details>
Figure 1: Monte Carlo Tree Search (MCTS) boosts model performance via iterative preference learning. Each iteration of our framework (left) consists of two stages: MCTS to collect step-level preferences and preference learning to update the policy. Specifically, we use the action values $Q$ estimated by MCTS to assign the preferences, where steps of higher and lower $Q$ values are labeled as positive and negative data, respectively. The scale of $Q$ is visualized in the colormap. On the right, we show the advantage of the online manner of our iterative learning framework using validation accuracy curves as training progresses. The performance on the ARC-C validation set illustrates the effectiveness and efficiency of our proposed method compared to its offline variant.
A compelling illustration of the success of such an iterative approach is AlphaZero (Silver et al., 2017), which achieved superhuman performance across various domains by combining the strengths of neural networks, RL techniques, and Monte Carlo Tree Search (MCTS) (Coulom, 2006; Kocsis and Szepesvári, 2006). MCTS acts as a policy improvement operator, transforming the current policy into an improved one (Grill et al., 2020). The effectiveness of AlphaZero underscores the potential of combining these advanced techniques in LLMs. By integrating MCTS into the iterative process of policy development, it is plausible to achieve significant strides in LLMs, particularly in the realm of reasoning and decision-making aligned with human-like preferences (Zhu et al., 2023; Hao et al., 2023).
The integration of MCTS in collecting preference data to improve the current policy iteratively is nuanced and demands careful consideration. One primary challenge lies in determining the appropriate granularity for applying MCTS. Conventionally, preference data is collected at the instance level. The instance-level approach employs sparse supervision, which can lose important information and may not optimally leverage the potential of MCTS in improving the LLMs (Wu et al., 2023). Another challenge is the reliance of MCTS on a critic or a learned reward function. This function is crucial for providing meaningful feedback on different rollouts generated by MCTS, thus guiding the policy improvement process (Liu et al., 2023a).
Addressing this granularity issue, evidence from LLM research indicates the superiority of process-level or stepwise evaluations over instance-level ones (Lightman et al., 2023; Li et al., 2023; Xie et al., 2023; Yao et al., 2023). Our approach utilizes MCTS rollouts for step-level guidance, aligning with a more granular application of MCTS. Moreover, we employ self-evaluation, where the model assesses its outputs, fostering a more efficient policy improvement pipeline by acting as both policy and critic (Kadavath et al., 2022; Xie et al., 2023). This method streamlines the process and ensures more cohesive policy updates, aligning with the iterative nature of policy enhancement and potentially leading to more robust and aligned LLMs.
To summarize, we propose an algorithm based on Monte Carlo Tree Search (MCTS) that breaks down instance-level preference signals into step-level signals. MCTS allows us to use the current LLM policy to generate preference data, instead of a predetermined set of human preference data, enabling the LLM to receive real-time training signals. During training, we generate sequences of text on the fly and label the preferences via MCTS based on feedback from self-evaluation (Figure 1). To update the LLM policy with the preference data, we use Direct Preference Optimization (DPO) (Rafailov et al., 2023). We extensively evaluate the proposed approach on various arithmetic and commonsense reasoning tasks and observe significant performance improvements. For instance, the proposed approach outperforms the Mistral-7B SFT baseline on GSM8K, MATH, and ARC-C, improving accuracy to $81.8\%$ (+ $5.9\%$ ), $34.7\%$ (+ $5.8\%$ ), and $76.4\%$ (+ $15.8\%$ ), respectively. Further analysis of the training and test compute tradeoff shows that our method pushes performance gains more efficiently than sampling-only approaches.
2 MCTS-Enhanced Iterative Preference Learning
In this paper, we introduce an approach for improving LLM reasoning, centered around an iterative preference learning process. The proposed method begins with an initial policy $\pi_{\theta^{(0)}}$ , and a dataset of prompts $\mathcal{D}_{\mathcal{P}}$ . Each iteration $i$ involves selecting a batch of prompts from $\mathcal{D}_{\mathcal{P}}$ , from which the model, guided by its current policy $\pi_{\theta^{(i-1)}}$ , generates potential responses for each prompt. We then apply a set of dynamically evolving reward criteria to extract preference data $\mathcal{D}_{i}$ from these responses. The model’s policy is subsequently tuned using this preference data, leading to an updated policy $\pi_{\theta^{(i)}}$ , for the next iteration. This cycle of sampling, response generation, preference extraction, and policy tuning is repeated, allowing for continuous self-improvement and alignment with evolving preferences. In addressing the critical aspects of this methodology, two key challenges emerge: the effective collection of preference data and the process of updating the policy post-collection.
We draw upon the concept that MCTS can act as an approximate policy improvement operator, transforming the current policy into an improved one. Our work leverages MCTS to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we incorporate stepwise self-evaluation, continually updating the quality assessment of newly generated data. This process, as depicted in Figure 1, enables MCTS to balance quality exploitation and diversity exploration during preference data sampling at each iteration. Detailed in section 2.1, our approach utilizes MCTS for step-level preference data collection. Once this data is collected, the policy is updated using DPO, as outlined in section 2.2. Our method can be viewed as an online version of DPO, where the updated policy is iteratively employed to collect preferences via MCTS. Our methodology, thus, not only addresses the challenges in preference data collection and policy updating but also introduces a dynamic, iterative framework that significantly enhances LLM reasoning.
**Algorithm: MCTS-Enhanced Iterative Preference Learning.** Given an initial policy $\pi_{\theta^{(0)}}=\pi_{\mathrm{sft}}$, our algorithm iteratively conducts step-level preference data sampling via MCTS and preference learning via DPO to update the policy.

**Input:** $\mathcal{D}_{\mathcal{P}}$: prompt dataset; $q(\cdot\mid x)$: MCTS sampling strategy that constructs a tree-structured set of possible responses given a prompt $x$, where $q_{\pi}$ indicates that the strategy uses the policy $\pi$ for both response generation and self-evaluation; $\ell_{i}(x,y_{w},y_{l};\theta)$: preference learning loss at the $i$-th iteration, whose corresponding sampling policy is $\pi^{(i)}$; $M$: number of iterations; $B$: number of samples per iteration; $T$: average number of steps per sample.

Train $\pi_{\theta}$ on $\mathcal{D}_{\mathcal{P}}$ using step-level preference learning:

1. **for** $i=1$ **to** $M$:
2. $\quad\pi^{(i)}\leftarrow\pi_{\theta}\leftarrow\pi_{\theta^{(i-1)}}$
3. $\quad$Sample a batch of $B$ prompts from $\mathcal{D}_{\mathcal{P}}$ as $\mathcal{D}_{\mathcal{P}}^{(i)}$.
4. $\quad$*MCTS for step-level preference data collection:* for each $x\in\mathcal{D}_{\mathcal{P}}^{(i)}$, elicit a search tree of depth $T$ via $q_{\pi_{\theta}}(\cdot\mid x)$.
5. $\quad$Collect a batch of preferences $\mathcal{D}_{i}=\{\{(x^{j},y_{w}^{(j,t)},y_{l}^{(j,t)})\}_{t=1}^{T}\}_{j=1}^{B}$, where $x^{j}\sim\mathcal{D}_{\mathcal{P}}^{(i)}$, $y_{w}^{(j,t)}\neq y_{l}^{(j,t)}\sim q_{\pi_{\theta}}(\cdot\mid x^{j})$, and $y_{w}^{(j,t)}$ and $y_{l}^{(j,t)}$ are the nodes at depth $t$ with the highest and lowest $Q$ values, respectively, among all children of their parent node.
6. $\quad$*Preference learning for policy improvement:* optimize $\theta$ by minimizing $J(\theta)=\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}_{i}}\,\ell_{i}(x,y_{w},y_{l};\theta)$ to obtain the updated policy $\pi_{\theta^{(i)}}$.
7. **end for**
8. $\pi_{\theta}\leftarrow\pi_{\theta^{(M)}}$

**Output:** policy $\pi_{\theta}$
2.1 MCTS for Step-Level Preference
To transform instance-level rewards into granular, step-level signals, we dissect the reasoning process into discrete steps, each represented by a token sequence. We define the state at step $t$, $s_{t}$, as the prefix of the reasoning chain; adding a new reasoning step $a$ transitions the state to $s_{t+1}$, the concatenation of $s_{t}$ and $a$. Utilizing the model’s current policy $\pi_{\theta}$, we sample candidate steps from its probability distribution $\pi_{\theta}(a\mid x,s_{t})$, with $x$ being the task’s input prompt. (For tasks such as MATH, where the initial policy performs poorly, we also include the ground-truth reasoning steps for training; we detail the step definition for different tasks with examples in Appendices C and D.) MCTS serves as an approximate policy improvement operator by leveraging its look-ahead capability to predict the expected future reward. This prediction is refined through stepwise self-evaluation (Kadavath et al., 2022; Xie et al., 2023), enhancing process consistency and decision accuracy. The tree-structured search supports a balance between exploring diverse possibilities and exploiting promising paths, essential for navigating the vast search space in LLM reasoning.
The MCTS process begins from a root node $s_{0}$, the sentence start or an incomplete response, and unfolds in three iterative stages, selection, expansion, and backup, which we detail below.
Select.
The objective of this phase is to identify nodes that balance search quality and computational efficiency. The selection is guided by two key variables: $Q(s_{t},a)$ , the value of taking action $a$ in state $s_{t}$ , and $N(s_{t})$ , the visitation frequency of state $s_{t}$ . These variables are crucial for updating the search strategy, as explained in the backup section. To navigate the trade-off between exploring new nodes and exploiting visited ones, we employ the Predictor + Upper Confidence bounds applied to Trees (PUCT) (Rosin, 2011). At node $s_{t}$ , the choice of the subsequent node follows the formula:
$$
s_{t+1}^{*}=\arg\max_{s_{t+1}}\Bigl[Q(s_{t},a)+c_{\mathrm{puct}}\cdot p(a\mid s_{t})\frac{\sqrt{N(s_{t})}}{1+N(s_{t+1})}\Bigr] \tag{1}
$$
where $p(a\mid s_{t})=\pi_{\theta}(a\mid x,s_{t})/|a|^{\lambda}$ denotes the policy $\pi_{\theta}$ ’s probability distribution for generating a step $a$ , adjusted by a $\lambda$ -weighted length penalty to prevent overly long reasoning chains.
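As a concrete illustration, the selection rule of Eq. (1) can be sketched as follows. The function names, the default $c_{\mathrm{puct}}$, and the default $\lambda$ are illustrative choices for this sketch, not values from the released implementation.

```python
import math

def puct_score(q, prior, step_len, n_parent, n_child, c_puct=1.25, lam=0.5):
    """PUCT score from Eq. (1), with the lambda-weighted length penalty
    applied to the policy prior. Default constants are illustrative."""
    p = prior / (step_len ** lam)  # p(a|s_t) = pi_theta(a|x,s_t) / |a|^lambda
    return q + c_puct * p * math.sqrt(n_parent) / (1 + n_child)

def select_child(children, n_parent, c_puct=1.25, lam=0.5):
    """children: list of (q, prior, step_len, n_child) tuples for the
    candidate next steps; returns the index of the PUCT-maximizing child."""
    scores = [puct_score(q, p, l, n, n_parent=n_parent,
                         c_puct=c_puct, lam=lam) if False else
              puct_score(q, p, l, n_parent, n, c_puct, lam)
              for (q, p, l, n) in children]
    return scores.index(max(scores))
```

Note how an unvisited child with a high prior can outrank a visited child with a higher $Q$: the $1/(1+N(s_{t+1}))$ factor keeps exploration alive early in the search.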
Expand.
Expansion occurs at a leaf node during the selection process to integrate new nodes and assess rewards. The reward $r(s_{t},a)$ for executing step $a$ in state $s_{t}$ is quantified by the difference between the rewards of states $R(s_{t})$ and $R(s_{t+1})$, highlighting the advantage of action $a$ at $s_{t}$. As defined in Eq. (2), reward computation merges outcome correctness $\mathcal{O}$ with self-evaluation $\mathcal{C}$. We assign outcome correctness of $1$, $-1$, and $0$ for correct terminal, incorrect terminal, and unfinished intermediate states, respectively. Following Xie et al. (2023), we define self-evaluation as in Eq. (3), where $\mathrm{A}$ denotes the confidence score in token-level probability for the option indicating correctness (we show an example evaluation prompt in Table 6). Future rewards are anticipated by simulating upcoming scenarios through roll-outs, following the selection and expansion process until a terminal state is reached, i.e., when the response is complete or exceeds the maximum length.
$$
R(s_{t})=\mathcal{O}(s_{t})+\mathcal{C}(s_{t}) \tag{2}
$$
$$
\mathcal{C}(s_{t})=\pi_{\theta}(\mathrm{A}\mid{\mathrm{prompt}}_{\mathrm{eval}},x,s_{t}) \tag{3}
$$
Backup.
Once a terminal state is reached, we carry out a bottom-up update from the terminal node back to the root. We update the visit count $N$ , the state value $V$ , and the transition value $Q$ :
$$
Q(s_{t},a)\leftarrow r(s_{t},a)+\gamma V(s_{t+1}) \tag{4}
$$
$$
V(s_{t})\leftarrow\sum_{a}N(s_{t+1})Q(s_{t},a)/\sum_{a}N(s_{t+1}) \tag{5}
$$
$$
N(s_{t})\leftarrow N(s_{t})+1 \tag{6}
$$
where $\gamma$ is the discount for future state values.
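A minimal sketch of this bottom-up pass, assuming a simple node structure (the `Node` class and its field names are hypothetical, and terminal rewards are assumed to have been written into `r` and `V` during expansion):

```python
class Node:
    """Hypothetical search-tree node; fields mirror the quantities
    in Eqs. (4)-(6)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.N = 0    # visit count N(s)
        self.V = 0.0  # state value V(s)
        self.Q = 0.0  # transition value Q(s_parent, a) of the edge into s
        self.r = 0.0  # step reward r(s_parent, a)

def backup(leaf, gamma=1.0):
    """Walk from a terminal node back to the root, applying Eqs. (4)-(6)."""
    node = leaf
    node.N += 1                                   # Eq. (6) at the leaf
    while node.parent is not None:
        parent = node.parent
        node.Q = node.r + gamma * node.V          # Eq. (4)
        parent.N += 1                             # Eq. (6)
        # Eq. (5): V(s_t) is the visit-weighted average of children's Q
        total = sum(ch.N for ch in parent.children)
        parent.V = sum(ch.N * ch.Q for ch in parent.children) / total
        node = parent
```

Because $V(s_t)$ is a visit-weighted average, frequently visited (hence more trusted) children dominate the parent's value estimate.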
For each step in the response generation, we conduct $K$ iterations of MCTS to construct the search tree while updating the $Q$ values and visit counts $N$. To balance the diversity, quality, and efficiency of tree construction, we initialize the search breadth as $b_{1}$ and anneal it to a smaller $b_{2}<b_{1}$ for subsequent steps. We use the resulting $Q$ value of each candidate step to label its preference, where higher $Q$ values indicate preferred next steps. For a resulting search tree of depth $T$, we obtain $T$ pairs of step-level preference data: at each tree depth, we select the candidate steps with the highest and lowest $Q$ values as the positive and negative samples, respectively. The parent node at each tree depth is selected as the node with the highest product of its visit count and the range of its children’s visit counts, indicating both the quality and diversity of its generations.
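The pair construction described above can be sketched as follows, assuming each tree depth yields a list of `(step_text, q_value)` candidates under the selected parent (the names and data layout are illustrative, not the authors' code):

```python
def extract_preferences(levels):
    """Turn a depth-T search tree into up to T step-level preference pairs.

    levels: list over tree depths; each entry is the list of
    (step_text, q_value) candidates sharing the selected parent at
    that depth. A pair is emitted only when the Q values differ,
    since equal Q values carry no preference signal.
    """
    pairs = []
    for candidates in levels:
        ranked = sorted(candidates, key=lambda c: c[1])
        lowest, highest = ranked[0], ranked[-1]
        if highest[1] > lowest[1]:
            pairs.append({"chosen": highest[0], "rejected": lowest[0]})
    return pairs
```

For example, a depth-2 tree whose second level has tied $Q$ values yields a single pair from the first level.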
2.2 Iterative Preference Learning
Given the step-level preferences collected via MCTS, we tune the policy via DPO (Rafailov et al., 2023). Considering the noise in the preference labels determined by $Q$ values, we employ the conservative version of DPO (Mitchell, 2023) and use the visit counts simulated in MCTS to apply adaptive label smoothing to each preference pair. Using the shorthand $h_{\pi_{\theta}}^{y_{w},y_{l}}=\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\mathrm{ref}}(y_{w}\mid x)}-\log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\mathrm{ref}}(y_{l}\mid x)}$, at the $i$-th iteration, given a batch of preference data $\mathcal{D}_{i}$ sampled with the latest policy $\pi_{\theta^{(i-1)}}$, we denote the policy objective $\ell_{i}(\theta)$ as follows:
$$
\ell_{i}(\theta)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}_{i}}\Bigl[(1-\alpha_{x,y_{w},y_{l}})\log\sigma(\beta\,h_{\pi_{\theta}}^{y_{w},y_{l}})+\alpha_{x,y_{w},y_{l}}\log\sigma(-\beta\,h_{\pi_{\theta}}^{y_{w},y_{l}})\Bigr] \tag{7}
$$
where $y_{w}$ and $y_{l}$ represent the step-level preferred and dispreferred responses, respectively, and the hyperparameter $\beta$ scales the KL constraint. Here, $\alpha_{x,y_{w},y_{l}}$ is a label smoothing variable calculated using the visit counts at the corresponding states of the preference data $y_{w}$ , $y_{l}$ in the search tree:
$$
\alpha_{x,y_{w},y_{l}}=\frac{1}{N(x,y_{w})/N(x,y_{l})+1} \tag{8}
$$
where $N(x,y_{w})$ and $N(x,y_{l})$ are the visit counts of the states reached by generating $y_{w}$ and $y_{l}$, respectively, from their shared previous state given input $x$.
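Putting Eqs. (7) and (8) together, a per-pair scalar version of the conservative objective might look like the following sketch (the actual objective is batched over $\mathcal{D}_i$ and differentiated with respect to $\theta$; the function names and the default $\beta$ are illustrative):

```python
import math

def smoothing_alpha(n_w, n_l):
    """Adaptive label smoothing from Eq. (8):
    alpha = 1 / (N(x, y_w)/N(x, y_l) + 1)."""
    return 1.0 / (n_w / n_l + 1.0)

def conservative_dpo_loss(h, n_w, n_l, beta=0.1):
    """Per-pair conservative DPO loss from Eq. (7).

    h    -- the log-ratio margin h_{pi_theta}^{y_w, y_l}
    n_w  -- MCTS visit count at the preferred step's state
    n_l  -- MCTS visit count at the dispreferred step's state
    beta -- KL-constraint scale (0.1 here is an illustrative default)
    """
    alpha = smoothing_alpha(n_w, n_l)
    log_sigmoid = lambda z: -math.log1p(math.exp(-z))
    return -((1 - alpha) * log_sigmoid(beta * h)
             + alpha * log_sigmoid(-beta * h))
```

When the winner was visited far more often than the loser ($n_w \gg n_l$), $\alpha$ approaches $0$ and the loss reduces to standard DPO; when visit counts are equal, $\alpha = 0.5$ and the label is maximally smoothed.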
After optimization, we obtain the updated policy $\pi_{\theta^{(i)}}$ and repeat the data collection process in Section 2.1 to iteratively update the LLM policy. We outline the full algorithm of our MCTS-enhanced Iterative Preference Learning in Algorithm 2.
3 Theoretical Analysis
Our approach can be viewed as an online version of DPO, where we iteratively use the updated policy to sample preferences via MCTS. In this section, we provide theoretical analysis to interpret the advantages of our online learning framework compared to the conventional alignment techniques that critically depend on offline preference data. We review the typical RLHF and DPO paradigms in Appendix B.
We now consider the following abstract formulation for clean theoretical insights into our online setting of preference learning. Given a prompt $x$, there exist $n$ possible suboptimal responses $Y=\{\bar{y}_{1},\dots,\bar{y}_{n}\}$ and an optimal response $y^{*}$. As specified in Equation 7, at the $i$-th iteration, a pair of responses $(y,y^{\prime})$ is sampled from some sampling policy $\pi^{(i)}$ without replacement, so that $y\neq y^{\prime}$, as $y\sim\pi^{(i)}(\cdot\mid x)$ and $y^{\prime}\sim\pi^{(i)}(\cdot\mid x,y)$. These are then labeled $y_{w}$ and $y_{l}$ according to the preference. Define $\Theta$ to be the set of all global optimizers of the preference loss for all $M$ iterations, i.e., for any $\theta\in\Theta$, $\ell_{i}(\theta)=0$ for all $i\in\{1,2,\dots,M\}$. Similarly, let $\theta^{(i)}$ be a parameter vector such that $\ell_{j}(\theta^{(i)})=0$ for all $j\in\{1,2,\dots,i-1\}$ for $i\geq 1$, whereas $\theta^{(0)}$ is the initial parameter vector.
This abstract formulation covers both the offline and online settings. The offline setting of previous works is obtained by setting $\pi^{(i)}=\pi$ for some fixed distribution $\pi$. The online setting is obtained by setting $\pi^{(i)}=\pi_{\theta^{(i-1)}}$, where $\pi_{\theta^{(i-1)}}$ is the latest policy at the beginning of the $i$-th iteration.
The following theorem shows that the offline setting can fail with high probability if the sampling policy $\pi^{(i)}$ differs too much from the current policy $\pi_{\theta^{(i-1)}}$ :
**Theorem 3.1 (Offline setting can fail with high probability)**
*Let $\pi$ be any distribution for which there exists $\bar{y}\in Y$ such that $\pi(\bar{y}\mid x),\pi(\bar{y}\mid x,y)\leq\epsilon$ for all $y\in(Y\setminus\{\bar{y}\})\cup\{y^{*}\}$, and $\pi_{\theta^{(i-1)}}(\bar{y}\mid x)\geq c$ for some $i\in\{1,2,\dots,M\}$. Set $\pi^{(i)}=\pi$ for all $i\in\{1,2,\dots,M\}$. Then, there exists $\theta\in\Theta$ such that, with probability at least $1-2\epsilon M$ (over the samples of $\pi^{(i)}=\pi$), the following holds: $\pi_{\theta}(y^{*}\mid x)\leq 1-c$.*
If the current policy and the sampling policy differ too much, it is possible that $\epsilon=0$ and $c\approx 1.0$, in which case Theorem 3.1 concludes that $\pi_{\theta}(y^{*}\mid x)\approx 0$ with probability $1$ for any number of iterations $M$. When $\epsilon\neq 0$, the lower bound on the failure probability decreases towards zero as $M$ increases. Thus, it is important to ensure that $\epsilon\neq 0$ and that $\epsilon$ is not too low. This is achieved by the online setting, i.e., $\pi^{(i)}=\pi_{\theta^{(i-1)}}$. Theorem 3.1 therefore motivates the online setting, and Theorem 3.2 confirms that it indeed avoids this failure case.
**Theorem 3.2 (Online setting can avoid offline failure case)**
*Let $\pi^{(i)}=\pi_{\theta^{(i-1)}}$ . Then, for any $\theta∈\Theta$ , it holds that $\pi_{\theta}(y^{*}\mid x)=1$ if $M≥ n+1$ .*
See Appendix B for the proofs of Theorems 3.1 and 3.2. As the theorems suggest, an even better sampling policy would use both the latest policy and the optimal policy for preference sampling. However, since we cannot access the optimal policy $\pi^{*}$ in practice, we adopt online DPO, sampling from the latest policy $\pi_{\theta^{(i-1)}}$. The key insight of our iterative preference learning approach is that online DPO provably converges to an optimal policy even though that policy is inaccessible for sampling outputs. We provide further discussion and additional insights in Appendix B.
4 Experiments
We evaluate the effectiveness of MCTS-enhanced iterative preference learning on arithmetic and commonsense reasoning tasks. We employ Mistral-7B (Jiang et al., 2023) as the base pre-trained model and conduct supervised training on the Arithmo dataset (https://huggingface.co/datasets/akjindal53244/Arithmo-Data), which comprises approximately $540$K mathematical and coding problems. Detailed information regarding the task formats, specific implementation procedures, and parameter settings of our experiments can be found in Appendix C.
Datasets.
We aim to demonstrate the effectiveness and versatility of our approach by focusing on two types of reasoning: arithmetic and commonsense reasoning. For arithmetic reasoning, we utilize two datasets: GSM8K (Cobbe et al., 2021), which consists of grade school math word problems, and MATH (Hendrycks et al., 2021), featuring challenging competition math problems. Specifically, in the GSM8K dataset, we assess both chain-of-thought (CoT) and program-of-thought (PoT) reasoning abilities. We integrate the training data from GSM8K and MATH to construct the prompt data for our preference learning framework, aligning with a subset of the Arithmo data used for Supervised Fine-Tuning (SFT). This approach allows us to evaluate whether our method enhances reasoning abilities on specific arithmetic tasks. For commonsense reasoning, we use four multiple-choice datasets: ARC (easy and challenge splits) (Clark et al., 2018), focusing on science exams; AI2Science (elementary and middle splits) (Clark et al., 2018), containing science questions from student assessments; OpenBookQA (OBQA) (Mihaylov et al., 2018), which involves open book exams requiring broad common knowledge; and CommonSenseQA (CSQA) (Talmor et al., 2019), featuring commonsense questions necessitating prior world knowledge. The diversity of these datasets, with different splits representing various grade levels, enables a comprehensive assessment of our method’s generalizability in learning various reasoning tasks through self-distillation. Performance evaluation is conducted using the corresponding validation sets of each dataset. Furthermore, we employ an unseen evaluation using the validation set of an additional dataset, SciQ (Welbl et al., 2017), following the approach of Liu et al. (2023b), to test our model’s ability to generalize to novel reasoning contexts.
Baselines.
Our study compares our method against several prominent approaches, as well as against two variants of our own framework: instance-level iterative preference learning and offline MCTS-enhanced learning. Instance-level sampling serves as the counterpart to step-level preference collection via MCTS; for a fair comparison, we also apply self-evaluation and correctness assessment in instance-level sampling and control the number of samples to a compute budget comparable to MCTS. The offline variant samples with the initial policy rather than the updated policy at each iteration.
We contrast our approach with the Self-Taught Reasoner (STaR) (Zelikman et al., 2022), an iterated learning model based on instance-level rationale generation, and Crystal (Liu et al., 2023b), an RL-tuned model focused on knowledge introspection in commonsense reasoning. Considering the variation in base models used by these methods, we include comparisons with Direct Tuning, which fine-tunes base models directly, bypassing chain-of-thought reasoning. For arithmetic reasoning tasks, our analysis includes Language Model Self-Improvement (LMSI) (Huang et al., 2023), a self-training method that uses self-consistency to gather positive data, and Math-Shepherd (Wang et al., 2023a), which integrates process supervision within Proximal Policy Optimization (PPO). To account for differences in base models and experimental setups across these methods, we also report the performance of the SFT model underlying each respective approach.
Table 1: Result comparison (accuracy $\%$ ) on arithmetic tasks. We supervised fine-tune the base model Mistral-7B on Arithmo data, while Math-Shepherd (Wang et al., 2023a) uses MetaMath (Yu et al., 2023b) for SFT. We highlight the advantages of our approach via conceptual comparison with other methods, where NR, OG, OF, and NS denote "w/o Reward Model", "On-policy Generation", "Online Feedback", and "w/ Negative Samples", respectively.
| Approach | Base Model | NR | OG | OF | NS | GSM8K | MATH |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LMSI | PaLM-540B | ✓ | ✓ | ✗ | ✗ | $73.5$ | $-$ |
| SFT (MetaMath) | Mistral-7B | $-$ | $-$ | $-$ | $-$ | $77.7$ | $28.2$ |
| Math-Shepherd | Mistral-7B | ✗ | ✓ | ✗ | ✓ | $\mathbf{84.1}$ | $33.0$ |
| SFT (Arithmo) | Mistral-7B | $-$ | $-$ | $-$ | $-$ | $75.9$ | $28.9$ |
| MCTS Offline-DPO | Mistral-7B | ✓ | ✗ | ✗ | ✓ | $79.9$ | $31.9$ |
| Instance-level Online-DPO | Mistral-7B | ✓ | ✓ | ✓ | ✓ | $79.7$ | $32.9$ |
| Ours | Mistral-7B | ✓ | ✓ | ✓ | ✓ | $80.7$ | $32.2$ |
| Ours (w/ G.T.) | Mistral-7B | ✓ | ✓ | ✓ | ✓ | $81.8$ | $\mathbf{34.7}$ |
*[x3.png: line chart "ARC-C", accuracy vs. training data % (10–50). Step-Level (Online) rises from 72.2% to a peak of 76.4% at 30% of the data; Instance-Level (Online) rises from 66.5% to a peak of 75.2% at 40%; Step-Level (Offline) peaks at 70.8% at 20% before declining to 66.5%; the SFT Baseline is constant at 60.6%.]*
Figure 2: Performance on the validation set of ARC-C via training with different settings.
Table 2: Result comparisons (accuracy $\%$ ) on commonsense reasoning tasks. The results based on GPT-3-curie (Brown et al., 2020) and T5 (Raffel et al., 2020) are reported from Liu et al. (2023b). For CSQA, we also include the GPT-J (Wang and Komatsuzaki, 2021) results reported by Zelikman et al. (2022). We follow Liu et al. (2023b) in combining the training data of ARC, AI2Sci, OBQA, and CSQA for training, while STaR (Zelikman et al., 2022) uses only CSQA.
| Approach | Base Model | NR | OG | OF | NS | ARC-c | AI2Sci-m | CSQA | SciQ | Train Data Used ( $\%$ ) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoT Tuning | GPT-3-curie (6.7B) | ✓ | ✗ | ✗ | ✗ | $-$ | $-$ | $56.8$ | $-$ | $100$ |
| Direct Tuning | GPT-J (6B) | ✓ | ✗ | ✗ | ✗ | $-$ | $-$ | $60.0$ | $-$ | $100$ |
| STaR | GPT-J (6B) | ✓ | ✓ | ✓ | ✗ | $-$ | $-$ | $72.5$ | $-$ | $86.7$ |
| Direct Tuning | T5-11B | ✓ | ✗ | ✗ | ✗ | $72.9$ | $84.0$ | $82.0$ | $83.2$ | $100$ |
| Crystal | T5-11B | ✗ | ✓ | ✓ | ✓ | $73.2$ | $84.8$ | $\mathbf{82.3}$ | $85.3$ | $100$ |
| SFT Base (Arithmo) | Mistral-7B | $-$ | $-$ | $-$ | $-$ | $60.6$ | $70.9$ | $54.1$ | $80.8$ | $-$ |
| Direct Tuning | Mistral-7B | ✓ | ✗ | ✗ | ✗ | $73.9$ | $85.2$ | $79.3$ | $86.4$ | $100$ |
| MCTS Offline-DPO | Mistral-7B | ✓ | ✗ | ✗ | ✓ | $70.8$ | $82.6$ | $68.5$ | $87.4$ | $19.2$ |
| Instance-level Online-DPO | Mistral-7B | ✓ | ✓ | ✓ | ✓ | $75.3$ | $87.3$ | $63.1$ | $87.6$ | $45.6$ |
| Ours | Mistral-7B | ✓ | ✓ | ✓ | ✓ | $\mathbf{76.4}$ | $\mathbf{88.2}$ | $74.8$ | $\mathbf{88.5}$ | $47.8$ |
4.1 Main Results
Arithmetic Reasoning. Table 1 presents a comparative analysis of performance gains on arithmetic reasoning tasks. Our method demonstrates substantial improvements, notably on GSM8K, from $75.9\%$ to $81.8\%$ , and on MATH, from $28.9\%$ to $34.7\%$ . Compared to Math-Shepherd, which also utilizes process supervision in preference learning, our approach achieves similar performance enhancements without training separate reward or value networks. This suggests the potential of integrating trained reward model signals into our MCTS stage to further augment performance. Furthermore, we observe a significant performance gain on MATH when incorporating the ground-truth solutions in the MCTS process for preference data collection, illustrating an effective way to refine preference data quality with G.T. guidance.
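To make the step-level preference collection concrete, the following is a minimal sketch of how preference pairs might be extracted from an MCTS tree with optional G.T. guidance. The `Node`/`collect_pairs` names, the $Q$-margin heuristic, and the `gt_bonus` weighting are illustrative assumptions, not the repository's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step_text: str
    q: float                   # value estimate from MCTS backup
    on_gt_path: bool = False   # marked when G.T. solutions guide the search (assumed)
    children: list = field(default_factory=list)

def collect_pairs(root, margin=0.1, gt_bonus=0.5):
    """Walk the tree; at each node, pair the best- and worst-scored sibling
    steps as a (chosen, rejected) example if their score gap exceeds the margin."""
    def score(c):
        # Hypothetical G.T. guidance: reward steps lying on a ground-truth path.
        return c.q + (gt_bonus if c.on_gt_path else 0.0)
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if len(node.children) < 2:
            continue
        ranked = sorted(node.children, key=score)
        worst, best = ranked[0], ranked[-1]
        if score(best) - score(worst) > margin:
            pairs.append((best.step_text, worst.step_text))
    return pairs
```

Because siblings share the same prefix of reasoning steps, each pair isolates a single step-level decision, which is what distinguishes this data from instance-level preference pairs.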
*[x4.png: two line charts for ARC-C. Top (pass rate vs. # checkpoints, 0–7): Iterative Learning (Pass@1) climbs from 72.2 to 76.2, Iterative Learning (Cumulative) from 60.6 to 93.5, Sampling Only (Cumulative) from 60.6 to 94.1; the SFT Baseline is constant at 60.6. Bottom (accuracy vs. k): Sampling Only (SC@k) rises from 61.9 at k=5 to 74.1 at k=60, with diminishing returns beyond k=30, vs. the 60.6 baseline.]*
*[x5.png: two line charts for SciQ. Top (vs. # checkpoints): Iterative Learning (Pass@1) rises from 80.8 to 88.6, Iterative Learning (Cumulative) reaches 96.3, Sampling Only (Cumulative) reaches 91.5; the SFT Baseline is constant at 80.8. Bottom (vs. k): Sampling Only (SC@k) plateaus around 84.4 by k=65, vs. the 80.8 baseline.]*
*[x6.png: two line charts for MATH. Top (pass rate vs. # checkpoints, 0–10): the cumulative curves climb steeply (Iterative Learning to about 57.9, Sampling Only from 34.9 to 57.1) while Pass@1 performance hovers near 29–32; the SFT Baseline is constant at 29.0. Bottom (accuracy vs. k): Sampling Only (SC@k) rises from 30.0 at k=4 to 35.1 at k=10, vs. the 29.0 baseline.]*
Figure 3: Training- vs. test-time compute scaling on the ARC-C, SciQ, and MATH evaluation sets. The cumulative pass rate of our iterative learning method can be seen as the pass rate of an ensemble of different model checkpoints. We use greedy decoding to obtain the inference-time performance of our iterative learning method.
Table 3: Ablation of “EXAMPLE ANSWER” in self-evaluation on GSM8K, MATH, and ARC-C. We report AUC and accuracy ( $\%$ ) to compare the discriminative abilities of self-evaluation scores.
| Setting | GSM8K AUC | GSM8K Acc. | MATH AUC | MATH Acc. | ARC-C AUC | ARC-C Acc. |
| --- | --- | --- | --- | --- | --- | --- |
| w/ example answer | $74.7$ | $72.5$ | $76.6$ | $48.8$ | $65.2$ | $57.5$ |
| w/o example answer | $62.0$ | $69.5$ | $48.1$ | $42.3$ | $55.8$ | $48.4$ |
Commonsense Reasoning. In Table 2, we report the performance on commonsense reasoning tasks, where our method shows consistent improvements. Notably, we achieve absolute accuracy increases of $2.5\%$ , $3.0\%$ , and $2.1\%$ on ARC-Challenge (ARC-C), AI2Sci-Middle (AI2Sci-M), and SciQ, respectively, surpassing the results of direct tuning. However, on tasks like OBQA and CSQA, our method, which focuses on refining intermediate reasoning, is less effective than direct tuning. Despite significant improvements over the Supervised Fine-Tuning (SFT) baseline (for instance, from $59.8\%$ to $79.2\%$ on OBQA, and from $54.1\%$ to $74.8\%$ on CSQA), the gains are modest relative to direct tuning. This discrepancy may stem from the base model's lack of task-specific knowledge: eliciting intermediate reasoning chains can introduce additional uncertainty into model generations, leading to incorrect predictions. We examine this hallucination issue and its implications in the qualitative analysis in Section 4.2.
4.2 Further Analysis
Training- vs. Test- Time Compute Scaling. Our method integrates MCTS with preference learning, aiming to enhance both preference quality and policy reasoning via step-level alignment. We analyze the impact of training-time compute scaling versus increased inference-time sampling.
We measure success by the pass rate, indicating the percentage of correctly elicited answers. Figure 3 displays the cumulative pass rate at each checkpoint, aggregating the pass rates up to that point. For test-time scaling, we increase the number of sampled reasoning chains. Additionally, we compare the inference performance of our checkpoints with a sampling-only method, self-consistency, to assess their potential performance ceilings. The pass rate curves on ARC-C, SciQ, and MATH datasets reveal that our MCTS-enhanced approach yields a higher training compute scaling exponent. This effect is particularly pronounced on the unseen SciQ dataset, highlighting our method’s efficiency and effectiveness in enhancing specific reasoning abilities with broad applicability. Inference-time performance analysis shows higher performance upper bounds of our method on ARC-C and SciQ. For instance, while self-consistency on SciQ plateaus at around $84\%$ , our framework pushes performance to $88.6\%$ . However, on MATH, the sampling-only approach outperforms training compute scaling: more sampling consistently enhances performance beyond $35\%$ , whereas post-training performance hovers around $32.2\%$ . This observation suggests that in-domain SFT already aligns the model well with task-specific requirements.
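The two scaling metrics compared here can be made concrete with a small sketch. The helper names are ours, and per-problem correctness flags are assumed given; the cumulative pass rate is the union of problems solved by any checkpoint so far, while SC@k is a majority vote over k sampled answers.

```python
from collections import Counter

def pass_rate(per_checkpoint_hits):
    """per_checkpoint_hits[c][i] is True iff checkpoint c solved problem i."""
    return [sum(hits) / len(hits) for hits in per_checkpoint_hits]

def cumulative_pass_rate(per_checkpoint_hits):
    """Fraction of problems solved by *any* checkpoint up to c
    (i.e., the pass rate of the checkpoint ensemble)."""
    n = len(per_checkpoint_hits[0])
    solved = [False] * n
    rates = []
    for hits in per_checkpoint_hits:
        solved = [s or h for s, h in zip(solved, hits)]
        rates.append(sum(solved) / n)
    return rates

def self_consistency(answers):
    """SC@k: majority vote over k sampled final answers for one problem."""
    return Counter(answers).most_common(1)[0][0]
```

Under this reading, a higher training compute scaling exponent means the cumulative curve grows faster per checkpoint than SC@k grows per extra sample.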
Functions of Self-Evaluation Mechanism.
As illustrated in Section 2.1, the self-evaluation score inherently revises the $Q$ value estimation for subsequent preference data collection. In practice, we find that the ground-truth information, i.e., the "EXAMPLE ANSWER" in Table 6, is crucial for reliable self-evaluation. Table 3 compares the score distributions and discriminative abilities when including vs. excluding this ground-truth information. With it, the accuracy of self-evaluation improves significantly across the GSM8K, MATH, and ARC-C datasets.
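A minimal sketch of the two quantities involved here, under illustrative assumptions: the linear blending weight `alpha` and the function names are ours, not the paper's notation, and the AUC is computed as the pairwise win rate of correct over incorrect steps.

```python
def revised_q(outcome_reward, self_eval_score, alpha=0.5):
    """Blend outcome validation (0/1 from answer checking) with a stepwise
    self-evaluation score in [0, 1]; the 50/50 weighting is an assumption."""
    return alpha * outcome_reward + (1 - alpha) * self_eval_score

def auc(scores, labels):
    """Probability that a correct step outscores an incorrect one (ties count
    half) -- the discriminative metric reported in Table 3."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On this view, Table 3 says that conditioning self-evaluation on the EXAMPLE ANSWER shifts the score distribution so that correct and incorrect steps separate more cleanly.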
Ablation Study.
We ablate the impact of step-level supervision signals and the online learning aspect of our MCTS-based approach. Tables 1 and 2 show performance comparisons across arithmetic and commonsense reasoning tasks under different settings. Our method, focusing on step-level online preference learning, consistently outperforms both instance-level and offline approaches in commonsense reasoning. For example, we achieve $76.4\%$ on ARC-C and $88.5\%$ on SciQ, surpassing $70.8\%$ and $87.4\%$ of the offline variant, and $75.3\%$ and $87.6\%$ of the instance-level approach.
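For reference, the DPO objective driving these updates can be sketched as follows for a single step-level pair, assuming the per-step log-probabilities under the policy and the frozen reference model are already computed (a simplified scalar version; batched tensor implementations differ only in bookkeeping).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    for a preferred step w and a dispreferred step l."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy raises the preferred step's likelihood relative to the reference model faster than the dispreferred step's, which is exactly the step-level alignment signal the MCTS pairs provide.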
*[x7.png: bar chart of accuracy (0–100) comparing the initial model, the trained model, and the accuracy of the training signals (G.T. Supervision) across settings: ARC-C step-level (60.6 / 76.4 / 94.4), ARC-C instance-level (60.6 / 75.3 / 89.2), MATH step-level (29.0 / 32.2 / 81.2), MATH instance-level (29.0 / 32.9 / 87.9), and MATH step-level w/ G.T. (29.0 / 35.4 / 97.9).]*
Figure 4: Ablation study on step- vs. instance-level supervision, online vs. offline learning, and w/ vs. w/o G.T. supervision. We also compare the accuracy of the training signals collected via MCTS with that of instance-level sampling.
Table 4: Qualitative analysis of model outputs on OBQA, AI2Sci, and MATH. We highlight correct and incorrect keywords. The predicted answers on MATH are boxed; our method solves the complex problem while the SFT baseline fails to elicit a complete reasoning chain.
*(MATH excerpts from Table 4: given $f(x)=x^{7}+ax^{3}+cx$ with $f(5)=3$, the SFT baseline mistakenly derives $f(-5)=\boxed{12-5c}$, while our model exploits the odd symmetry of $f$ to obtain $f(-5)=-f(5)=\boxed{-3}$; given $f(c)=\frac{3}{2c-3}$ with $f^{-1}(c)\times c\times f(c)$ expressed as $\frac{kc+l}{mc+n}$, our model evaluates $\frac{kn^{2}}{lm}=\boxed{9}$.)*
In arithmetic reasoning, performance differences among settings are less pronounced on challenging tasks such as MATH when ground-truth solutions are not incorporated (e.g., $32.2\%$ for our method (w/o G.T.) vs. $31.9\%$ and $32.9\%$ for the offline and instance-level variants on MATH). The comparable performance of offline learning aligns with our theoretical analysis: offline approaches can be effective when the initial policy is already well-tuned with high-quality, in-domain data. Figure 4 further illustrates how integrating G.T. guidance enhances the effectiveness of our framework. With G.T. supervision, the accuracy of the training signals improves significantly from $81.2\%$ to $97.9\%$ , leading to a substantial gain in model performance. This also explains the similar performance (w/o G.T.) of step- and instance-level supervision: the step-level approach is effective mainly in narrowing the gap between the accuracies of the corresponding supervision signals.
Training Dynamics in Iterative Learning.
As shown in Figure 2, online learning exhibits cyclic performance fluctuations, with validation performance peaking before dipping. We conduct a theoretical analysis of this in Appendix B and show that continuously updating the policy with the latest models can lead to periodic knowledge loss due to insufficient optimization in iterative updates. We probe these phenomena qualitatively next.
Qualitative Analysis.
Our qualitative analysis in Table 4 examines the impact of step-level supervision on intermediate reasoning correctness across different tasks. In OBQA, the implementation of MCTS, as discussed in Section 4.1, often leads to longer reasoning chains. This can introduce errors in commonsense reasoning tasks, as seen in our OBQA example, where an extended chain results in an incorrect final prediction. Conversely, on the MATH dataset, our approach successfully guides the model to rectify mistakes and formulate accurate, extended reasoning chains, demonstrating its effectiveness on complex math word problems. This analysis underscores the need to balance reasoning chain length and logical coherence, particularly in tasks with higher uncertainty, such as commonsense reasoning.
5 Related Work
Various studies focus on self-improvement to exploit the model's capability. One line of work collects high-quality positive data from model generations guided by static reward heuristics (Zelikman et al., 2022; Gülçehre et al., 2023; Polu et al., 2023). Recently, Yuan et al. (2024) utilize continuously updated LLM self-rewarding to collect both positive and negative data for preference learning. Fu et al. (2023) adopt an exploration strategy via rejection sampling for online data collection in iterative preference learning. Unlike prior work on instance-level alignment, we leverage MCTS as a policy improvement operator to iteratively facilitate step-level preference learning. We discuss additional related work in Appendix A.
6 Conclusion
In this paper, we propose MCTS-enhanced iterative preference learning, utilizing MCTS as a policy improvement operator to enhance LLM alignment via step-level preference learning. MCTS balances quality exploitation and diversity exploration to produce high-quality training data, efficiently pushing the ceiling performance of the LLM on various reasoning tasks. Theoretical analysis shows that online sampling in our iterative learning framework is key to improving the LLM policy toward optimal alignment. We hope our approach can inspire future research on LLM alignment from both data-centric and algorithmic aspects: exploring search strategies and the use of historical data and policies to augment and diversify training examples, and strategically trading off offline and online learning to address the cyclic performance fluctuations of the online learning framework discussed in our theoretical analysis.
Acknowledgments and Disclosure of Funding
The computational work for this article was partially performed on resources of the National Supercomputing Centre (NSCC), Singapore https://www.nscc.sg/.
References
- Anthony et al. (2017) Thomas Anthony, Zheng Tian, and David Barber. 2017. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5360–5370.
- Azar et al. (2023) Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. 2023. A general theoretical paradigm to understand learning from human preferences. CoRR, abs/2310.12036.
- Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862.
- Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- Bradley and Terry (1952) Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Christiano et al. (2017) Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- Coulom (2006) Rémi Coulom. 2006. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer.
- Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. 2023. MME: A comprehensive evaluation benchmark for multimodal large language models. CoRR, abs/2306.13394.
- Grill et al. (2020) Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. 2020. Monte-carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pages 3769–3778. PMLR.
- Gülçehre et al. (2023) Çaglar Gülçehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. 2023. Reinforced self-training (rest) for language modeling. CoRR, abs/2308.08998.
- Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 8154–8173. Association for Computational Linguistics.
- He et al. (2020) Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. 2020. Revisiting self-training for neural sequence generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
- Huang et al. (2023) Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1051–1068. Association for Computational Linguistics.
- III (1965) H. J. Scudder III. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. Theory, 11(3):363–371.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language models (mostly) know what they know. CoRR, abs/2207.05221.
- Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer.
- Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: scaling reinforcement learning from human feedback with AI feedback. CoRR, abs/2309.00267.
- Li et al. (2023) Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. CoRR, abs/2305.20050.
- Liu et al. (2023a) Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2023a. Making PPO even better: Value-guided monte-carlo tree search decoding. CoRR, abs/2309.15028.
- Liu et al. (2023b) Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, and Asli Celikyilmaz. 2023b. Crystal: Introspective reasoners reinforced with self-feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11557–11572, Singapore. Association for Computational Linguistics.
- Liu et al. (2023c) Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. 2023c. Statistical rejection sampling improves preference optimization. CoRR, abs/2309.06657.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2381–2391. Association for Computational Linguistics.
- Mitchell (2023) Eric Mitchell. 2023. A note on DPO with noisy preferences & relationship to IPO.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
- Park et al. (2020) Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. 2020. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629.
- Polu et al. (2023) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. 2023. Formal mathematics statement curriculum learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. CoRR, abs/2305.18290.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Ren et al. (2023) Jie Ren, Yao Zhao, Tu Vu, Peter J Liu, and Balaji Lakshminarayanan. 2023. Self-evaluation improves selective generation in large language models. arXiv preprint arXiv:2312.09300.
- Rosin (2011) Christopher D Rosin. 2011. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR, abs/1707.06347.
- Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815.
- Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize from human feedback. CoRR, abs/2009.01325.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
- Uesato et al. (2022) Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Y. Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. CoRR, abs/2211.14275.
- Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
- Wang et al. (2023a) Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. 2023a. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
- Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017, Copenhagen, Denmark, September 7, 2017, pages 94–106. Association for Computational Linguistics.
- Wu et al. (2023) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023. Fine-grained human feedback gives better rewards for language model training. CoRR, abs/2306.01693.
- Xie et al. (2020) Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves imagenet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10684–10695. Computer Vision Foundation / IEEE.
- Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. 2023. Decomposition enhances reasoning via self-evaluation guided decoding. CoRR, abs/2305.00633.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. CoRR, abs/2305.10601.
- Yarowsky (1995) David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, 26-30 June 1995, MIT, Cambridge, Massachusetts, USA, Proceedings, pages 189–196. Morgan Kaufmann Publishers / ACL.
- Yu et al. (2023a) Fei Yu, Anningzhe Gao, and Benyou Wang. 2023a. Outcome-supervised verifiers for planning in mathematical reasoning. CoRR, abs/2311.09724.
- Yu et al. (2023b) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023b. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
- Yuan et al. (2024) Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020.
- Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. Star: Bootstrapping reasoning with reasoning. In NeurIPS.
- Zhang et al. (2023) Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, and Joseph E. Gonzalez. 2023. The wisdom of hindsight makes language models better instruction followers. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 41414–41428. PMLR.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. CoRR, abs/2306.05685.
- Zhu et al. (2023) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2023. Solving math word problems via cooperative reasoning induced language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4471–4485. Association for Computational Linguistics.
- Zoph et al. (2020) Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. 2020. Rethinking pre-training and self-training. Advances in neural information processing systems, 33:3833–3845.
Appendix A Related Work
Iterated Learning.
Typical iterated learning operates in a multi-agent scenario, consisting of a loop in which an apprentice self-plays, learns from expert feedback, and replaces the current expert for the next iteration (Anthony et al., 2017). Polu et al. (2023) apply expert iteration to formal mathematical reasoning, interleaving proof search with learning. Zelikman et al. (2022) avoid training a separate value function by directly assessing the final outcomes of reasoning to filter generated examples for iterated learning. Recently, Yuan et al. (2024) leveraged the technique of LLM-as-a-Judge (Zheng et al., 2023) and introduced self-rewarding language models to improve LLM alignment with self-feedback. In contrast, we combine the feedback of outcome assessment and LLM self-evaluation, and further decompose it into fine-grained signals via MCTS for step-level iterative preference learning.
Self-Training.
Self-training uses unlabeled data to improve model training by assigning pseudo-labels from a learned labeler (III, 1965; Yarowsky, 1995; Xie et al., 2020; He et al., 2020; Park et al., 2020; Zoph et al., 2020). Recent research has explored several alternatives for labeling the examples. Zelikman et al. (2022) and Gülçehre et al. (2023) use a static reward heuristic to curate high-quality examples, while Huang et al. (2023) collect high-confidence outputs as training data via chain-of-thought prompting (Wei et al., 2022) and self-consistency (Wang et al., 2023b). Lee et al. (2023) and Yuan et al. (2024) use an off-the-shelf LLM to reward its own generations for preference learning. To mitigate the noise from sparse instance-level signals, we further refine the preference labels via stepwise tree search and LLM self-evaluation.
Preference Learning.
The empirical success of LLMs has demonstrated the benefits of RL techniques for better alignment with human preferences (Touvron et al., 2023; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a). The standard preference learning process trains a reward model to provide feedback in online RL (Schulman et al., 2017). Recently, a variety of studies have avoided training separate reward or value networks via hindsight instruction relabeling (Zhang et al., 2023), direct preference optimization (Rafailov et al., 2023), and LLM self-evaluation (Ren et al., 2023). We further explore automatic supervision with MCTS to collect step-level preferences by breaking down outcome correctness integrated with self-evaluation. Our approach enables the continual collection of better-quality data via iterative learning, mitigating the data limitations of a frozen reward model or offline learning algorithms.
Guided Search for Reasoning.
Recent works improve LLM reasoning by eliciting the intermediate reasoning chain (Wei et al., 2022) and breaking it down into multiple steps via search (Yao et al., 2023; Hao et al., 2023; Yu et al., 2023a). Decomposing the reasoning process has also proven effective in reinforcement learning. Lightman et al. (2023) and Li et al. (2023) apply process supervision to train reward models that are more reliable than outcome-supervised ones in mathematical reasoning (Uesato et al., 2022). Wang et al. (2023a) reinforce LLMs step by step with process supervision data collected automatically via model sampling and annotation. We leverage the look-ahead ability of MCTS and integrate it with step-by-step self-evaluation to provide refined process supervision for reasoning. This improves the generalization of our framework, which iteratively updates the policy with preferences collected in real time.
Appendix B Theoretical Analysis of Online DPO
Preliminaries.
A typical alignment technique begins with a policy $\pi_{\mathrm{sft}}(y\mid x)$ fine-tuned with supervision on high-quality data from the target domain, where $x$ and $y$ are the prompt and the response, respectively. The SFT policy is used to sample pairs of responses $(y_{1},y_{2})\sim\pi_{\mathrm{sft}}(y\mid x)$ given prompts $x$, which are then labeled as pairwise preference data $y_{w}\succ y_{l}\mid x$, where $y_{w}$ and $y_{l}$ denote the preferred and dispreferred responses, respectively. The standard RLHF paradigm trains a reward model (Ouyang et al., 2022) on the preference data and employs PPO (Schulman et al., 2017) to optimize the policy $\pi_{\theta}$ with the feedback provided by the reward model, where $\pi_{\theta}$ is also initialized to $\pi_{\mathrm{sft}}$ in practice. DPO avoids fitting a reward model by optimizing the policy $\pi_{\theta}$ on preferences directly.
Given a reward function $r(x,y)$ and prompt distribution $\mathcal{P}$ , RLHF and DPO optimize the KL-constrained reward maximization objective as follows:
$$
\max_{\pi}\mathbb{E}_{x\sim\mathcal{P},y\sim\pi}[r(x,y)]-\beta\,\mathbb{D}_{\mathrm{KL}}[\pi(y\mid x)\parallel\pi_{\mathrm{sft}}(y\mid x)] \tag{9}
$$
where $\beta$ scales the strength of the KL constraint. Let the ground-truth reward function be $r^{*}$ , then Rafailov et al. (2023) estimate the optimal policy $\pi^{*}$ by fitting the Bradley-Terry model (Bradley and Terry, 1952) on preference data:
$$
p^{*}(y_{1}\succ y_{2}\mid x)=\sigma(r^{*}(x,y_{1})-r^{*}(x,y_{2}))=\frac{1}{1+\exp\left(\beta\log\frac{\pi^{*}(y_{2}\mid x)}{\pi_{\mathrm{sft}}(y_{2}\mid x)}-\beta\log\frac{\pi^{*}(y_{1}\mid x)}{\pi_{\mathrm{sft}}(y_{1}\mid x)}\right)} \tag{10}
$$
As the maximum likelihood estimator (MLE) of the optimal policy requires preferences sampled from the target policy (Liu et al., 2023c), DPO relies on a fixed, potentially optimal but unknown policy to collect good-quality preference data. This discrepancy becomes a problem when the sampling policy differs dramatically from the current policy. Moreover, the absence of a reward model in DPO presents challenges in learning from additional policy-generated data that lacks explicit preference labels. We further discuss the offline and online settings of DPO in Section 3.
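As a concrete sketch, the per-pair DPO objective implied by Eq. 10 can be computed directly from policy and reference log-probabilities. The function below is a minimal illustration; the function name and toy numbers are ours, not from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the scaled difference of policy-vs-reference
    log-ratios for the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # = -log(sigmoid(margin))

# If the policy already separates the pair better than the reference does,
# the margin is positive and the loss falls below log(2).
loss = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
```

When policy and reference agree (zero margin), the loss is exactly $\log 2$; widening the preferred-vs-dispreferred margin drives it toward zero.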
Additional details on labeling outcomes.
After a pair of outcomes $(y^{(i)},{y^{\prime}}^{(i)})$ is sampled from some sampling policy $\pi^{(i)}$, they are labeled as $y_{w}^{(i)}$ and $y_{l}^{(i)}$ according to some preference density $p$. That is, $\Pr[(y_{w}^{(i)},y_{l}^{(i)})=(y^{(i)},{y^{\prime}}^{(i)})]=p(y^{(i)}\succ{y^{\prime}}^{(i)}\mid x)$ and $\Pr[(y_{w}^{(i)},y_{l}^{(i)})=({y^{\prime}}^{(i)},y^{(i)})]=1-p(y^{(i)}\succ{y^{\prime}}^{(i)}\mid x)$. For simplicity, the preference density is set to $p(y^{*}\succ\bar{y}\mid x)=1$ for every optimal-suboptimal pair $(y^{*},\bar{y})$ with $\bar{y}\in Y$. We do not specify the preference density for other pairs, i.e., $p(\bar{y}\succ\bar{y}^{\prime}\mid x)$ is arbitrary for $(\bar{y},\bar{y}^{\prime})\in Y\times Y$.
Abstract formulation for both offline and online settings.
Our abstract formulation covers both the offline and online settings. The offline setting in previous papers is obtained by setting $\pi^{(i)}$ to a single distribution fixed over $i\in\{1,2,\dots,M\}$, e.g., an initial policy, an optimal policy, or the empirical data distribution of a given preference dataset. In the case of the empirical data distribution, the preference density $p$ is set to a function outputting only $0$ or $1$ to recover the given preference data. The online setting is obtained by setting $\pi^{(i)}=\pi_{\theta^{(i-1)}}$, where $\pi_{\theta^{(i-1)}}$ is the latest policy at the beginning of the $i$-th iteration, i.e., for $i\geq 1$, $\theta^{(i)}$ satisfies $\ell_{j}(\theta^{(i)})=0$ for $j\in\{1,2,\dots,i-1\}$ and $\theta^{(0)}$ is the initialization. Thus, we can analyze both offline and online settings with this abstract framework.
Proof of Theorem 3.1.
*Proof.*
The intuition behind the proof of Theorem 3.1 is that the current policy $\pi_{\theta^{(i)}}$ may not be corrected if a fixed sampling policy $\pi$ never samples a suboptimal output $\bar{y}\in Y$ whose probability is high under the current policy $\pi_{\theta^{(i)}}$. Let $\bar{y}$ be the suboptimal output such that $\pi(\bar{y}\mid x)\leq\epsilon$ and $\pi_{\theta^{(i)}}(\bar{y}\mid x)\geq c$ for some $i\in\{1,2,\dots,M\}$. Denote preferences sampled by policy $\pi^{(i)}$ as $(y_{w}^{(i)},y_{l}^{(i)})$. From the definition of the logistic function, we can rewrite
$$
\ell_{i}(\theta)=-\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}^{(i)}\mid x)}{\pi_{\mathrm{ref}}(y_{w}^{(i)}\mid x)}-\beta\log\frac{\pi_{\theta}(y_{l}^{(i)}\mid x)}{\pi_{\mathrm{ref}}(y_{l}^{(i)}\mid x)}\right)
$$
From this equation, we observe that $\ell_{i}(\theta)$ can be driven to zero by minimizing $\pi_{\theta}(y_{l}^{(i)}\mid x)$ to zero, without maximizing $\pi_{\theta}(y_{w}^{(i)}\mid x)$. That is, for any $\beta>0$, if $\pi_{\theta}(y_{l}^{(i)}\mid x)=0$,
$$
\ell_{i}(\theta)=-\log\frac{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}}{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}+0}=-\log 1=0.
$$
Thus, even if we sample $y^{*}$ with the optimal policy, $\ell_{i}(\theta)$ can be minimized without maximizing $\pi_{\theta}(y^{*}\mid x)$ or minimizing $\pi_{\theta}(\bar{y}\mid x)$ for $\bar{y}\neq y_{l}^{(i)}$. Thus, if $\bar{y}\neq y_{l}^{(i)}$ for all $i\in\{1,2,\dots,M\}$, there exists $\theta$ such that $\ell_{i}(\theta)\leq 0$ for all $i=1,\dots,M$, and
$$
\pi_{\theta}(\bar{y}\mid x)\geq c,
$$
because of the condition that $\pi_{\theta}(\bar{y}\mid x)\geq c$ for some $i\in\{1,2,\dots,M\}$: i.e., $\pi_{\theta}(\bar{y}\mid x)$ is never pushed down from the $i$-th iteration onward while $\ell_{i}(\theta)$ is minimized arbitrarily well, as long as $\bar{y}$ is never sampled. Therefore, if $\bar{y}$ is never sampled over $m$ iterations, since the probabilities sum to one, we have
$$
\pi_{\theta}(y^{*}\mid x)\leq 1-\pi_{\theta}(\bar{y}|x)\leq 1-c.
$$
Moreover,
$$
\Pr[\bar{y}\text{ is never sampled over }m\text{ iterations}]\geq(1-2\epsilon)^{m}\geq 1-2\epsilon m,
$$
where the last inequality follows from Bernoulli's inequality. Combining the above two equations, it holds that
$$
\Pr[\pi_{\theta}(y^{*}|x)\leq 1-c]\geq 1-2\epsilon M.
$$
∎
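The failure case above can be made concrete with a toy three-output example. All distributions and constants below are ours, chosen to satisfy the theorem's conditions (a suboptimal $\bar{y}$ with $\pi_{\theta}(\bar{y}\mid x)\geq c=0.8$ that the sampling policy never produces):

```python
import math

beta = 0.1
# Toy vocabulary: y_star (optimal), y_bar (suboptimal but high-probability
# under the trained policy, never sampled), y_l (the sampled dispreferred
# output). These toy distributions are ours, for illustration only.
ref = {"y_star": 0.2, "y_bar": 0.4, "y_l": 0.4}       # reference policy
theta = {"y_star": 0.2, "y_bar": 0.8, "y_l": 1e-30}   # policy after training

def dpo_loss(pi, y_w, y_l):
    # -log sigmoid(beta * (winner log-ratio - loser log-ratio))
    margin = beta * (math.log(pi[y_w] / ref[y_w]) - math.log(pi[y_l] / ref[y_l]))
    return math.log(1.0 + math.exp(-margin))

# The loss on the sampled pair (y_star > y_l) is driven to ~0 purely by
# shrinking pi_theta(y_l | x); pi_theta(y_bar | x) stays large and
# pi_theta(y_star | x) never grows -- the failure case of Theorem 3.1.
loss = dpo_loss(theta, "y_star", "y_l")
```

Here the loss is nearly zero even though the policy still assigns probability $0.8$ to the unsampled suboptimal output.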
Proof of Theorem 3.2.
*Proof.*
From the proof of Theorem 3.1, we have
$$
\ell_{i}(\theta)=-\log\frac{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}}{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}+\pi_{\theta}(y_{l}^{(i)}\mid x)^{\beta}\left(\frac{\pi_{\mathrm{ref}}(y_{w}^{(i)}\mid x)}{\pi_{\mathrm{ref}}(y_{l}^{(i)}\mid x)}\right)^{\beta}}.
$$
For $\alpha\geq 0$ and $\beta>0$, the condition $\ell_{i}(\theta)\leq\alpha$ implies that
$$
-\log\frac{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}}{\pi_{\theta}(y_{w}^{(i)}\mid x)^{\beta}+\pi_{\theta}(y_{l}^{(i)}\mid x)^{\beta}\left(\frac{\pi_{\mathrm{ref}}(y_{w}^{(i)}\mid x)}{\pi_{\mathrm{ref}}(y_{l}^{(i)}\mid x)}\right)^{\beta}}\leq\alpha
$$
Since $\pi_{\theta}(y_{w}^{(i)}\mid x)\leq 1$, this implies that
$$
\pi_{\theta}(y_{l}^{(i)}\mid x)\leq\pi_{\theta}(y_{w}^{(i)}\mid x)(\exp(\alpha)-1)^{1/\beta}\frac{\pi_{\mathrm{ref}}(y_{l}^{(i)}\mid x)}{\pi_{\mathrm{ref}}(y_{w}^{(i)}\mid x)}
$$

Thus, while we can prove a similar statement for $\alpha>0$ with this equation, we set $\alpha=0$ in this theorem for a cleaner insight, yielding the following: the condition $\ell_{i}(\theta)\leq 0$ implies that
$$
\pi_{\theta}(y_{l}^{(i)}\mid x)=0.
$$
Since $y^{(i)}$ and ${y^{\prime}}^{(i)}$ are sampled from $\pi_{\theta^{(i)}}$ without replacement, this means that $\pi_{\theta^{(i+k)}}(y_{l}^{(i)}\mid x)=0$ for all $k\geq 1$ by the definition of $\pi_{\theta^{(i)}}$: i.e., $\pi_{\theta^{(i)}}$ is the policy such that $\ell_{j}(\theta^{(i)})=0$ for all $j=1,\dots,i-1$. Since $\pi_{\theta^{(i+k)}}$ is then used to sample $y^{(i+k)}$ and ${y^{\prime}}^{(i+k)}$ in the following iterations for $k\geq 1$, we will never sample this $y_{l}^{(i)}$ again. Thus, at each iteration, we always sample pairs $y$ and $y^{\prime}$ that do not include any output judged dispreferred in a previous iteration. This implies that at each iteration we increase the number of suboptimal outputs $\bar{y}\in Y$ such that $\pi_{\theta^{(i)}}(\bar{y}\mid x)=0$. In other words, we have
$$
|\{\bar{y}\in Y\mid\pi_{\theta^{(i)}}(\bar{y}\mid x)=0\}|\geq i-1.
$$
Thus,
$$
\pi_{\theta^{(i)}}(y^{*}\mid x)=1-\sum_{j=1}^{n}\pi_{\theta^{(i)}}(\bar{y}_{j}\mid x)=1-\sum_{j\in S}\pi_{\theta^{(i)}}(\bar{y}_{j}\mid x),
$$
where $|S|\leq n+1-i$. Therefore, $\pi_{\theta^{(i)}}(y^{*}\mid x)=1$ when $i\geq n+1$. ∎
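The elimination argument in this proof can be sketched numerically. The toy distribution and loop below are ours: in the online setting, each iteration samples from the current policy and drives the sampled dispreferred output's probability to zero, so with $n=3$ suboptimal outputs all mass reaches $y^{*}$ within $n$ iterations:

```python
# Toy illustration of Theorem 3.2's convergence argument (setup is ours).
probs = {"y_star": 0.25, "y1": 0.25, "y2": 0.25, "y3": 0.25}

def renormalize(p):
    """Rescale probabilities to sum to one after zeroing an output."""
    z = sum(p.values())
    return {k: v / z for k, v in p.items()}

for y_l in ["y1", "y2", "y3"]:  # one suboptimal output eliminated per iteration
    probs[y_l] = 0.0            # ell_i(theta) <= 0 forces pi(y_l | x) = 0
    probs = renormalize(probs)

# After the loop, probs["y_star"] == 1.0
```

Each pass removes one suboptimal output from the support, mirroring the bound $|\{\bar{y}:\pi_{\theta^{(i)}}(\bar{y}\mid x)=0\}|\geq i-1$.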
Additional discussion.
We list additional insights gained from the theoretical analysis.
$\bullet$ The proofs of Theorems 3.1 – 3.2 suggest that a better sampling policy is to use both the current policy and the optimal policy at the same time in the preference learning loss, i.e., sample $y\sim\pi^{*}$ and $y^{\prime}\sim\pi_{\theta^{(i-1)}}$ . This avoids the failure case of Theorem 3.1 and improves the convergence speed in Theorem 3.2. However, since we cannot access the optimal policy $\pi^{*}$ in practice, Theorems 3.1 – 3.2 motivate online DPO. Online DPO is proven to enable us to converge to an optimal policy even if we cannot sample outputs from the optimal policy.
$\bullet$ The proofs of Theorems 3.1 – 3.2 suggest that if we can sample from the optimal policy, then we can also use the samples of the optimal policy with the negative log-likelihood loss $-\log\pi_{\theta}(y^{*}\mid x)$ instead of DPO loss to avoid the failure case.
$\bullet$ The proofs of Theorems 3.1 – 3.2 suggest that in the online setting, we should minimize the DPO loss to a sufficiently low value in each online iteration, i.e., take several rounds of minimization per online iteration rather than a single one. This is because the proofs show that we can enter a cyclic situation in the online setting if the DPO loss is not minimized sufficiently per iteration. For example, we may sample $\bar{y}_{1}$ and $\bar{y}_{2}$ in one iteration and $\bar{y}_{2}$ and $\bar{y}_{3}$ in another, where $\bar{y}_{1}\succ\bar{y}_{2}\succ\bar{y}_{3}$. If the probability of sampling $\bar{y}_{2}$ is not minimized sufficiently in the first iteration, it can be sampled again in the second, where its probability can be increased since $\bar{y}_{2}\succ\bar{y}_{3}$. This can repeat indefinitely. Thus, it is important to minimize the DPO loss with several optimizer steps per online iteration.
Appendix C Implementation Details
We use Mistral-7B as our base pre-trained model. The supervised fine-tuning and preference learning experiments are conducted on at most four 40GB NVIDIA A100 GPUs.
We use learning rates of 5e-6 and 1e-6 for SFT and DPO training, respectively, with a cosine learning rate scheduler. The maximum sequence length is $512$. We train with batch sizes of $128$ for SFT and $32$ for DPO. For DPO, we follow Rafailov et al. (2023) in setting the KL constraint parameter $\beta$ to $0.1$. Each sample in DPO is a set of step-level preference data decomposed by MCTS. We set the maximum length of each step to $64$ tokens. The number of MCTS iterations is set to $K=5$ for all tasks.
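For reference, the training hyperparameters above can be collected into a single configuration sketch. The dictionary keys and structure are ours; the values come from the text:

```python
# Hyperparameter summary (structure is ours; values as reported above).
CONFIG = {
    "base_model": "Mistral-7B",
    "sft": {"lr": 5e-6, "batch_size": 128},
    "dpo": {"lr": 1e-6, "batch_size": 32, "beta": 0.1},  # KL constraint beta
    "lr_scheduler": "cosine",
    "max_seq_len": 512,
    "max_step_len": 64,    # max tokens per MCTS-decomposed step
    "mcts_iterations": 5,  # K
}
```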
For arithmetic reasoning, we combine the problems in the GSM8K and MATH training sets as the prompt data, a total of $24$K samples for preference learning. For each sample, we conduct MCTS with an initial breadth of $b_{1}=5$, decreased to $b_{2}=3$ for subsequent steps, with a maximum search depth of $d=4$. It takes about $2$ minutes per sample to collect the step-level preferences via MCTS, which amounts to about 30 A100 days of compute to train one full epoch. In practice, we can stop early when performance saturates, which usually requires only $30\%$ of the training data.
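The compute estimate above can be checked with simple arithmetic, assuming the roughly 2 minutes per sample refers to single-GPU time:

```python
# Back-of-envelope check of the quoted "about 30 A100 days".
samples = 24_000          # GSM8K + MATH prompts
minutes_per_sample = 2    # MCTS preference collection per sample
gpu_hours = samples * minutes_per_sample / 60
a100_days = gpu_hours / 24

assert gpu_hours == 800.0
assert 30 <= a100_days <= 35  # ~33 A100-days, consistent with "about 30"
```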
For commonsense reasoning, we combine the training data of ARC, AI2Science, OBQA, and CSQA, a total of $12$K samples. As the model generations are more diverse on these tasks, we set the initial breadth to $b_{1}=4$ and decrease it to $b_{2}=2$ for subsequent steps. As the intermediate reasoning chains are shorter than those in arithmetic reasoning, we set the maximum search depth to $d=3$. We likewise stop early at around $50\%$ of the training progress, where performance saturates.
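Under these settings, the maximal number of distinct root-to-leaf reasoning paths is $b_{1}\cdot b_{2}^{d-1}$. The sketch below computes this bound for both task families; note it is only an upper bound, since MCTS expands the tree selectively rather than exhaustively:

```python
def max_leaf_paths(b1, b2, depth):
    """Upper bound on complete reasoning paths: b1 candidates at the first
    step, b2 candidates at each of the remaining depth-1 steps."""
    return b1 * b2 ** (depth - 1)

assert max_leaf_paths(5, 3, 4) == 135  # arithmetic reasoning setting
assert max_leaf_paths(4, 2, 3) == 16   # commonsense reasoning setting
```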
Hyperparameter Tuning of MCTS.
We compare commonsense reasoning performance when employing different search breadths in MCTS. Table 5 shows how different search heuristics impact learning performance. O2 produces better performance, highlighting the importance of enlarging the search space at the initial step of MCTS. One can efficiently reduce compute while maintaining good performance by using a smaller search space for subsequent steps. In future work, we will explore the MCTS hyperparameter settings, including the search breadth, depth, number of steps, and number of iterations, to probe the cost–performance tradeoff of our MCTS-enhanced iterative learning framework.
| | ARC-E | ARC-C | AI2Sci-E | AI2Sci-M | OBQA | CSQA | SciQ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| O1 ($b_{1}=3, b_{2}=3$) | $88.4$ | $74.7$ | $92.1$ | $88.5$ | $77.8$ | $73.2$ | $88.3$ |
| O2 ($b_{1}=4, b_{2}=2$) | $88.5$ | $76.4$ | $91.7$ | $88.2$ | $79.2$ | $74.8$ | $88.5$ |
Table 5: Result comparison of using different search breadths in MCTS. For O2, we have a broader spectrum for the initial step and narrow the search space for the subsequent steps of each path.
Prompt Example.
See an example of the evaluation prompt we use for self-evaluation in Table 6. For more details, please refer to our implementation code.
Table 6: Evaluation Prompt Template. The underlined text will be replaced with content from different examples.
| QUESTION: Which of the following is an example of the formation of a mixture? |
| --- |
| Answer Choices: (A) rust forming on an iron nail (B) sugar crystals dissolving in water (C) sodium and chlorine forming table salt (D) hydrogen and oxygen reacting to produce water |
| EXAMPLE ANSWER: The answer is (B) sugar crystals dissolving in water |
| PROPOSED SOLUTION: The formation of a mixture occurs when two or more substances are combined together without changing their individual properties. In the given options, rust forming on an iron nail is an example of the formation of a mixture. The iron nail and the oxygen in the air combine to form iron oxide, which is a mixture. The answer is A. |
| QUESTION: Evaluate if the proposed solution is logically heading in the correct direction. Provide an answer of (A) correct or (B) incorrect. |
| ANSWER: The answer is |
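Programmatically, filling this template amounts to simple string formatting. The sketch below is illustrative: the helper name and exact template wording are assumptions, and the released code should be consulted for the actual prompt:

```python
# Illustrative self-evaluation prompt builder (template wording approximate).
EVAL_TEMPLATE = (
    "QUESTION: {question}\n"
    "EXAMPLE ANSWER: {example_answer}\n"
    "PROPOSED SOLUTION: {solution}\n"
    "QUESTION: Evaluate if the proposed solution is logically heading "
    "in the correct direction.\n"
    "Provide an answer of (A) correct or (B) incorrect.\n"
    "ANSWER: The answer is"
)

def build_eval_prompt(question, example_answer, solution):
    """Fill the self-evaluation template for one proposed partial solution."""
    return EVAL_TEMPLATE.format(
        question=question, example_answer=example_answer, solution=solution
    )

prompt = build_eval_prompt(
    "Which of the following is an example of the formation of a mixture?",
    "The answer is (B) sugar crystals dissolving in water",
    "Rust forming on an iron nail is an example ... The answer is A.",
)
# The prompt ends mid-sentence so the model completes with "(A)" or "(B)".
assert prompt.endswith("ANSWER: The answer is")
```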
Appendix D Further Analysis
Reward Criteria in MCTS.
We probe the effect of different reward guidance in MCTS in terms of both searching and training. Table 7 shows how different reward signals impact the pass rate of searching. The guidance from outcome correctness is dominant in eliciting correct outcomes. MCTS produces significant improvements across tasks when the reward signal integrates outcome correctness and self-evaluation, increasing the baseline performance from $60.6\%$ to $83.0\%$ on ARC-C, from $70.9\%$ to $90.5\%$ on AI2Sci-M, and from $75.9\%$ to $85.8\%$ on GSM8K. We also observe a significant performance gain from learning when using greedy decoding on commonsense reasoning. For example, learning increases accuracy to $76.4\%$ (+$16.4\%$) on ARC-C, compared to an increase of $9.1\%$ in MCTS performance. This suggests a substantial improvement in the model's policy when applying our MCTS-enhanced iterative learning to tasks where the initial policy is weak. Furthermore, the ablation on the reward components shows that self-evaluation consistently increases MCTS performance both before and after learning, demonstrating the effectiveness of integrating self-evaluation into our approach.
Table 7: Pass Rates when Ablating MCTS Settings. SE represents the guidance from self-evaluation.
| Method | Learned | ARC-C | AI2Sci-M | GSM8K |
| --- | --- | --- | --- | --- |
| Greedy Decoding | ✗ | $60.6$ | $70.9$ | $75.9$ |
| Greedy Decoding | $\checkmark$ | $76.4$ ($\uparrow 16.4$) | $88.2$ ($\uparrow 17.3$) | $80.7$ ($\uparrow 5.2$) |
| MCTS w/o SE | ✗ | $82.5$ | $87.3$ | $84.4$ |
| MCTS w/o SE | $\checkmark$ | $91.0$ ($\uparrow 8.5$) | $96.1$ ($\uparrow 9.8$) | $89.0$ ($\uparrow 5.6$) |
| MCTS | ✗ | $83.0$ | $90.5$ | $85.8$ |
| MCTS | $\checkmark$ | $92.1$ ($\uparrow 9.1$) | $97.3$ ($\uparrow 6.8$) | $90.2$ ($\uparrow 4.4$) |
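As a sketch of how the two reward components can be combined, one option is a simple weighted blend of terminal outcome correctness and the stepwise self-evaluation confidence. The weight and functional form here are illustrative assumptions, not the exact formulation used in training:

```python
def node_reward(outcome_correct, self_eval_conf, alpha=0.5):
    """Illustrative blend of outcome validation (+1 / -1) with the model's
    stepwise self-evaluation confidence in [0, 1]; alpha is hypothetical."""
    outcome = 1.0 if outcome_correct else -1.0
    return (1.0 - alpha) * outcome + alpha * self_eval_conf

# A correct outcome with a confident self-evaluation scores highest, and
# self-evaluation breaks ties among steps sharing the same outcome signal.
assert node_reward(True, 0.9) > node_reward(True, 0.2) > node_reward(False, 0.9)
```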
Qualitative Analysis on Collected Preferences.
We show examples of the resulting search trees elicited via MCTS on different tasks in Figures 5 – 9.
Figures 5 and 6 show the resulting search trees answering the same science question using MCTS with different search breadths. We see that MCTS not only figures out the correct answer (i.e., option "D") via broad searching but also serves as a policy improvement optimizer, collecting the steps along this path as positive samples for preference learning. For example, the $Q$ values of the preference pair at the last step (at the bottom right of Figure 5) are $0.70838$ and $-0.45433$, compared to the original sentence probabilities under the policy of $0.37989$ and $0.38789$. Compared to searching with breadth $b_{1}=4, b_{2}=2$ in Figure 5, Figure 6 shows that a higher breadth for subsequent steps can produce an even larger search tree. However, as we only collect preference pairs along the paths leading to correct predictions, these two search heuristics result in preference data of similar size.
Figure 7 shows the search tree produced by the trained policy on commonsense reasoning. Compared to the tree generated by the initial policy in Figure 5, the trained policy has a higher chance of eliciting correct reasoning chains, as we see more successful predictions of the ground-truth option "D". We also observe that the policy tends to generate longer reasoning chains after being encouraged to conduct chain-of-thought reasoning with fine-grained process supervision.
On arithmetic reasoning, we also probe the impact of diversity in model generations using policies trained for different numbers of SFT epochs. Figures 8 and 9 show the elicited search trees with data sampled by policies of different diversity levels, where the policy used in Figure 8 generates with higher diversity. With higher diversity, MCTS can explore more alternative correct solutions, as there are more paths with correct predictions in Figure 8 than in Figure 9. Furthermore, higher diversity with reasonable quality also provides more fine-grained supervision signals, as there are more branches along the reasoning paths of correct predictions.
Appendix E Extended Experiments
Loss Function.
DPO is one of the reward-model-free loss functions we can use for preference learning. We illustrate the generalizability of our approach using another loss function, Identity Preference Optimization (IPO) (Azar et al., 2023), which addresses the overfitting problem of DPO. Table 8 shows that IPO achieves performance similar to DPO. In practice, we find that IPO boosts reasoning on the validation tasks while maintaining more stable performance on the held-out dataset, as indicated by the higher accuracy of $89.8\%$ on SciQ.
Table 8: Result comparison of employing our approach with different loss functions.
| | ARC-E | ARC-C | AI2Sci-E | AI2Sci-M | OBQA | CSQA | SciQ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| O1 (IPO) | $88.1$ | $75.1$ | $92.1$ | $89.6$ | $76.8$ | $74.3$ | $89.8$ |
| O2 (DPO) | $88.5$ | $76.4$ | $91.7$ | $88.2$ | $79.2$ | $74.8$ | $88.5$ |
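The qualitative difference between the two losses can be seen on a single log-ratio margin $h$: DPO applies $-\log\sigma(\beta h)$, which keeps rewarding ever-larger margins, while IPO regresses $h$ towards the fixed target $1/(2\tau)$, which is what curbs overfitting. A minimal sketch:

```python
import math

def dpo_pair_loss(h, beta=0.1):
    """DPO: -log sigmoid(beta * h) on the log-ratio margin h."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * h)))

def ipo_pair_loss(h, tau=0.1):
    """IPO: squared distance of h from the target margin 1/(2*tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2

# DPO strictly decreases as the margin grows...
assert dpo_pair_loss(10.0) < dpo_pair_loss(5.0)
# ...while IPO is minimized at h = 1/(2*tau) and penalizes overshooting it.
assert ipo_pair_loss(5.0) == 0.0 and ipo_pair_loss(10.0) > 0.0
```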
Base Model.
We further validate the generalizability of our approach with Llama2-13B (Touvron et al., 2023) on arithmetic reasoning. We employ the same process of SFT on Arithmo and preference learning with DPO on GSM8K and MATH. This experiment is conducted on at most $2\times 80$ GB NVIDIA A100 GPUs.
Table 9: Result comparison (accuracy $\%$ ) for Llama2-13B on arithmetic tasks.
| Llama2-13B | | | |
| --- | --- | --- | --- |
| SFT (Arithmo) | $74.5$ | $62.3$ | $23.8$ |
| Ours | $78.9$ ($\uparrow 4.4$) | $67.0$ ($\uparrow 4.7$) | $26.1$ ($\uparrow 2.3$) |
<details>
<summary>x8.png Details</summary>

Visual description (auto-generated): a search tree over candidate reasoning paths for balancing the methane combustion equation. Each node carries a reasoning step together with its visit count, $Q$ value, and sentence probability $P$; branches lead to different answer choices, with correct and incorrect paths receiving contrasting $Q$ values.
</details>
Figure 5: Example of the resulting search tree for the science question "An unbalanced equation for the reaction of methane gas ($\mathrm{CH}_{4}$) with oxygen is shown below. $\mathrm{CH}_{4}+\Box\,\mathrm{O}_{2}\rightarrow 2\,\mathrm{CO}_{2}+4\,\mathrm{H}_{2}\mathrm{O}$. How many molecules of oxygen gas ($\mathrm{O}_{2}$) are needed to properly balance this equation? Answer Choices: (A) 1 (B) 2 (C) 3 (D) 4". The ground-truth answer is "(D) 4". Here, we set the search breadth as $b_{1}=4, b_{2}=2$. The number at the beginning of each sequence indicates the visit count $N$ of the corresponding node, while the $Q$ and $P$ values at the end of the sequence represent the $Q$ value and the sentence probability, respectively.
<details>
<summary>x9.png Details</summary>

Visual description (auto-generated): a larger search tree for the same methane combustion question, searched with breadth $b_{1}=3, b_{2}=3$. Each node carries a reasoning step with its visit count, $Q$ value, and sentence probability $P$; many branches converge on answer "B", while a few high-visit branches reach the correct answer "D".
</details>
Figure 6: Example of the result search tree of the same science question as in Figure 5. Here, we set the search breadth as $b_{1}=3,b_{2}=3$ .
<details>
<summary>x10.png Details</summary>

Visual description (auto-generated): the search tree for the same methane combustion question produced by the policy after preference learning, searched with breadth $b_{1}=4, b_{2}=2$. Each node carries a reasoning step with its visit count, $Q$ value, and sentence probability $P$; more branches now reach the correct answer "D", and the reasoning chains are noticeably longer.
</details>
Figure 7: Example of the result search tree of the same science question as in Figure 5. Here, we use the policy after preference learning and set the search breadth as $b_{1}=4,b_{2}=2$ .
<details>
<summary>x11.png Details</summary>

### Visual Description
## Decision Tree: Bread Loaf Calculation
### Overview
The image is a decision tree diagram illustrating different scenarios involving the baking, selling, and returning of bread loaves. Each node represents a step in the process, with calculations and textual descriptions of the events. The tree starts with an initial state and branches out based on different actions taken.
### Components/Axes
* **Nodes:** Each node contains a description of the scenario, calculations, and a Q and P value.
* **Branches:** Lines connecting the nodes, indicating the flow of events.
* **Root Node:** The starting point of the tree, labeled "[49] <5> (Q:0.00000) (P:-1.00000)".
* **Leaf Nodes:** The end points of the tree, providing final answers to the bread loaf calculations.
### Detailed Analysis or ### Content Details
Here's a breakdown of the tree's branches and calculations:
1. **Root Node:** "[49] <5> (Q:0.00000) (P:-1.00000)"
* This is the starting point of the decision tree.
2. **First Level Branches:**
* **Branch 1:** "[6] The bakers baked 200 loaves of bread in total. They sold 93 loaves in the morning and 39 loaves in the afternoon, so they sold 93 + 39 = 132 loaves. (Q:0.74724)(P:0.87045)"
* Calculation: 93 + 39 = 132
* Description: Bakers baked 200 loaves and sold 132.
* **Branch 2:** "[3] The bacon factory produced 200 - 93 = 107 loaves on Monday. Then they produced 200 - 39 = 161 loaves on Tuesday. (Q:-0.06717)(P:0.47814)"
* Calculations: 200 - 93 = 107, 200 - 39 = 161
* Description: Bacon factory produced 107 loaves on Monday and 161 on Tuesday.
* **Branch 3:** "[30] They sold 93 loaves in the morning and 39 loaves in the afternoon, so they sold a total of 93 + 39 = 132 loaves of bread. (Q:1.44411)(P:0.84845)"
* Calculation: 93 + 39 = 132
* Description: 132 loaves sold in total.
3. **Second Level Branches (from Branch 1):**
* **Branch 1.1:** "[0] A grocery store returned 6 unsold loaves, so they have 200 - 132 + 6 = 74 loaves left. (Q:0.02081)(P:0.79960)"
* Calculation: 200 - 132 + 6 = 74
* Description: Grocery store returned 6 loaves, resulting in 74 loaves left.
* **Branch 1.2:** "[2] They also returned 6 unsold loaves, so they had 132 - 6 = 126 loaves left. (Q:-0.88117)(P:0.89317)"
* Calculation: 132 - 6 = 126
* Description: 6 loaves returned, resulting in 126 loaves left.
* **Branch 1.3:** "[3] After selling, they had 200 - 132 = 68 loaves of bread left. The grocery store returned 6 unsold loaves, so they had 68 + 6 = 74 loaves of bread left. (Q:1.18816)(P:0.89653)"
* Calculations: 200 - 132 = 68, 68 + 6 = 74
* Description: After selling and return, 74 loaves left.
4. **Second Level Branches (from Branch 2):**
* **Branch 2.1:** "[0] So the company baked a total of 107 + 161 = 268 loaves of bread. (Q:0.00192)(P:0.62347)"
* Calculation: 107 + 161 = 268
* Description: Total loaves baked is 268.
* **Branch 2.2:** "[0] They had 161 + 107 - 6 = 252 - 6 = 246 loaves on hand They have 246 loaves of bread left. (Q:-0.01176)(P:0.51426)"
* Calculations (as written in the node): 161 + 107 = 252, 252 - 6 = 246; note that 161 + 107 is actually 268, so the node's arithmetic is incorrect.
* Description: 246 loaves of bread left.
* **Branch 2.3:** "[2] They have 107 + 161 - 6 = 254 loaves left. (Q:-1.01392)(P:0.71103)"
* Calculation (as written in the node): 107 + 161 - 6 = 254; the correct result is 262, so the node's arithmetic is incorrect.
* Description: 254 loaves left.
5. **Second Level Branches (from Branch 3):**
* **Branch 3.1:** "[11] So, they have 200 - 132 = 68 loaves of bread left. The grocery store returned 6 unsold loaves, so they have 68 + 6 = 74 loaves of bread left. (Q:1.13120)(P:0.80345)"
* Calculations: 200 - 132 = 68, 68 + 6 = 74
* Description: 74 loaves left.
* **Branch 3.2:** "[2] The grocery store returned 6 unsold loaves, so the total number of loaves of bread they did not sell is 132 + 6 = 138 loaves. (Q:-0.49569)(P:0.65347)"
* Calculation: 132 + 6 = 138
* Description: 138 loaves not sold.
* **Branch 3.3:** "[21] They had 200 loaves baked and sold 132 loaves, so they had 200 - 132 = 68 loaves of bread left. (Q:1.14610)(P:0.89809)"
* Calculation: 200 - 132 = 68
* Description: 68 loaves left.
6. **Leaf Nodes:**
* Several leaf nodes provide final answers, for example:
* "[1] Therefore, they had 126 loaves of bread left. The answer is: 126</s> (Q:-0.89611)(P:0.95100)"
* "[2] The answer is: 74</s> (Q:1.06829)(P:0.99951)"
* "[1] The answer is 254</s> (Q:-1.00238)(P:0.99992)"
* "[10] The answer is: 74</s> (Q:0.99283)(P:0.99903)"
* "[0] They started with 200 loaves and sold 138 loaves, so they have 200 - 138 = 62 loaves of bread left. The answer is: 62</s> (Q:-0.99298)(P:0.93236)"
* "[15] The grocery store returned 6 unsold loaves, so they had 68 + 6 = 74 loaves of bread left. The answer is: 74</s> (Q:1.08365)(P:0.98273)"
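Selecting a final answer from leaf nodes like these typically means taking the leaf with the highest Q value; the leaves quoted above answering "74" indeed carry the highest Q scores. A small sketch that parses the (Q:...) annotations shown above (the regex and helper names are illustrative):

```python
import re


def parse_q(leaf: str) -> float:
    """Extract the Q value from an annotation like '(Q:1.06829)(P:0.99951)'."""
    match = re.search(r"\(Q:(-?\d+\.\d+)\)", leaf)
    if match is None:
        raise ValueError(f"no Q annotation in: {leaf!r}")
    return float(match.group(1))


def best_leaf(leaves: list[str]) -> str:
    # Pick the candidate final answer with the highest Q value
    return max(leaves, key=parse_q)


leaves = [
    "[1] Therefore, they had 126 loaves of bread left. The answer is: 126</s> (Q:-0.89611)(P:0.95100)",
    "[2] The answer is: 74</s> (Q:1.06829)(P:0.99951)",
    "[1] The answer is 254</s> (Q:-1.00238)(P:0.99992)",
]
print(best_leaf(leaves))  # the '74' leaf, which has the highest Q
```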
### Key Observations
* The decision tree explores various scenarios involving bread loaf production, sales, and returns.
* Calculations are performed at each node to determine the number of loaves remaining or produced.
* The tree branches out based on different events, such as selling loaves, returning loaves, or baking more loaves.
* The leaf nodes provide the final answers to the calculations.
### Interpretation
The decision tree provides a structured way to analyze different scenarios related to bread loaf management. It demonstrates how various actions, such as baking, selling, and returning loaves, affect the final number of loaves available. The tree helps to visualize the flow of events and calculate the outcomes based on different conditions. The Q and P values associated with each node might represent some form of quality or probability assessment, but without further context, their exact meaning is unclear. The tree highlights the importance of tracking bread loaf inventory and managing sales and returns effectively.
</details>
Figure 8: Example of the result search tree of a GSM8K question “The bakers at the Beverly Hills Bakery baked 200 loaves of bread on Monday morning. They sold 93 loaves in the morning and 39 loaves in the afternoon. A grocery store returned 6 unsold loaves. How many loaves of bread did they have left?”. The example solution is “The Bakery sold 93 + 39 = 132 loaves. The Bakery made 200 loaves and sold 132, leaving 200 - 132 = 68 loaves remaining. The grocery store returned 6 loaves, so there were 6 + 68 = 74 loaves left.”. The policy we use here is the one only tuned for $1$ epoch on SFT training data. We conduct MCTS with breadth $b_{1}=5,b_{2}=3$ . Duplicate generations are merged into one node.
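The reference solution in the caption can be checked with a few lines of arithmetic:

```python
baked = 200
sold = 93 + 39        # morning sales + afternoon sales
returned = 6          # unsold loaves returned by the grocery store
left = baked - sold + returned
print(sold, left)     # 132 loaves sold, 74 loaves left
```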
<details>
<summary>x12.png Details</summary>

### Visual Description
## Decision Tree: Bakery Loaf Calculation
### Overview
The image depicts a decision tree diagram illustrating the calculation of the number of bread loaves remaining at a bakery after various transactions (baking, selling, returns). Each node represents a state in the calculation, with branches representing different events or operations. The tree starts with an initial state and branches out to show different possible outcomes based on the sequence of events.
### Components/Axes
* **Nodes:** Each node contains a text description of the current state of the problem, including the number of loaves and the operations performed. Each node also contains a Q and P value.
* **Branches:** Lines connecting the nodes, representing the flow of the calculation.
* **Root Node:** The starting point of the tree, representing the initial state.
* **Leaf Nodes:** The end points of the tree, representing the final answers.
### Detailed Analysis
Here's a breakdown of the tree, starting from the left and moving to the right:
1. **Root Node:**
* Text: "[46] <s>"
* Q Value: 0.00000
* P Value: -1.00000
2. **First Level Nodes (from left to right):**
* **Top Node:**
* Text: "[3] The bakery baked 200 loaves of bread on Monday morning. They sold 93 loaves in the morning and 39 loaves in the afternoon, so they sold a total of 93 + 39 = 132 loaves."
* Q Value: 0.25430
* P Value: 0.93239
* **Middle Node:**
* Text: "[3] They had 200 - 93 = 107 loaves left after the morning sales. They had 107 - 39 = 68 loaves left after the afternoon sales."
* Q Value: -0.52703
* P Value: 0.86016
* **Bottom Node:**
* Text: "[30] After the morning sales, they had 200 - 93 = 107 loaves of bread left. After the afternoon sales, they had 107 - 39 = 68 loaves of bread left."
* Q Value: 1.65898
* P Value: 0.80497
3. **Second Level Nodes (from left to right):**
* **Top Node:**
* Text: "[0] 200 - 132 + 6 = 78 The bakery has 78 loaves of bread left. The answer is: 78</s>"
* Q Value: -1.07053
* P Value: 0.84111
* **Middle Top Node:**
* Text: "[2] They also had 6 loaves returned by the grocery store. To find out how many loaves of bread they have left, we need to subtract the loaves they sold and the returned loaves from the total number of loaves baked."
* Q Value: -1.05085
* P Value: 0.79282
* **Middle Bottom Node:**
* Text: "[0] A grocery store returned 6 unsold loaves, so they had 132 - 6 = 126 loaves left. Therefore, the bakery had 200 - 126 = 74 loaves of bread left."
* Q Value: -0.03073
* P Value: 0.89432
* **Bottom Node:**
* Text: "[34] After the grocery store returned the loaves, they had 68 + 6 = 74 loaves of bread left."
* Q Value: 1.13097
* P Value: 0.91597
4. **Third Level Nodes (from left to right):**
* **Top Node:**
* Text: "[1] So, 200 - 132 - 6 = 62 Therefore, they have 62 loaves of bread left. The answer is: 62</s>"
* Q Value: -1.07033
* P Value: 0.91906
* **Middle Node:**
* Text: "[0] A grocery store returned 6 unsold loaves, so the total number of loaves they have left is 200 - 132 + 6 = 88 loaves."
* Q Value: -0.02430
* P Value: 0.80175
* **Bottom Node:**
* Text: "[2] They had 68 - 6 = 62 loaves of bread left after the grocery store returned the unsold loaves."
* Q Value: -1.05568
* P Value: 0.80182
5. **Fourth Level Nodes (from left to right):**
* **Top Node:**
* Text: "[1] The answer is 62</s>"
* Q Value: -1.06738
* P Value: 0.99711
* **Bottom Node:**
* Text: "[28] The answer is 74</s>"
* Q Value: 0.98327
* P Value: 0.99718
### Key Observations
* The tree explores different paths based on how the bakery accounts for sales and returns.
* The calculations involve addition and subtraction to track the number of loaves.
* The final answers vary depending on the path taken, reflecting different scenarios.
### Interpretation
The decision tree visualizes the process of calculating the remaining loaves of bread at a bakery. It demonstrates how different sequences of sales and returns affect the final count. The tree highlights the importance of accurately tracking these transactions to determine the correct number of loaves on hand. The Q and P values are not explained in the image, but they likely represent some form of quality and probability metrics associated with each node or calculation.
</details>
Figure 9: Example of the result search tree of the same GSM8K question as in Figure 8 with the same search breadth. We use the policy tuned after $3$ epochs to sample the generations.