# ThoughtSculpt: Reasoning with Intermediate Revision and Search
**Authors**: Yizhou Chi, Kevin Yang, Dan Klein, UC Berkeley
## Abstract
We present ThoughtSculpt, a general reasoning and search method for tasks whose outputs can be decomposed into components. ThoughtSculpt explores a search tree of potential solutions using Monte Carlo Tree Search (MCTS), building solutions one action at a time and evaluating them according to any domain-specific heuristic, which in practice is often simply an LLM evaluator. Critically, our action space includes revision actions: ThoughtSculpt may choose to revise part of its previous output rather than continuing to build the rest of its output. Empirically, ThoughtSculpt outperforms state-of-the-art reasoning methods across three challenging tasks: Story Outline Improvement (up to +30% interestingness), Mini-Crosswords Solving (up to +16% word success rate), and Constrained Generation (up to +10% concept coverage).
Yizhou Chi, Kevin Yang, Dan Klein (UC Berkeley) {yizhouchi, yangk, klein}@berkeley.edu
## 1 Introduction
While large language models (LLMs) such as GPT (Brown et al., 2020; OpenAI, 2024), LLaMA (Touvron et al., 2023a, b), and Claude (Anthropic, 2024) are increasingly capable of performing a variety of reasoning tasks, recent studies have shown that the choice of prompting strategy and instructional guidance can notably influence LLM performance on identical tasks.
Chain-of-Thought (CoT) is a prompting strategy detailed in Wei et al. (2023) that directs LLMs to produce the final task output through intermediate steps of reasoning, referred to as "intermediate thoughts." Notably, CoT has demonstrated a substantial enhancement in the problem-solving proficiency of LLMs without necessitating any model updates. Self-consistency with CoT (CoT-SC) (Wang et al., 2023a) proposes to improve output consistency by generating multiple CoTs and selecting the best outcome. Recently, extending CoT and CoT-SC, Tree-of-Thoughts (Yao et al., 2023a) and Graph-of-Thoughts (Besta et al., 2024) propose to shape the reasoning process of LLMs as a tree or an arbitrary graph structure. These approaches enable LLMs to explore different paths of thought and find better outputs by utilizing backtracking and graph-search algorithms. However, these approaches’ reasoning capabilities are often limited by the set of candidates they generate at earlier steps. They cannot revise and edit their original answers continuously in later steps. As a result, these methods may not be as effective in addressing problems that require frequent revision and modifications.
We propose ThoughtSculpt, a tree-based framework that emulates human reasoning by enabling LLMs to create interconnected thought networks. A key feature is its self-revision mechanism, which iteratively improves outputs while generating new thought nodes. To address the vast search space in text generation, we use Monte Carlo Tree Search (MCTS), which efficiently navigates the search space and provides high-quality solutions, though not necessarily globally optimal ones. Our method includes three core modules: the thought evaluator, which gives textual and numerical feedback; the thought generator, which produces solutions based on initial instructions and feedback; and the decision simulator, which simulates lines of thought within the MCTS process to assess the potential value of different paths.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Decision Simulator Process Flow
### Overview
The image is a technical diagram illustrating a "Decision Simulator" process, likely for an AI or computational system that generates and refines solutions through iterative evaluation. The diagram is divided into two main sections: a high-level four-stage process at the top, and a detailed breakdown of the "Self Evaluation" component at the bottom.
### Components/Axes
**Top Section - Main Process Flow:**
* **Title:** "Decision Simulator" (centered at the top).
* **Four Sequential Stages:** Represented by beige rectangular boxes connected by right-pointing arrows.
1. **Selection** (leftmost)
2. **Expansion**
3. **Simulation**
4. **Backpropagation** (rightmost)
* **Tree Diagrams:** Below each stage label is a tree structure composed of circles (nodes) connected by lines (edges). A small robot icon sits atop each tree. Nodes are colored either light green or light grey.
* **"Self Evaluation" Box:** A blue rectangular box labeled "Self Evaluation" is connected via a dotted line to a specific green node in the "Expansion" stage tree. This box is the focus of the detailed lower section.
**Bottom Section - Detailed "Self Evaluation" Breakdown:**
This section is enclosed in a large dotted rectangle and details the interaction between two components:
* **Left Component: "Thought Generator"**
* A large grey box containing:
* **Task description:** "Write a short and simple sentence that contains 'bartender', 'tomato', 'spatula', 'boat', 'microphone', 'vest', 'into'..."
* **Current solution:** "The bartender inserts a tomato into the boat using a spatula." (Text in green)
* **Feedback:** "Can you provide a revised solution?"
* Three smaller grey boxes below, each containing a revised sentence:
1. "The bartender drops the microphone, adjusting his vest while throw a tomato into the boat using a spatula"
2. "Using a microphone, the bartender slips a tomato into the boat, wearing a vest and holding a spatula."
3. "The bartender, wearing a vest, uses a spatula to scoop a tomato into the boat while holding a microphone."
* **Right Component: "Thought Evaluator"**
* A grey box containing:
* **Task description**
* **Current solution**
* The prompt: "Can you evaluate the current solution and provide some feedback?"
* A blue box below labeled **"Self evaluation"** containing bullet points:
* "Missing the concepts 'microphone', 'vest'..."
* "It's weird to insert a tomato into a boat"
* **Flow Arrows:**
* A green arrow points from the "Thought Generator" box to the "Thought Evaluator" box.
* A blue arrow points from the "Self evaluation" box back to the "Thought Generator" box, indicating a feedback loop.
### Detailed Analysis
The diagram depicts an iterative refinement process.
1. **High-Level Flow:** The process moves from **Selection** -> **Expansion** -> **Simulation** -> **Backpropagation**. The tree structures suggest a search or exploration of possibilities (like a Monte Carlo Tree Search), where nodes represent states or decisions.
2. **Expansion & Evaluation:** The key action happens during the **Expansion** phase. A specific node (green) is selected for detailed "Self Evaluation."
3. **Iterative Refinement Loop (Detailed View):**
* The **Thought Generator** proposes a solution ("Current solution") to a given **Task description**.
* This solution is passed to the **Thought Evaluator**.
* The **Thought Evaluator** performs a **Self evaluation**, identifying flaws (missing required concepts, logical oddities).
* This feedback is sent back to the **Thought Generator**, which then produces multiple revised candidate solutions (the three example sentences).
* The cycle implies these new candidates would re-enter the tree for further selection and evaluation.
### Key Observations
* **Color Coding:** Green nodes likely represent selected or promising paths/nodes. The blue "Self Evaluation" box highlights the critical feedback mechanism.
* **Task Specificity:** The example task is a constrained sentence-generation problem requiring specific vocabulary. The initial solution fails to include all required words ("microphone", "vest") and creates a semantically awkward phrase ("inserts a tomato into the boat").
* **Evaluation Criteria:** The "Self evaluation" checks for both **completeness** (inclusion of all required concepts) and **plausibility/coherence** ("It's weird...").
* **Output Generation:** The Thought Generator doesn't just produce one alternative; it generates multiple (three shown) revised solutions, suggesting a branching or population-based approach to finding a better answer.
### Interpretation
This diagram illustrates a **self-correcting, iterative reasoning framework** for an AI system. It moves beyond simple single-pass generation by incorporating an explicit evaluation and feedback loop.
* **The Core Mechanism:** The system simulates decision-making (the tree search) but crucially embeds a **meta-cognitive step**—self-evaluation—within the expansion phase. This allows it to critique its own outputs against the task requirements.
* **Purpose:** The goal is to improve solution quality through internal critique and regeneration. The "Backpropagation" stage in the main flow likely uses the evaluation results to update the system's understanding, guiding future selections (reinforcement learning or similar).
* **Broader Implication:** This represents a move towards more robust and reliable AI agents that can perform self-checking and iterative improvement without external human feedback for every step. The example, while simple, demonstrates how such a system could catch omissions and logical inconsistencies in its own reasoning. The process mirrors human problem-solving: try a solution, evaluate it, identify weaknesses, and try again with improvements.
</details>
Figure 1: Illustration of ThoughtSculpt using Monte Carlo Tree Search on the Constrained Generation task. Each circle in the diagram represents a thought node generated by LLMs. Selection: choose a thought node $x$ based on a selection algorithm. Expansion: a new set of child nodes $X$ is generated using the initial instruction, the current node, and self-evaluated textual feedback. The zoom-in of the expansion phase demonstrates the use of the Thought Evaluator and the Thought Generator, which entails assessing and refining the current solution for the Constrained Generation task (Section 4.3). Simulation: a single node $x^\prime$ is randomly chosen from the set $X$. This selected node $x^\prime$ generates further nodes in sequence for several steps, corresponding to our Decision Simulator. Backpropagation: the numerical feedback evaluated at the last node is propagated back to the root node.
We evaluate ThoughtSculpt on three challenging tasks for state-of-the-art language models: Story Outline Improvement, Mini-Crosswords Solving, and Constrained Generation. These tasks require advanced reasoning skills, varying degrees of exploration, and the ability to self-revise in order to achieve optimal results. Compared to state-of-the-art reasoning strategies as baselines, ThoughtSculpt exhibits up to a 30% interestingness increase in Story Outline Improvement, up to a 16% word success rate increase in Mini-Crosswords Solving, and up to a 10% concept coverage improvement in Constrained Generation. These findings underscore the efficacy of ThoughtSculpt across diverse tasks.
## 2 Related Works
Feedback Guided Generation.
Human feedback has been shown to be effective in improving LLMs’ generation Tandon et al. (2022); Elgohary et al. (2021); Bai et al. (2022). However, human feedback is often costly and difficult to incorporate into an automated generation process. As a result, some works adopt a heuristic function as an alternative to human feedback (Liu et al., 2022; Lu et al., 2022; Le et al., 2022; Welleck et al., 2022).
Madaan et al. (2023); Shinn et al. (2023); Paul et al. (2024) introduce mechanisms for LLMs to produce self-reflective feedback to improve their outputs. Along with model-generated feedback, Chen et al. (2023) uses execution results to help improve code generation. Likewise, Kim et al. (2023) introduces a critic step to improve the model’s performance on computer tasks. These approaches follow left-to-right linear processes, potentially overlooking alternative directions. In our work, each thought node can have multiple child nodes, allowing broader exploration and more comprehensive decision-making.
Graph Reasoning.
To facilitate broader exploration in problem-solving, Yao et al. (2023a) and Xie et al. (2023) use a tree-search procedure where each node represents a partial solution, so that a complete solution requires combining multiple nodes. This design prohibits modifying intermediate nodes, making the final output reliant on the initially generated candidates. Besta et al. (2024) proposed a graph-based paradigm that models LLM reasoning as an arbitrary graph, allowing combinations of connecting nodes. Our approach differs by permitting review and modification of intermediate nodes, even allowing them to be revised or expanded if initially complete. This flexibility improves expressivity and enables language models to correct initial mistakes. Several concurrent works Hui and Tu (2024); Tian et al. (2024); Chen et al. (2024) have recently explored integrating Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs). However, these approaches primarily focus on mathematical reasoning tasks or rely on external feedback, fine-tuned policies, or reward models. In contrast, our method operates entirely at inference time, requiring no additional model training or external feedback.
LM Planning.
Long-form generation and complex problem-solving often require high-level planning or outlining. Natural language outliners and structured schemas play integral roles in generating long-form content (Tian and Peng, 2022; Mirowski et al., 2022; Yang et al., 2022, 2023). There are also works that utilize LLMs to tackle complex tasks such as video games, fact-checking, housekeeping, and code optimization with planning using natural languages (Yao et al., 2023b; Huang et al., 2022a; Wang et al., 2023b; Huang et al., 2022b). Our work could also be seen as a generic task planner using LLMs that leverages Monte Carlo Tree Search to facilitate various tasks in diverse domains.
## 3 Method
We treat each formal output of LMs as a thought node $x \in \{x^0, x^1, \dots, x^i\}$, where $x^0$ is the root node, i.e., the initial output provided by the LM given the task instruction $I$. For instance, a thought node can be a few outline items (Story Outline Improvement), a couple of words (Mini-Crosswords), or a sentence (Constrained Generation). To process thought nodes and search for a better output, our method consists of three modules: the thought evaluator, the thought generator, and the decision simulator.
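For intuition, a thought node can be sketched as a small data structure. This is our own illustration with hypothetical field names, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThoughtNode:
    """One complete candidate output (an outline, crossword fill, or sentence)."""
    solution: str                                 # the node's full candidate output
    parent: Optional["ThoughtNode"] = None
    children: List["ThoughtNode"] = field(default_factory=list)
    reward: float = 0.0                           # cumulative reward from backpropagation
    visits: int = 0                               # visit count n(x) used by the search

    def add_child(self, solution: str) -> "ThoughtNode":
        child = ThoughtNode(solution=solution, parent=self)
        self.children.append(child)
        return child

# x^0: the root node, i.e., the LM's initial output for instruction I
root = ThoughtNode(solution="The bartender inserts a tomato into the boat.")
```

Unlike a node in ToT, each node here holds a complete candidate solution, so any node can later be revised rather than only extended.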
<details>
<summary>x2.png Details</summary>

### Visual Description
## Flowchart: Story Outline Evaluation and Revision
### Overview
The image is a flowchart diagram illustrating a process for evaluating and revising a story outline to increase its "interestingness." It begins with an initial outline, which is then analyzed by an "Itemized Evaluation" system (represented by a robot icon). This evaluation identifies specific weaknesses, leading to three distinct revised outlines, each addressing a different critique and achieving a higher interestingness score.
### Components/Axes
The diagram is structured hierarchically with connecting lines indicating flow.
1. **Instruction Box (Top-Left):**
* **Text:** "Instruction: You are a popular novel writer, and are now making an interesting outline for the story. You know how to engage with the readers by not limited to introducing captivating characters and unexpected twist."
2. **Initial Outline Box (Top-Center):**
* A large, light-gray rounded rectangle containing a numbered list.
* **Content:**
1. Jill's friends encourage her to ignore her mother's remarks.
2. Jill and Molly experiment with Jill's morphing ability, trying out different transformations.
3. Jill and Molly realize the potential importance of Jill's morphing ability and discuss how it can be used.
4. Jill and Molly decide to keep Jill's ability a secret and come up with a plan to use it to their advantage.
* **Attached Label (Bottom-Right of box):** A yellow rectangle with the text "Interestingness 5/10".
3. **Central Evaluation Node (Center):**
* A blue rectangle with the text "Itemized Evaluation".
* A small robot icon is positioned to its right.
4. **Critique Labels (Below Central Node):**
* Three text labels branch down from the central node, each identifying a specific flaw in the initial outline.
* **Left:** "[2]-[3] Lack of conflicts."
* **Center:** "[3]-[4] Lack of suspense."
* **Right:** "[1]-[2] Lack of character development..."
5. **Revised Outline Boxes (Bottom Row):**
* Three light-gray rounded rectangles, each corresponding to a critique above it. They contain revised numbered lists. Changes from the original are highlighted in **green text**.
* **Left Box (Addresses "Lack of conflicts"):**
* **Content:**
1. Jill's friends encourage her to ignore her mother's remarks.
2. **Jill accidentally morphs into a dangerous creature while experimenting with her ability, causing tension between her and Molly.**
3. **Jill and Molly's friendship is tested as they grapple with the consequences of Jill's ability and struggle to find a way to control it.**
4. Jill and Molly decide to keep Jill's ability a secret and come up with a plan to use it to their advantage.
* **Attached Label (Bottom-Right):** Yellow rectangle with "Interestingness 8/10".
* **Center Box (Addresses "Lack of suspense"):**
* **Content:**
1. Jill's friends encourage her to ignore her mother's remarks.
2. Jill and Molly experiment with Jill's morphing ability, trying out different transformations.
3. **Jill and Molly find out a dangerous secret related to the morphing ability.**
4. **Jill and Molly must now navigate a web of lies and betrayal as they try to protect themselves from those who seek to exploit Jill's power.**
* **Attached Label (Bottom-Right):** Yellow rectangle with "Interestingness 7/10".
* **Right Box (Addresses "Lack of character development..."):**
* **Content:** This box is mostly obscured by a stack of placeholder cards. The visible top card contains only an ellipsis ("..."). A small yellow rectangle with "..." is attached to its bottom-right corner, implying an interestingness score is present but not shown.
### Detailed Analysis
The diagram presents a clear before-and-after analysis of a narrative outline.
* **Initial State:** The starting outline (Interestingness: 5/10) is a straightforward, low-conflict sequence where two characters discover and plan to use a secret ability.
* **Evaluation Process:** The "Itemized Evaluation" system diagnoses three specific narrative weaknesses between numbered points in the initial outline:
* Between points 2 & 3: Lack of conflicts.
* Between points 3 & 4: Lack of suspense.
* Between points 1 & 2: Lack of character development.
* **Revised States:** Each revision directly targets one critique:
1. **Conflict Revision (Score: 8/10):** Introduces an accidental dangerous transformation (point 2) and tests the friendship (point 3), creating direct interpersonal and internal conflict.
2. **Suspense Revision (Score: 7/10):** Introduces a "dangerous secret" (point 3) and external antagonists seeking to exploit the power (point 4), raising stakes and creating narrative suspense.
3. **Character Development Revision (Score: Not shown):** The content is not visible, but the critique suggests the revisions would focus on deepening the characters' backgrounds, motivations, or relationships in the early points of the outline.
### Key Observations
* **Score Progression:** Addressing narrative flaws leads to higher interestingness scores. The "Lack of conflicts" revision yields the highest visible score (8/10), suggesting conflict is weighted heavily in this evaluation model.
* **Targeted Revisions:** Each revised outline only modifies the specific points implicated by the critique, leaving other points unchanged. This demonstrates a modular approach to story editing.
* **Visual Coding:** The use of **green text** in the revised outlines clearly highlights the exact lines that were altered to address the critique.
* **Spatial Layout:** The flow is top-down: Instruction -> Initial Outline -> Evaluation -> Critiques -> Revised Outlines. The legend (Interestingness scores) is consistently placed at the bottom-right of each outline box.
### Interpretation
This diagram models a structured, analytical approach to creative writing. It treats a story outline as a system that can be debugged and optimized by identifying and patching specific narrative deficiencies (conflict, suspense, character development). The "Itemized Evaluation" acts as an automated or systematic critic.
The underlying principle is that reader engagement ("interestingness") is not a vague quality but a direct result of employing specific narrative techniques. The process demonstrates that:
1. **Conflict is crucial:** Introducing accidental danger and interpersonal tension provided the greatest boost to the score.
2. **Stakes matter:** Adding external threats and secrets increases suspense and engagement.
3. **Modular improvement is possible:** A weak outline can be strengthened by surgically enhancing its components rather than starting from scratch.
The obscured third branch implies the process is repeatable for various narrative elements. The diagram serves as both a flowchart for a specific revision task and a general metaphor for using systematic critique to enhance creative work.
</details>
Figure 2: Illustration of our Story Outline Improvement task (Section 4.1). A step involves employing the thought evaluator to conduct itemized evaluations of the story outline and utilizing the thought generator to generate a candidate set of improved story outlines.
### 3.1 Thought Evaluator
The thought evaluator evaluates the status of each thought node and provides feedback for potential improvement. It not only works as a heuristic for the search algorithm but also gives potential directions and guidance to generate new candidates.
Feedback $f(x^i)$ for a node $x^i$ consists of numerical feedback $f_{\text{numeric}}(x^i)$ and natural language feedback $f_{NL}(x^i)$. The numerical feedback is used as the evaluation score $v(x^i)$ for the current node, and the natural language feedback is used as context to generate child nodes.
$$
f(x^i)=\langle f_{NL}(x^i),\, f_{\text{numeric}}(x^i)\rangle,\qquad f_{\text{numeric}}(x^i)=v(x^i) \tag{1}
$$
We present two types of natural language feedback, each beneficial for various task scenarios. Furthermore, these strategies are flexible, allowing for independent or combined utilization.
- Holistic Evaluation: Evaluate the entire thought node as a unified whole to provide comprehensive feedback. This approach captures the core message and coherence of the node. The right side of the zoomed-in expansion phase in Figure 1 illustrates how the thought evaluator generates holistic feedback based on the task description and the current solution of the node.
- Itemized Evaluation: Evaluate each sub-unit of the thought node individually, providing targeted feedback for each component. This method results in a list of feedback specific to each sub-unit, making it ideal when the thought node can be divided into distinct elements for localized evaluation. For instance, in the story outline task shown in Figure 2, breaking the outline into separate items allows for focused assessment and refinement.
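As a concrete illustration of paired numerical and natural-language feedback, here is a minimal deterministic stand-in for the thought evaluator on the Constrained Generation example from Figure 1. In the paper the evaluator is typically an LLM; the function `evaluate_thought` and its signature are our own sketch:

```python
from typing import List, Tuple

def evaluate_thought(solution: str, required_concepts: List[str]) -> Tuple[float, str]:
    """Return (f_numeric, f_NL) for a Constrained Generation thought node.

    The numeric score is the fraction of required concepts covered; the
    natural-language feedback names the missing concepts so the thought
    generator can target them in the next revision.
    """
    text = solution.lower()
    missing = [c for c in required_concepts if c not in text]
    score = 1.0 - len(missing) / len(required_concepts)
    if missing:
        feedback = "Missing the concepts: " + ", ".join(missing) + "."
    else:
        feedback = "All required concepts are covered."
    return score, feedback

concepts = ["bartender", "tomato", "spatula", "boat", "microphone", "vest", "into"]
score, fb = evaluate_thought(
    "The bartender inserts a tomato into the boat using a spatula.", concepts
)
```

Here the score plays the role of $f_{\text{numeric}}$ in Eqn 1 and the feedback string plays the role of $f_{NL}$; an LLM evaluator would additionally flag coherence issues such as "It's weird to insert a tomato into a boat."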
### 3.2 Thought Generator
Once we have evaluation feedback for the current node, we can form subsequent thought nodes that aim to improve the current output. Based on the task description $I$, the current solution $x_{\text{parent}}$, and the natural language feedback $f_{NL}$ provided by the thought evaluator, each thought node generates $k$ candidate thought nodes using a pre-trained LM with parameters $\theta$.
A child node $x_{\text{child}}$ is generated as follows:
$$
x_{\text{child}} \sim p_\theta(x \mid I, x_{\text{parent}}, f_{NL}(x_{\text{parent}})) \tag{3}
$$
The left part of the zoomed-in expansion phase depicted in Figure 1 illustrates how ThoughtSculpt leverages the task description, current solution, and evaluation feedback to produce a set of candidate nodes.
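A minimal sketch of this sampling step, where `sample_lm` is a hypothetical stand-in for one temperature-sampled call to $p_\theta$ (in practice, a chat-completion request):

```python
from typing import Callable, List

def generate_children(
    instruction: str,
    parent_solution: str,
    nl_feedback: str,
    k: int,
    sample_lm: Callable[[str], str],
) -> List[str]:
    """Sample k candidate revisions x_child ~ p_theta(x | I, x_parent, f_NL)."""
    prompt = (
        f"Task: {instruction}\n"
        f"Current solution: {parent_solution}\n"
        f"Feedback: {nl_feedback}\n"
        "Can you provide a revised solution?"
    )
    # One independent sample per child; temperature > 0 yields diverse candidates.
    return [sample_lm(prompt) for _ in range(k)]
```

Because each candidate conditions on the parent's full solution plus feedback, a child may revise earlier parts of the output rather than only appending to it.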
### 3.3 Decision Simulator
ThoughtSculpt is equipped with a decision simulator that simulates decisions at deeper layers and then backpropagates to update the score of the current decision. In other words, we perform a rollout to obtain a better estimate of the reward at the current node. The behavior of the decision simulator is analogous to the processes in Monte Carlo Tree Search (MCTS; see Algorithm 1). It is possible to replace the decision simulator with other search algorithms such as DFS, BFS, or A* search (and we in fact also run DFS in our experiments in Section 4), but MCTS provides a computational advantage by efficiently navigating complex search spaces, balancing exploration and exploitation to reach high-quality solutions with fewer evaluations. Its incremental and iterative nature also scales well to large problem instances.
MCTS explores potential moves and stores the outcomes in a search tree. With each search iteration, the tree expands, accumulating more information. As shown in Figure 1, MCTS can be divided into four phases: selection, expansion, simulation, and backpropagation.
In the selection phase, a leaf node is selected based on Upper Confidence Bound 1 (UCB1; Eqn 4), which prioritizes nodes that have not been explored extensively but show promise. The UCB1 value of a node $x$ takes into account not only the heuristic score $v(x)$ but also the total number of visits to the node itself, $n(x)$, as well as to its parent node, $n(x_{\text{parent}})$.
$$
UCB1(x)=v(x)+c\sqrt{\frac{\ln n(x_{\text{parent}})}{n(x)}} \tag{4}
$$
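Eqn 4 translates directly into code. Returning infinity for unvisited nodes, so that every child is tried at least once, is a standard MCTS convention we add here rather than a detail stated in the text:

```python
import math

def ucb1(v: float, n: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCB1 value of a node (Eqn 4): heuristic score plus an exploration bonus.

    v        -- heuristic score v(x) of the node
    n        -- visit count n(x) of the node
    n_parent -- visit count n(x_parent) of its parent
    c        -- exploration constant balancing exploration vs. exploitation
    """
    if n == 0:
        return math.inf  # force at least one visit to every child
    return v + c * math.sqrt(math.log(n_parent) / n)
```

Larger $c$ favors rarely visited nodes; smaller $c$ favors nodes that already score well under the heuristic.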
In the expansion phase, the thought generator expands the selected leaf node by generating a set of child nodes based on the feedback provided by the thought evaluator.
In the simulation phase, a child node is picked uniformly at random from the newly generated set. In subsequent iterations, however, we generate only a single node at a time until the maximum simulation depth $d_{\text{simulation}}$ is reached.
Finally, in the backpropagation phase, we propagate the reward of the last node generated in the simulation back to the root node, and we iterate this whole process for $d_{\text{rollout}}$ steps. The node with the highest average reward is chosen as the final output.
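Putting the four phases together, the decision simulator can be sketched end-to-end as follows. This is a simplified, runnable illustration, not the paper's implementation: `evaluate` and `generate` are stand-ins for the (LLM-based) thought evaluator and thought generator, and the node bookkeeping is our own:

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    solution: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    reward: float = 0.0   # cumulative reward accumulated during backpropagation
    visits: int = 0

def mcts(root_solution, evaluate, generate, k=3, d_simulation=2, d_rollout=10,
         c=math.sqrt(2)):
    """Run d_rollout MCTS iterations over thought nodes; return the best solution.

    evaluate(solution) -> (score, nl_feedback) and
    generate(solution, nl_feedback) -> new_solution
    stand in for the thought evaluator and thought generator.
    """
    root = Node(root_solution)
    for _ in range(d_rollout):
        # Selection: descend by UCB1 until reaching a leaf node.
        node = root
        while node.children:
            node = max(
                node.children,
                key=lambda ch: math.inf if ch.visits == 0
                else ch.reward / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits),
            )
        # Expansion: generate k children conditioned on the evaluator's feedback.
        _, feedback = evaluate(node.solution)
        for _ in range(k):
            node.children.append(Node(generate(node.solution, feedback), parent=node))
        # Simulation: pick one child uniformly, then revise it for d_simulation steps.
        sim = random.choice(node.children)
        solution = sim.solution
        for _ in range(d_simulation):
            _, fb = evaluate(solution)
            solution = generate(solution, fb)
        score, _ = evaluate(solution)
        # Backpropagation: push the simulated reward back up to the root.
        while sim is not None:
            sim.visits += 1
            sim.reward += score
            sim = sim.parent
    # Final output: the child with the highest average reward.
    best = max(root.children, key=lambda ch: ch.reward / max(ch.visits, 1))
    return best.solution
```

Because every node holds a complete candidate, the rollout revises full solutions rather than extending partial ones, which is what distinguishes this loop from a ToT-style search.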
## 4 Experiments
We evaluate our method on three distinct tasks: Story Outline Improvement, Mini-Crossword Solving, and Constrained Generation.
We evaluate the tasks with Chain-of-Thought (CoT) (Wei et al., 2023), Self-Refine (Madaan et al., 2023), and Tree-of-Thoughts (ToT) with DFS (Yao et al., 2023a) as baselines. We use GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) (OpenAI, 2024) as strong base LMs for the reasoning algorithms across all tasks. Both base LMs use a temperature of 0.7. To further evaluate the efficacy of our proposed approach, we conduct an ablation study investigating the performance of our method when employing Depth-First Search (DFS) (Algorithm 2) as an alternative search algorithm to MCTS. In addition, running ThoughtSculpt with DFS facilitates closer comparison with ToT, which also uses DFS. While ThoughtSculpt with MCTS typically performs better, we observe in our experiments below that ThoughtSculpt with DFS still outperforms our other baselines, demonstrating ThoughtSculpt's ability to generalize to other search algorithms.
### 4.1 Story Outline Improvement
| Methods | GPT-3.5 | GPT-4 |
| --- | --- | --- |
| Initial Outline | 12.0 | 12.0 |
| CoT | 50.1 | 28.8 |
| Self-Refine | 65.5 | 27.9 |
| ToT | 72.1 | 49.9 |
| ThoughtSculpt (DFS) | 79.3 | 53.7 |
| ThoughtSculpt (MCTS) | 89.9 | 65.0 |
Table 1: Average outline interestingness. Initial Outline is the starting point before rewriting with any reasoning method. ThoughtSculpt's outputs are judged to be interesting at a higher percentage compared to baselines.
One approach to generating long-form stories via LLMs is to adopt a high-level writing process that first designs an outline of the story and then fills in the details based on that outline (Yang et al., 2022, 2023). An unengaging or uncompelling outline is unlikely to yield a captivating final draft, regardless of the subsequent detailing efforts. To address this challenge, we propose a task focused specifically on enhancing the interestingness of story outlines generated by LLMs.
Task Setup
We sample 500 book descriptions from the WhatsThatBook dataset (Lin et al., 2023) and generate story outlines using DOC (Yang et al., 2023) with GPT-3.5, allocating 400 descriptions for training, 50 for validation, and 50 for testing. For each description, we generate three types of outlines: one prompted to be interesting, one prompted to be boring, and one without specific instructions. Since there is no ground truth for outline interestingness, we employ an outline content evaluator to assess the final interestingness of generated or revised outlines; neither ThoughtSculpt nor the baselines have access to this evaluator during outline generation. We fine-tune the pre-trained Flan-T5 model (Chung et al., 2022) to serve as the content evaluator, training it on the 400 interesting and 400 non-interesting training outlines (with 50 of each held out for validation) to rate interesting outlines as 1 and boring ones as 0. This evaluator's output serves as the score metric for the task. For evaluation, the LMs revise and improve the interestingness of the 50 default outlines in the test set.
We also conduct human evaluation via Prolific to assess the generated story outlines (with GPT-3.5 as the base LLM), capturing subjective perceptions and cultural nuances that LLMs may miss. We recruited annotators to evaluate 100 pairs of story outlines, each pair consisting of one outline generated by ThoughtSculpt with MCTS and another by ToT or Self-Refine, with each pair annotated by two annotators.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Grouped Bar Chart: Preference Proportions for ThoughtSculpt (MCTS) vs. Baselines
### Overview
Grouped bar chart showing "Proportion of Preference" (0–60) for two pairwise comparisons. vs. Self-Refine: ThoughtSculpt (MCTS) ~63, Baselines ~12, Neither ~25. vs. ToT: ThoughtSculpt (MCTS) ~49, Baselines ~25, Neither ~18. ThoughtSculpt (MCTS) is preferred in both comparisons; its lead is largest against Self-Refine, and the ToT baseline fares better than Self-Refine.
</details>
Figure 3: Proportion of outlines generated by each method that were preferred by humans in pairwise comparison. ("Neither" indicates that neither ThoughtSculpt nor the baseline method was preferred.)
**Method Setup**
Each method is allowed to search or iterate to a maximum depth of 3. The thought evaluator performs an itemized evaluation of the current outline and provides an interestingness score from 1 to 10 as numerical feedback. Based on each item of feedback, a child node is proposed that modifies the current outline to improve its interestingness. For ThoughtSculpt and ToT, each node generates a maximum of 3 candidate child outlines. In this and all experiments below, ThoughtSculpt with MCTS uses a maximum $d_{simulation}$ of 1. Figure 2 illustrates how the story outline is improved.
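The setup above can be sketched in code. This is a minimal, hypothetical illustration of one expansion step, not the authors' implementation: `llm_evaluate` and `llm_revise` are stand-in stubs for the LLM prompts, and only the stated hyperparameters (max depth 3, up to 3 children per node, 1–10 interestingness scores) are taken from the paper.

```python
# Sketch of one node expansion in the outline-improvement search.
# llm_evaluate / llm_revise are hypothetical stubs wrapping LLM prompts.
from dataclasses import dataclass, field

MAX_DEPTH = 3     # maximum search / iteration depth (from the paper)
MAX_CHILDREN = 3  # candidate child outlines per node (from the paper)

@dataclass
class Node:
    outline: str
    feedback: str = ""
    score: float = 0.0          # evaluator score, normalized to [0, 1]
    depth: int = 0
    children: list = field(default_factory=list)

def llm_evaluate(outline: str) -> tuple[str, float]:
    """Hypothetical stub: itemized feedback plus a 1-10 interestingness score."""
    return "Chapter 2 is predictable; raise the stakes.", 6.0

def llm_revise(outline: str, feedback: str) -> str:
    """Hypothetical stub: revise the outline according to one feedback item."""
    return outline + " [revised per feedback]"

def expand(node: Node) -> list[Node]:
    """Evaluate a node, then propose up to MAX_CHILDREN revised outlines."""
    node.feedback, raw = llm_evaluate(node.outline)
    node.score = raw / 10.0
    if node.depth >= MAX_DEPTH:
        return []
    node.children = [
        Node(llm_revise(node.outline, node.feedback), depth=node.depth + 1)
        for _ in range(MAX_CHILDREN)
    ]
    return node.children
```

A search algorithm (MCTS or DFS) would then repeatedly select a node, call `expand`, and back up the normalized scores.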
<details>
<summary>x4.png Details</summary>

### Visual Description
Line chart of "Interestingness" (0–1) versus "Number of Steps" (0–3), with error bars, for ThoughtSculpt (MCTS), ThoughtSculpt (DFS), Self-Refine, and ToT. All methods start near 0.1 and rise sharply to 0.65–0.75 at step 1. ThoughtSculpt (MCTS) climbs steadily to ~0.9 by step 3; ThoughtSculpt (DFS) plateaus at ~0.8 after step 2; Self-Refine declines to ~0.65; ToT dips at step 2 before recovering to ~0.72.
</details>
Figure 4: Average outline interestingness at each step. ThoughtSculpt’s interestingness increases more with steps than the baselines’.
<details>
<summary>x5.png Details</summary>

### Visual Description
Workflow diagram for solving a 5x5 mini crossword. The instruction and an initial board (column 1 pre-filled with C-R-E-S-T, row 4 with SIEVE) feed into an LLM reasoning step that proposes answers with justifications (e.g., h1 "Is able: C____" → CANST; v2 "True being: __I__" → OUSIA), applies a "Holistic Evaluation", and outputs a candidate set of words together with multiple candidate board states.
</details>
Figure 5: Illustration of a step in the deliberation process in the Mini-Crosswords task (Section 4.2), where the current crossword board is assessed using the thought evaluator and a candidate set of words is proposed. One step corresponds to one $d_{rollout}$.
**Results**
As illustrated in Table 1, all methods unsurprisingly improve interestingness relative to the initial outline (sampled from default outlines with no prompting toward either interesting or boring). Overall, however, ThoughtSculpt outperforms ToT even with DFS, while ThoughtSculpt with MCTS demonstrates the highest average interestingness across both GPT-3.5 and GPT-4, at 89.9 and 65.0 respectively. One possible explanation for the lower overall interestingness with GPT-4 as the base LM is that the outline content evaluator was trained on outlines generated using GPT-3.5. As shown in Figure 3, human annotators also preferred ThoughtSculpt with MCTS outputs over the other baselines, agreeing with the automatic evaluation. Moreover, our strong performance comes at only a modest increase in computational cost relative to baselines. We report the average token cost of ThoughtSculpt for this task, along with the other tasks, in Appendix B. ThoughtSculpt with DFS has a cost comparable to ToT, while the higher-performing ThoughtSculpt with MCTS requires 1.2x more computation than ToT due to its additional decision-simulation process.
**Continuous Improvement**
Figure 4 illustrates the progression of story outline interestingness across steps, using GPT-3.5 as the base LM. Among the tested methods, only ThoughtSculpt with MCTS exhibits a consistent pattern of improvement over time. In contrast, both ToT and Self-refine fail to improve continuously. We attribute this to Self-refine’s limited search space and ToT’s lack of a revision process.
| Method | GPT-3.5 Word | GPT-3.5 Letter | GPT-3.5 Game | GPT-4 Word | GPT-4 Letter | GPT-4 Game |
| --- | --- | --- | --- | --- | --- | --- |
| CoT | 10.5 | 34.6 | 0.0 | 15.6 | 40.6 | 5.0 |
| Self-refine | 13.5 | 27.4 | 5.0 | 46.5 | 74.8 | 5.0 |
| ToT | 19.5 | 36.6 | 0.0 | 39.5 | 64.8 | 5.0 |
| ThoughtSculpt (DFS) | 14.0 | 33.2 | 0.0 | 46.5 | 68.2 | 20.0 |
| ThoughtSculpt (MCTS) | 19.0 | 41.6 | 0.0 | 54.0 | 74.0 | 25.0 |
Table 2: Mini-crossword results over 20 puzzles for ThoughtSculpt and baselines (success % of letters, words, and games). ThoughtSculpt with MCTS is either best or closely comparable to best across the board.
### 4.2 Mini-Crosswords
We also evaluate our method on 5x5 mini crosswords, following the setup of Yao et al. (2023a). Every puzzle has five horizontal (h1 to h5) and five vertical (v1 to v5) words to fill. The task is to solve the puzzle in several steps, where either filling or editing a word counts as one step. For evaluation, we check the proportion of letters, words, and games correctly filled by each reasoning method.
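The three metrics above can be computed straightforwardly. The sketch below is an illustrative scoring function under the assumption that each board is represented as five 5-character row strings; it is not the authors' evaluation code.

```python
# Sketch of the three mini-crossword metrics: % of correct letters,
# correct words (5 across + 5 down), and fully solved games.
# A grid is a list of five 5-character strings (rows).
def words_of(grid):
    """All ten words of a grid: five rows (h1-h5) then five columns (v1-v5)."""
    return list(grid) + ["".join(col) for col in zip(*grid)]

def crossword_metrics(predictions, solutions):
    letters = words = games = 0
    total_letters = total_words = 0
    for pred, gold in zip(predictions, solutions):
        # Letter-level: compare all 25 cells position by position.
        letters += sum(p == g for pr, gr in zip(pred, gold)
                       for p, g in zip(pr, gr))
        total_letters += 25
        # Word-level: compare the ten slot words.
        pw, gw = words_of(pred), words_of(gold)
        words += sum(p == g for p, g in zip(pw, gw))
        total_words += 10
        # Game-level: the whole board must match.
        games += int(pred == gold)
    n = len(solutions)
    return (100 * letters / total_letters,
            100 * words / total_words,
            100 * games / n)
```

Note that one wrong cell costs two words (the crossing row and column), which is why word success rates sit well below letter success rates in Table 2.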
**Method Setup**
Each thought node represents a (possibly partial) solution to the crossword puzzle. To evaluate a thought node, the LM is prompted to check each clue against the filled-in letters and judge whether a reasonable answer remains possible. For example, if the first row is filled with "AMIGO" and nothing else, the first column is shown as "A____"; the prompt then contains the line "v1. A Mennonite sect, named for Jacob Ammann: A____", asking the LM to determine whether potential answers exist. The node-evaluation prompt is similar to that of Yao et al. (2023a), except that we use the evaluation feedback to generate new candidates instead of pruning branches. Based on this feedback, each candidate for a node either suggests a new word to fill a blank slot or proposes a modification to a word already filled in. For each node, ThoughtSculpt and ToT generate a maximum of 3 candidates. In contrast to the setup in Yao et al. (2023a), where the maximum number of search steps is 100, we restrict all methods to 20 search steps. This constraint prevents artificially boosting performance by exhaustively trying numerous word possibilities: within the allocated budget, each row or column allows only two word attempts on average. Figure 5 illustrates how ThoughtSculpt approaches solving a crossword puzzle.
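Rendering a partial board into the per-clue patterns fed to the evaluator can be sketched as follows; the board encoding (rows of characters with "_" for blanks) is an assumption for illustration.

```python
# Sketch: derive the h1-h5 / v1-v5 patterns shown in the evaluator
# prompt from a partially filled 5x5 board ("_" marks an empty cell).
def slot_patterns(board):
    """Return a dict of slot name -> pattern string for a 5x5 board."""
    patterns = {}
    for i, row in enumerate(board, start=1):   # horizontal slots
        patterns[f"h{i}"] = "".join(row)
    for j in range(5):                          # vertical slots
        patterns[f"v{j+1}"] = "".join(board[i][j] for i in range(5))
    return patterns
```

Each pattern is then paired with its clue in the prompt, e.g. `f'v1. {clue}: {patterns["v1"]}'`, so the LM judges every slot against the current letter constraints.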
**Results**
As shown in Table 2, ThoughtSculpt with MCTS attains the highest letter success rate using GPT-3.5 and the highest word and game success rates using GPT-4; it is also at least comparable to the best in all other cases. With limited search steps, it is surprising that ToT using GPT-4 performs worse than even Self-refine; evidently a self-revision mechanism is important in this task. ThoughtSculpt with MCTS achieves performance comparable to that reported for ToT (Yao et al., 2023a) with 100 search steps, despite using just 20 search steps in our experiment.
### 4.3 Constrained Generation
CommonGen is a constrained text generation benchmark designed to evaluate LMs’ abilities in generative commonsense reasoning (Lin et al., 2020). An example instruction for the task is shown in Appendix A.3. However, the coverage test of the original CommonGen can now be completed with 90% or higher accuracy by many LLMs with one-shot prompting. We therefore instead test on CommonGen-Hard, introduced by Madaan et al. (2023). Rather than just four concepts, CommonGen-Hard requires models to generate a sentence incorporating 20–30 concepts.
**Method Setup**
In this task, we first provide the set of required concepts and the task description, from which the LM generates an initial thought node. During thought evaluation, the LM is prompted to give feedback on the quality of the concepts used and on any missing concepts. A child node is then generated from this feedback together with the current solution. We set a maximum depth of 3 for this task. For each node, both ThoughtSculpt and ToT generate a maximum of 3 child candidates.
| Method | GPT-3.5 | GPT-4 |
| --- | --- | --- |
| CoT | 44.1 | 96.1 |
| Self-refine | 70.0 | 98.5 |
| ToT | 54.8 | 98.8 |
| ThoughtSculpt (DFS) | 79.6 | 99.1 |
| ThoughtSculpt (MCTS) | 77.9 | 99.0 |
Table 3: Constrained Generation results (% coverage of concepts). ThoughtSculpt outperforms all baselines on both base LMs.
**Results**
Table 3 shows that ThoughtSculpt outperforms all baselines when using either GPT-3.5 or GPT-4 as the base LM. While ThoughtSculpt with DFS achieves the highest coverage at 79.6% (GPT-3.5) and 99.1% (GPT-4), ThoughtSculpt with MCTS demonstrates comparable concept coverage of 77.9% using GPT-3.5 and 99.0% using GPT-4. While MCTS exhibits notable exploration capabilities, it fails to surpass DFS because of the task’s nature: effective solutions are abundant as long as generated sentences correctly integrate the assigned concepts. DFS, with its greedy prioritization of nodes with the highest concept coverage, therefore outperforms MCTS in this setting. However, concept coverage alone does not ensure appropriate concept usage. Hence, we conduct an additional evaluation using GPT-4 to determine the preferred output based on both concept coverage and appropriateness. Figure 6, comparing ThoughtSculpt with MCTS against ThoughtSculpt with DFS and a third "neither" option (intuitively, the case where neither version’s output is good), indicates that ThoughtSculpt with MCTS is significantly favored.
## 5 Conclusion
We introduce ThoughtSculpt, a framework designed to empower LLMs to handle complex tasks requiring continuous refinement and reasoning, without necessitating any modifications or updates to the underlying model.
By harnessing Monte Carlo Tree Search (MCTS), ThoughtSculpt enables LLMs to effectively explore vast search spaces while managing computational cost. Moreover, ThoughtSculpt facilitates a seamless self-revision process, allowing LLMs to iteratively refine and improve their outputs without extensive prompt engineering. Our experiments illustrate ThoughtSculpt’s potential across diverse tasks, highlighting its versatility and broad applicability. The results underscore ThoughtSculpt’s capacity to enhance LLM performance on challenges requiring continuous thought iteration, such as open-ended generation, multi-step reasoning, and creative ideation.
## Limitations
While ThoughtSculpt presents a promising approach for reasoning during inference, its reliance on multiple calls to the base language model incurs a higher computational cost than most sampling methods. Consequently, in scenarios where base language models already demonstrate satisfactory performance, adopting ThoughtSculpt may not be advisable. However, ThoughtSculpt proves beneficial for tasks requiring intricate reasoning or continual improvement, or when the base language model’s performance is suboptimal. Furthermore, the incorporation of MCTS enables ThoughtSculpt to navigate complex search spaces, striking a balance between exploration and exploitation and handling scalability concerns, thereby offering computational advantages over alternative search algorithms.
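The exploration–exploitation balance referenced here is conventionally governed by the UCT selection rule in MCTS. This excerpt does not state ThoughtSculpt's exact selection formula, so the following textbook form is given only as the standard reference point:

```latex
a^{*} \;=\; \arg\max_{a \in \mathcal{A}(s)} \left[\, Q(s,a) \;+\; c \,\sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right]
```

Here $Q(s,a)$ is the mean evaluator score of child $a$ of node $s$, $N(s)$ and $N(s,a)$ are visit counts, and the constant $c$ trades off exploiting high-scoring revisions against exploring rarely visited ones.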
## Ethics Statement
We affirm that all datasets utilized in our experiments have been appropriately sourced and cited, adhering to principles of academic integrity and proper attribution.
Our experiments primarily leverage GPT-3.5 and GPT-4 as the base LLMs. These models possess remarkable capabilities in generating human-like text based on prompts. However, we acknowledge the ethical concerns surrounding their potential misuse for spreading misinformation, generating harmful content, or impersonating individuals. We recognize the imperative for ethical considerations to include robust mechanisms aimed at preventing misuse and fostering responsible use of these models.
The purpose of ThoughtSculpt is to enhance the reasoning and complex problem-solving capabilities of language models (LMs). However, it is essential to acknowledge that ThoughtSculpt does not inherently include mechanisms to prevent LMs from generating harmful content. Therefore, we strongly advise anyone utilizing our method to exercise caution and be mindful of the potential for misuse. Users must take proactive measures to mitigate the risk of harmful content generation by implementing effective safeguards and appropriate controls.
## Reproducibility
In our experiments, we aim for transparency and reproducibility by utilizing publicly accessible datasets. Furthermore, for the content evaluator utilized in the story outline improvement task, we employed Flan-T5, an open-source model. To facilitate reproducibility, our codebase will also be made available for reference and validation upon publication. However, as we access GPT-3.5 and GPT-4 through the OpenAI API, we acknowledge that reproducibility may be affected subject to OpenAI changing their API.
## References
- Anthropic (2024) Anthropic. 2024. [link].
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint, arXiv:2204.05862.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Preprint, arXiv:2308.09687.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
- Chen et al. (2024) Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024. Alphamath almost zero: Process supervision without process. Preprint, arXiv:2405.03553.
- Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. Preprint, arXiv:2304.05128.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. Preprint, arXiv:2210.11416.
- Elgohary et al. (2021) Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. 2021. Nl-edit: Correcting semantic parse errors through natural language interaction. Preprint, arXiv:2103.14540.
- Huang et al. (2022a) Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022a. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Preprint, arXiv:2201.07207.
- Huang et al. (2022b) Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2022b. Inner monologue: Embodied reasoning through planning with language models. Preprint, arXiv:2207.05608.
- Hui and Tu (2024) Wenyang Hui and Kewei Tu. 2024. Rot: Enhancing large language models with reflection on search trees. Preprint, arXiv:2404.05449.
- Kim et al. (2023) Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. Preprint, arXiv:2303.17491.
- Le et al. (2022) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Preprint, arXiv:2207.01780.
- Lin et al. (2020) Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2020. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online. Association for Computational Linguistics.
- Lin et al. (2023) Kevin Lin, Kyle Lo, Joseph E. Gonzalez, and Dan Klein. 2023. Decomposing complex queries for tip-of-the-tongue retrieval. Preprint, arXiv:2305.15053.
- Liu et al. (2022) Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and Yejin Choi. 2022. Rainier: Reinforced knowledge introspector for commonsense question answering. Preprint, arXiv:2210.03078.
- Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: Nlg evaluation using gpt-4 with better human alignment. Preprint, arXiv:2303.16634.
- Lu et al. (2022) Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced unlearning. Preprint, arXiv:2205.13636.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint. ArXiv:2303.17651 [cs].
- Mirowski et al. (2022) Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. 2022. Co-writing screenplays and theatre scripts with language models: An evaluation by industry professionals. Preprint, arXiv:2209.14958.
- OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Paul et al. (2024) Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2024. REFINER: Reasoning Feedback on Intermediate Representations. arXiv preprint. ArXiv:2304.01904 [cs].
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint. ArXiv:2303.11366 [cs].
- Tandon et al. (2022) Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. Preprint, arXiv:2112.09737.
- Tian et al. (2024) Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. 2024. Toward self-improvement of llms via imagination, searching, and criticizing. Preprint, arXiv:2404.12253.
- Tian and Peng (2022) Yufei Tian and Nanyun Peng. 2022. Zero-shot sonnet generation with discourse-level planning and aesthetics features. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3587–3597, Seattle, United States. Association for Computational Linguistics.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
- Wang et al. (2023a) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023a. Self-consistency improves chain of thought reasoning in language models. Preprint, arXiv:2203.11171.
- Wang et al. (2023b) Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023b. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. Preprint, arXiv:2302.01560.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
- Welleck et al. (2022) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. Preprint, arXiv:2211.00053.
- Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. 2023. Self-evaluation guided beam search for reasoning. Preprint, arXiv:2305.00633.
- Yang et al. (2023) Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian. 2023. DOC: Improving Long Story Coherence With Detailed Outline Control. arXiv preprint. ArXiv:2212.10077 [cs].
- Yang et al. (2022) Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. 2022. Re3: Generating Longer Stories With Recursive Reprompting and Revision. arXiv preprint. ArXiv:2210.06774 [cs].
- Yao et al. (2023a) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint. ArXiv:2305.10601 [cs].
- Yao et al. (2023b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.
## Appendix A Prompts
Generally, ThoughtSculpt requires only three prompts: TASK_DESCRIPTION, NEW_CANDIDATE, and EVALUATE_CURRENT.
1. TASK_DESCRIPTION is the general instruction for the specific task. It is placed in front of the rest of the prompts.
1. NEW_CANDIDATE is the prompt to generate new candidates based on the evaluation feedback and the current solution.
1. EVALUATE_CURRENT instructs the language model to evaluate the current solution. The prompt can be tailored to ask for itemized evaluations, holistic evaluations, or both.
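As a rough illustration of how the three prompts compose, here is a minimal sketch; the assembly logic and the `build_prompt` helper are our own assumptions (the template text is abridged from the Story Outline task below), not the paper's code:

```python
# Sketch of how the three prompt templates compose per task.
# build_prompt and the placeholder fields are hypothetical helpers.
TASK_DESCRIPTION = (
    "# Task Description\n"
    "You are a popular novel writer making an interesting outline.\n"
)

NEW_CANDIDATE = TASK_DESCRIPTION + (
    "# Original Outline\n{outline}\n"
    "# Feedback\n{feedback}\n"
    "Based on the feedback, make a better story outline.\n"
)

EVALUATE_CURRENT = TASK_DESCRIPTION + (
    "# Original Outline\n{outline}\n"
    "Score this outline from 1 to 100 and explain strengths and weaknesses.\n"
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a template's placeholders from the current search state."""
    return template.format(**fields)
```

TASK_DESCRIPTION prefixes every prompt, so the model always sees the task framing before the state-specific content.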
### A.1 Task 1 Story Outline Improvement
TASK_DESCRIPTION = """ # Task Description You are a popular novel writer. You are now making an interesting outline for the story. You know how to engage with the readers by not limited to introducing interesting characters and unexpected twist. You also know how to make the story outline coherent and consistent. """
NEW_CANDIDATE = TASK_DESCRIPTION + """ # Original Outline outline
# Feedback feedback
Based on the feedback and the task description, can you make a better story outline by replacing the items suggested by the feedback?
Write the outline in this format just like the original outline from [1] to [num]: [1] … [2] … …
# Your response: """
EVALUATE_CURRENT = TASK_DESCRIPTION + """ # Original Outline outline
Do you think that this outline is good enough? Write a score from 1 to 100 where 100 means the outline is perfect based on the task description, and provide an explanation on strengths and weaknesses. Please be specific. # Write in this format: [score: 1-100] [reason] xxx (50 words max)
# Example: [score: 50] [reason] the current outline is too predictable
# Your response: """
EVALUATE_CURRENT_ITEMIZED = TASK_DESCRIPTION + """ Here is a story outline. outline Which continuous num_consecutive_lines outlines items do you think are least interesting? The interesting outline items should engage readers to read the story. Otherwise, it’s boring and should be revised. The interesting level would be from 1 to 5, where 1 is the least interesting and 5 is the most interesting.
Write in this format: Thought Process: … [reason: too repetitive/cliche plot/unsurprising/etc] [start_index]-[end_index] [interesting level: 1-10]
Example: Thought Process: Outline items 9 and 10 talks about the same thing over outline items 7 and 8. It’s too repetitive. [reason: too repetitive] [9]-[10] [interesting level: 5]
Can you provide num_candidates proposals?
# Your response: """
#### Fine-tuned story outline content evaluator
The content evaluator for story outlines is a Flan-T5 model fine-tuned with a learning rate of 3e-4 and a weight decay of 0.1, trained over 5 epochs.
PROMPT = """ Given the story outlines above, do you think that the new story point below is interesting? """
### A.2 Task 2 Mini-Crossword Solving
TASK_DESCRIPTION = """ Task Description: Let’s play a 5 x 5 mini crossword, where each word should have exactly 5 letters. Your goal is to fill in the crossword with words based on the hints provided. """
NEW_CANDIDATE = TASK_DESCRIPTION + """ #Current board: obs
#Strategy: feedback
Given the current status of the board and the strategy, list all possible answers for unfilled or changed words, along with your confidence levels (certain/high/medium/low). Use "certain" cautiously and only when you are 100% sure. Use the format like this:
h1. [hint: _____] xxxxx (medium) h2. [hint: _____] xxxxx (certain) … v1. [hint: _____] xxxxx (high) …
Write your response in the format: h1. [A financial loss; a negative profit; to remove bits from: D_B__] DEBTS (low) h2. [Fatuous; empty headed: _____] INANE (high) … v1. [A dice player; something that cuts into small cubes: _____] DICER (high) v5. [An Indian tent: _____] TEPEE (medium)
Each line can only have one candidate answer. #Your response: """
EVALUATE_CURRENT = TASK_DESCRIPTION + """ # Current board: obs Evaluate the current board and provide a strategy on how to continue to fill in the blank or correct potential mistakes. Write your response in the format: v1. [reasoning and potential answers] v2. [reasoning and potential answers] … h1. [reasoning and potential answers] … # Example: v2. [Current answer: tough; since the filled in h1. is debit; e is conflicted with t, we could consider other options such as ENURE] v3. [Current answer: ??? CUTUP could be a potential answer] # Your response: """
### A.3 Task 3 Constrained Generation
TASK_DESCRIPTION = """ # Instruction Given several concepts (i.e., nouns or verbs), write a short and simple sentence that contains *all* the required words. The sentence should describe a common scene in daily life, and the concepts should be used in a natural way. # # Examples # ## Example 1 - Concepts: "dog, frisbee, catch, throw" - Sentence: The dog catches the frisbee when the boy throws it into the air. # ## Example 2 - Concepts: "apple, place, tree, pick" - Sentence: A girl picks some apples from a tree and places them into her basket. """ INSTRUCTION = """ Your Task - Concepts: concepts """ NEW_CANDIDATE = TASK_DESCRIPTION + """ Instruction: instruct
Here is a proposed sentence. solution
Here is the feedback of outline item. feedback
Based on the feedback, can you make a revised solution? # Sentence: """ EVALUATE_CURRENT = TASK_DESCRIPTION + """ Instruction: instruct
Here is a proposed sentence. solution
Do you think that the proposed sentence is good enough? Write "no need to improve" if you think 1) the sentence covers all the concepts listed in the instruction; and 2) the sentence describes a common scene in daily life.
Otherwise, write "still need to improve" and provide a reason.
# Write in this format: [No need to improve/still need to improve] [reason] xxx (50 words max)
# Example 1: [still need to improve] the sentence misses the concept "dog", "ladder", and "drum". # Example 2: [still need to improve] the cat does not fly.
# Your response: """
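The concept-coverage criterion that EVALUATE_CURRENT asks the LLM to judge can also be approximated programmatically; the sketch below is our own simplification (crude prefix matching in place of proper lemmatization), not the paper's evaluator:

```python
def missing_concepts(sentence: str, concepts: list[str]) -> list[str]:
    """Return the concepts whose word stem does not appear in the sentence.
    Prefix matching on the first 4 letters crudely handles inflections
    (e.g. 'catch' matches 'catches'); a real check would lemmatize."""
    words = sentence.lower().split()
    missing = []
    for c in concepts:
        stem = c.lower()[:4]
        if not any(w.startswith(stem) for w in words):
            missing.append(c)
    return missing

# Example 1 from the task description covers all four concepts:
assert missing_concepts(
    "The dog catches the frisbee when the boy throws it.",
    ["dog", "frisbee", "catch", "throw"]) == []
```

An empty return value corresponds to the "sentence covers all the concepts" branch of the evaluation prompt.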
## Appendix B Computation Efficiency
Table 4, Table 5, and Table 6 show the estimated input/output token usage and the cost of completing one case. ThoughtSculpt with DFS has a cost comparable to ToT, while ThoughtSculpt with MCTS requires greater computation since it performs an additional decision-simulation process.
| Method | Tokens (input/output) | Cost per case |
| --- | --- | --- |
| ToT | 10.1k / 4.9k | $0.248 |
| ThoughtSculpt with DFS | 11.3k / 4.6k | $0.251 |
| ThoughtSculpt with MCTS | 25.0k / 9.9k | $0.547 |
Table 4: Token use and estimated cost for Story Outline Improvement (Base LLM: gpt-4-0125-preview)
| Method | Tokens (input/output) | Cost per case |
| --- | --- | --- |
| ToT | 64.5k / 8.9k | $0.912 |
| ThoughtSculpt with DFS | 41.6k / 7.1k | $0.629 |
| ThoughtSculpt with MCTS | 100.2k / 16.3k | $1.491 |
Table 5: Token use and estimated cost for Mini-Crossword (Base LLM: gpt-4-0125-preview)
| Method | Tokens (input/output) | Cost per case |
| --- | --- | --- |
| ToT | 7.1k / 1.1k | $0.104 |
| ThoughtSculpt with DFS | 7.0k / 0.7k | $0.091 |
| ThoughtSculpt with MCTS | 15.7k / 2.0k | $0.217 |
Table 6: Token use and estimated cost for Constrained Generation (Base LLM: gpt-4-0125-preview)
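The per-case costs in Tables 4-6 are consistent with gpt-4-0125-preview pricing of $0.01 per 1k input tokens and $0.03 per 1k output tokens; these rates are our inference from the table values, not stated in the paper. A quick sanity check:

```python
# Assumed gpt-4-0125-preview pricing, inferred from the tables above.
PRICE_IN, PRICE_OUT = 0.01, 0.03  # USD per 1k tokens

def case_cost(input_k: float, output_k: float) -> float:
    """Estimated USD cost of one case, with token counts in thousands."""
    return round(input_k * PRICE_IN + output_k * PRICE_OUT, 3)

# Table 4 (Story Outline Improvement): ToT and ThoughtSculpt with MCTS.
assert case_cost(10.1, 4.9) == 0.248
assert case_cost(25.0, 9.9) == 0.547
```

The same formula reproduces every entry in Tables 5 and 6 as well.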
## Appendix C Alternative Search Algorithm
Algorithm 1: ThoughtSculpt with MCTS
1: Input: Initial node $x_0$
2: Output: Output node $x^*$
3: Initialize empty search tree $T$
4: for $j \leftarrow 1$ to $d_{rollout}$ do
5: Select a leaf node $x$ using the tree policy UCB1 (Eqn. 4)
6: Expand node $x$ by generating a set of children nodes $X_{child}$
7: $x \leftarrow uniformly\_sampled(X_{child})$
8: for $k \leftarrow 1$ to $d_{simulation}$ do
9: $x \leftarrow generate\_single\_child(x)$
10: end for
11: Evaluate reward $v(x)$
12: Propagate the reward $v$ and visit count $n$ back to $x_0$
13: end for
14: Choose the best node $x^*$ with the highest reward $v$
15: return $x^*$
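The loop above can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: `expand` and `evaluate` are hypothetical stand-ins for the NEW_CANDIDATE and EVALUATE_CURRENT LLM calls, and the UCB1 constant `c` is a common default.

```python
import math
import random

class Node:
    """A search-tree node holding one candidate solution state."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.value, self.visits = [], 0.0, 0

def ucb1(node, c=1.4):
    # Tree policy (Eqn. 4): balance mean reward against exploration.
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root, expand, evaluate, d_rollout=20, d_simulation=2, k=3):
    """Sketch of Algorithm 1. expand(state) -> child state; evaluate(state) -> reward."""
    for _ in range(d_rollout):
        # Selection: descend to a leaf via UCB1.
        x = root
        while x.children:
            x = max(x.children, key=ucb1)
        # Expansion: generate k children, sample one uniformly.
        x.children = [Node(expand(x.state), parent=x) for _ in range(k)]
        x = random.choice(x.children)
        # Simulation: roll out a short chain of single children.
        state = x.state
        for _ in range(d_simulation):
            state = expand(state)
        # Backpropagation: push reward and visit counts up to the root.
        v = evaluate(state)
        while x is not None:
            x.value += v
            x.visits += 1
            x = x.parent
    # Return the root child with the highest mean reward.
    return max(root.children, key=lambda n: n.value / max(n.visits, 1))
```

In practice the reward and the child states would come from the domain-specific LLM evaluator and candidate generator rather than the toy callables used here.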
Algorithm 2: ThoughtSculpt with DFS
1: Input: Initial node $x_0$, Depth $d$
2: Output: Goal node $x^*$
3: $x \leftarrow x_0$
4: if $d = 0$ then
5: return $x$
6: end if
7: Expand node $x$ by generating a set of children nodes $X_{child}$
8: for $k \leftarrow 1$ to $max\_candidates$ do
9: Evaluate reward $v(X_{child}[k])$
10: end for
11: Choose the node $x^*$ with the highest reward $v$ in $X_{child}$
12: return $DFS(x^*, d-1)$
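A minimal sketch of the greedy DFS variant, with the same hypothetical `expand`/`evaluate` stand-ins for the LLM calls:

```python
def dfs(x, d, expand, evaluate, max_candidates=3):
    """Sketch of Algorithm 2: expand max_candidates children at each depth
    and recurse into the highest-scoring one until depth d is exhausted."""
    if d == 0:
        return x
    children = [expand(x) for _ in range(max_candidates)]
    best = max(children, key=evaluate)
    return dfs(best, d - 1, expand, evaluate, max_candidates)
```

Unlike the MCTS variant, this commits to the best-scoring child at every depth and never revisits siblings, which is why its token cost in Appendix B is close to ToT's.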
## Appendix D Output Examples
### D.1 Story Outline Improvement
The examples below show the final outputs of each reasoning algorithm on the story outline improvement task. In general, ThoughtSculpt with MCTS produces a more complex story with a greater tendency to introduce twists and new characters, and its final output may be heavily modified from the initial outline. Self-Refine is least prone to change from the initial outline. ToT and ThoughtSculpt with DFS fall in between, tending to add interesting details to the initial outlines.
#### D.1.1 Example 1
Initial Outline:
1. Jack arrives in Maine and starts attending the same high school as Leo.
1. Jack begins observing Leo’s daily routine and interests to find common ground.
1. Jack strategically plans his first encounter with Leo to establish a connection.
1. Jack successfully befriends Leo and starts spending more time together.
1. Jack and Leo spend more time together, deepening their bond.
1. Jack and Leo’s attraction becomes undeniable, leading to their first kiss.
1. Jack and Leo navigate the challenges of their secret relationship.
1. Jack and Leo’s love for each other grows stronger, despite the risks involved.
1. The Center orders Jack to cut ties with Leo and leave Maine immediately.
1. Jack struggles with the decision to leave Leo and return to the Center.
1. Leo’s father becomes a target of the Center’s operation, putting Leo in danger.
1. Jack risks his own safety to protect Leo and his father, defying the Center’s orders.
Self Refine:
1. Jack arrives in Maine and starts attending the same high school as Leo.
1. Jack begins observing Leo’s daily routine and interests to find common ground.
1. Jack strategically plans his first encounter with Leo to establish a connection.
1. Jack successfully befriends Leo and starts spending more time together.
1. Jack and Leo spend more time together, deepening their bond.
1. Jack and Leo’s attraction becomes undeniable, leading to their first kiss.
1. Jack and Leo navigate the challenges of their secret relationship.
1. Jack and Leo’s love for each other grows stronger, despite the risks involved.
1. The Center orders Jack to cut ties with Leo and leave Maine immediately.
1. Jack struggles with the decision to leave Leo and return to the Center.
1. A mysterious organization threatens to expose Jack and Leo’s relationship, putting their lives in danger.
1. Jack must make a decision to protect Leo and himself, risking everything to defy both the Center and the new threat.
ToT:
1. Jack arrives in Maine and starts attending the same high school as Leo.
1. Jack is immediately drawn to Leo’s mysterious and aloof demeanor, sparking his curiosity.
1. Jack discovers Leo’s passion for art and secretly starts leaving anonymous gifts for him.
1. Leo becomes intrigued by the gifts and begins to search for the mysterious admirer.
1. Jack and Leo finally meet at a school art exhibition, where Leo realizes Jack’s connection to the gifts.
1. Their initial awkward encounter slowly blossoms into a genuine friendship as they bond over their shared love for art.
1. Jack and Leo’s friendship is tested when a scandal rocks their school, threatening to tear them apart.
1. Despite the challenges, Jack and Leo’s bond deepens as they confide in each other and support one another.
1. The Center, a secretive organization Jack is a part of, reveals their true intentions and orders Jack to complete a dangerous mission that puts Leo at risk.
1. Jack is torn between his loyalty to the Center and his love for Leo, struggling to find a way to protect both.
1. Leo’s life is put in grave danger when he uncovers a dark secret that implicates the Center in a series of crimes.
1. Jack makes the ultimate sacrifice to save Leo and expose the Center’s corrupt actions, putting his own life on the line for the person he loves.
ThoughtSculpt with DFS:
1. Jack arrives in Maine and starts attending the same high school as Leo.
1. Jack is immediately drawn to Leo, but soon realizes that Leo is hiding a dark secret.
1. Instead of confronting Leo right away, Jack decides to secretly follow him and discovers that Leo is actually working undercover to bring down the dangerous organization.
1. Jack is conflicted between his growing feelings for Leo and the danger of being involved in Leo’s mission.
1. Jack offers to help Leo in his mission, leading to a deepening of their bond as they face increasing risks together.
1. Their forbidden romance blooms amidst the chaos, complicating their mission and putting them both in jeopardy.
1. The organization catches wind of Jack and Leo’s alliance, putting both of their lives at risk.
1. Jack and Leo must make difficult choices to protect each other and navigate the treacherous waters of their relationship.
1. The Center threatens to harm Leo’s family if they don’t stop their investigation.
1. Jack and Leo are forced to go underground and make a dangerous deal to protect their loved ones.
1. The stakes are raised when Leo’s father is kidnapped, pushing Jack and Leo to their limits in a race against time to save him.
1. Jack and Leo face off against the leaders of the organization in a thrilling final showdown, risking everything to bring them down and secure a future together.
ThoughtSculpt with MCTS:
1. Jack arrives in Maine and starts attending the same high school as Leo.
1. Jack notices a mysterious new student, Alex, who seems to be watching Leo closely.
1. Jack becomes curious about Alex’s intentions and starts investigating their background.
1. Jack discovers that Alex is actually a spy from a rival organization called the Center targeting Leo.
1. Jack confronts Alex and warns Leo about the danger, leading to a confrontation between the three.
1. Jack, Leo, and Alex form an unlikely alliance to uncover the truth behind the Center’s motives.
1. Jack and Leo’s bond is tested as they navigate the complexities of trust and betrayal within their group.
1. The rival organization escalates their actions, putting Jack, Leo, and Alex in grave danger.
1. Leo sacrifices himself to protect Jack and defeat the Center, shocking everyone.
1. The truth behind the Center’s motives is revealed, leading to a deeper understanding of the sacrifices made.
1. Jack and Alex must work together to honor Leo’s sacrifice and take down the Center.
1. In a shocking revelation, it is uncovered that Leo’s twin brother, who was thought to be dead, is actually the mastermind behind the Center, creating a complex dynamic between the characters.
#### D.1.2 Example 2
Initial outline:
1. The Health Department wants to close the Boys Club due to infestation of rats and cockroaches.
1. Closing the Boys Club means the kids will be left without a place to go.
1. The kids are worried about being out on the street again.
1. Buddy suggests organizing a bake sale to raise funds.
1. Johnny proposes hosting a talent show as another fundraising idea.
1. Rich suggests reaching out to local businesses for donations or sponsorships.
1. Cool suggests organizing a community car wash to raise money.
1. Buddy Williams suggests asking the local pet store owner if he knows anything about the rich cat.
1. The kids visit the pet store and ask the owner about the rich cat.
1. The pet store owner tells the kids about a rumor he heard of a rich cat living in a mansion nearby.
1. The kids decide to investigate the mansion and see if they can find the rich cat.
1. The townspeople share their findings and clues with the kids, who compile the information and continue the search.
1. The kids and townspeople follow leads and investigate potential sightings of the rich cat.
1. The search intensifies as more people join in, creating a sense of community and determination to find the rich cat.
1. After an extensive search, the kids and townspeople finally locate the rich cat, who generously donates the money needed to save the Boys Club.
Self Refine:
1. The Health Department wants to close the Boys Club due to infestation of rats and cockroaches.
1. Closing the Boys Club means the kids will be left without a place to go.
1. The kids are worried about being out on the street again.
1. Buddy suggests organizing a bake sale to raise funds.
1. Johnny proposes hosting a talent show as another fundraising idea.
1. Rich suggests reaching out to local businesses for donations or sponsorships.
1. Cool suggests organizing a community car wash to raise money.
1. Buddy Williams suggests asking the local pet store owner if he knows anything about the rich cat.
1. The kids visit the pet store and ask the owner about the rich cat.
1. The pet store owner tells the kids about a rumor he heard of a rich cat living in a mansion nearby.
1. The kids decide to investigate the mansion and see if they can find the rich cat.
1. While searching the mansion, the kids stumble upon a hidden room filled with mysterious artifacts and clues.
1. The artifacts lead them to a map that hints at a hidden treasure in the town.
1. The kids and townspeople embark on a thrilling treasure hunt, following the map’s clues and facing challenges along the way.
1. After overcoming obstacles and working together, the kids and townspeople unearth the hidden treasure, which turns out to be the rich cat’s fortune that saves the Boys Club.
ToT:
1. The Health Department wants to close the Boys Club due to infestation of rats and cockroaches.
1. Closing the Boys Club means the kids will be left without a place to go.
1. The kids are worried about being out on the street again.
1. Buddy suggests organizing a bake sale to raise funds.
1. Johnny proposes hosting a talent show as another fundraising idea.
1. Rich suggests reaching out to local businesses for donations or sponsorships.
1. Cool suggests organizing a community car wash to raise money.
1. Buddy Williams suggests asking the local pet store owner if he knows anything about the rich cat.
1. The kids visit the pet store and ask the owner about the rich cat.
1. The pet store owner tells the kids about a rumor he heard of a rich cat living in a mansion nearby.
1. The kids decide to investigate the mansion and see if they can find the rich cat.
1. While investigating the mansion, the kids stumble upon a hidden room with a mysterious map that leads to a hidden treasure.
1. The townspeople share their findings and clues with the kids, who now have a new mission to find the hidden treasure.
1. The search for the hidden treasure leads the kids and townspeople to uncover secrets about the town’s history and a long-lost treasure that could save the Boys Club.
1. After a thrilling adventure and overcoming obstacles, the kids and townspeople finally find the hidden treasure, which not only saves the Boys Club but also brings the community closer together.
ThoughtSculpt with DFS:
1. The Health Department discovers that the Boys Club is in danger of losing its funding due to budget cuts.
1. The kids are devastated as they learn that the Boys Club may have to shut down if they can’t come up with a solution.
1. Buddy suggests organizing a scavenger hunt around town to raise funds and awareness for the Boys Club.
1. Johnny proposes creating a social media campaign to rally support and donations for the Boys Club.
1. Rich suggests hosting a charity concert with local musicians and performers to raise money.
1. Cool suggests organizing a charity auction with items donated by local businesses and community members.
1. The kids brainstorm and come up with a plan to combine all the fundraising ideas into one big event to save the Boys Club.
1. Buddy Williams suggests asking the local historian if he knows of any hidden treasures or artifacts that could help raise funds.
1. The kids visit the local historian and learn about a legendary treasure rumored to be hidden somewhere in town.
1. The historian gives the kids a map and clues to help them on their quest to find the treasure.
1. The kids embark on an exciting adventure to follow the map and solve the clues to locate the hidden treasure.
1. Along the way, they face challenges and obstacles that test their friendship and determination.
1. The townspeople join in the treasure hunt, adding to the excitement and suspense of the search.
1. After overcoming obstacles and following the clues, the kids and townspeople finally discover the hidden treasure, which turns out to be the key to saving the Boys Club.
1. The treasure not only provides the funds needed to keep the Boys Club open but also brings the community together in a heartwarming and unforgettable way.
ThoughtSculpt with MCTS:
1. The Health Department wants to close the Boys Club due to infestation of rats and cockroaches.
1. Closing the Boys Club means the kids will be left without a place to go.
1. The kids are worried about being out on the street again.
1. Buddy suggests organizing a scavenger hunt around town to raise funds.
1. Johnny proposes creating a community mural project as another fundraising idea.
1. Luna, a mysterious new girl in town, suggests hosting a talent show to raise money, revealing her own hidden talents in the process.
1. Rich suggests organizing a neighborhood cleanup event to raise money.
1. Cool suggests hosting a movie night in the park to raise funds.
1. Buddy Williams suggests asking the local historian if he knows anything about a hidden treasure in the town.
1. The kids visit the local historian and ask about the hidden treasure.
1. The historian tells the kids about a legend of a lost pirate treasure buried somewhere in the town.
1. Luna shares a personal connection to the treasure hunt, sparking curiosity and excitement among the group.
1. The kids face challenges and puzzles during their treasure hunt that test their problem-solving skills and teamwork.
1. In a surprising twist, the treasure turns out to be a time capsule filled with memories and mementos from the town’s past.
1. The kids realize that the true treasure was the journey itself and the bonds of friendship they formed along the way, teaching them the value of community and unity. Luna’s involvement adds depth to the story and enhances the theme of unity and friendship.
### D.2 Constrained Generation
The examples below show the final outputs of different reasoning strategies in Constrained Generation.
#### D.2.1 Example 1
Concepts: drum, clip, window, cover, graffiti, wave, vomit, hurdle, dye, sidewalk, animal, slice, swing, pottery, pineapple, fix, decorate, tire, tag, bicycle, watermelon, sign, mask, plant
Self-Refine
Children decorated pottery with vibrant dye near the graffiti-covered wall, while a masked animal rode a bicycle past a sign, waving as they passed by a tire swing hanging from a tree and a clip holding a sign on the sidewalk, avoiding a hurdle, slicing a pineapple and a watermelon to plant in the garden.
ToT
A child on a swing by the sidewalk waved at a bicyclist with a tire fixed and decorated with graffiti.
ThoughtSculpt with DFS
The children decorated the pottery with vibrant dye, while the adults fixed the tire on the bicycle near the sidewalk, as a group of friends played the drum and waved at passersby from the swing, with a clip holding a sign on the window, a slice of watermelon in hand, and a pineapple plant in the background.
ThoughtSculpt with MCTS
A person walks down the sidewalk past a graffiti-covered window, a pottery clip lies on the ground, a bicycle rider swerves to avoid a tire, and a sign is being fixed and decorated with dye next to a pineapple stand, as a drum beats in the background, a wave crashes on the shore, an animal scurries across the path, a slice of watermelon is enjoyed under the shade of a swing, a mask-wearing individual tags a plant with a new growth, and a hurdle is cleared by a runner as they vomit from exhaustion.
#### D.2.2 Example 2
Concepts: plant, word, remove, toddler, hamburger, potato, set, line, place, control, eat, banner, singe, hurdle, bed, bar, alley, stove, top, air, cowboy, sharpen, pair, beat, arena, wave, singer, jacket
Self-Refine
The toddler removes the banner from the line and places it on the top of the bar to singe while the cowboy sharpens a pair of knives to beat the hurdle in the arena.
ToT
The toddler eats a hamburger and potato at the place where the plant is growing, while the singer on stage waves under the banner to the cheering crowd, and the cowboy sharpens his pair of beat-up boots backstage.
ThoughtSculpt with DFS
The toddler eagerly eats a hamburger while the cowboy sharpens his pair of scissors at the plant nursery.
ThoughtSculpt with MCTS
The toddler removes a plant from the line, places it in the bed, eats a hamburger with a side of potato, then sets the table for a cowboy sharpening his pair of knives at the top of the bar, while a singer controls the banner in the arena, singing as the wave of music beats through the alley where a jacketed cowboy hurdles over a singe stove.
## Appendix E G-Eval on Story Outline Generation
We ran an additional evaluation using the G-Eval metric (Liu et al., 2023). We provide a definition prompt for interestingness, and the results indicate that ThoughtSculpt (MCTS) outperforms the other baselines when evaluated by either GPT-3.5 or GPT-4, consistent with our other metrics.
| | GPT-3.5 | GPT-4 |
| --- | --- | --- |
| Self-refine | 4.33 | 4.45 |
| ToT | 4.37 | 4.66 |
| ThoughtSculpt (DFS) | 4.47 | 4.71 |
| ThoughtSculpt (MCTS) | 4.60 | 4.73 |
Table 7: G-Eval Result (1-5 scale)
## Appendix F More search steps on Mini-Crosswords
We have shown that ThoughtSculpt (MCTS) achieves solid performance in only 20 search steps, but we also ran an extended number of search steps to match the experimental setup of Yao et al. (2023a).
| Methods (GPT-4) | % word | % letter | % game |
| --- | --- | --- | --- |
| ToT (20 search steps) | 39.5 | 64.8 | 5.0 |
| ToT (100 search steps) (Yao et al., 2023a) | 60 | 78 | 20 |
| ThoughtSculpt (MCTS) (20 search steps) | 54.0 | 74.0 | 25.0 |
| ThoughtSculpt (MCTS) (100 search steps) | 66.0 | 83.0 | 35.0 |
Table 8: Mini-crossword results on 20 puzzles (success % of words, letters, and games), comparing ToT and ThoughtSculpt (MCTS) at 20 and 100 search steps.
## Appendix G Constrained Generation LLM Evaluation
To evaluate the preferred output based on concept coverage and appropriateness, we conduct an additional assessment using GPT-4. We prompt GPT-4 to select the most preferred output, considering both its coverage of the required concepts and the naturalness with which they are used. Figure 6, which compares ThoughtSculpt with MCTS against ThoughtSculpt with DFS and a "neither is good" option (covering the case where neither version of ThoughtSculpt produces a satisfactory output), shows that ThoughtSculpt with MCTS is significantly favored.
![Figure 6 (GPT-3.5 panel)](attachment:placeholder_6ccdb21372559bf9ddbcd3805665161b9159d860fed0a215ab56a7834e9c7178.png)

Figure 6: bar chart of "% of outputs preferred" per method — "Neither is good" ≈8%, "Ours (DFS)" ≈32%, "Ours (MCTS)" ≈60%.
(a) Base LLM GPT-3.5
[Figure 6(b): bar chart of "% of outputs preferred" by method. "Neither is good" ≈15%; "Ours (DFS)" ≈18%; "Ours (MCTS)" ≈66%.]
(b) Base LLM GPT-4
Figure 6: GPT-4’s overall preference, based on concept coverage and appropriateness, over the final outputs for Constrained Generation. ThoughtSculpt with MCTS is preferred by a wide margin.
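The judging setup described above can be sketched as a pairwise-comparison prompt with a "neither" option. This is an illustrative sketch, not the paper's actual evaluation code: `build_preference_prompt`, `parse_choice`, and the mapping of the A/B slots to the two methods are our own assumptions.

```python
CHOICES = ("Ours (MCTS)", "Ours (DFS)", "Neither is good")

def build_preference_prompt(concepts, output_a, output_b):
    """Assemble a judge prompt comparing two outputs on concept coverage
    and appropriateness, allowing a 'neither' verdict."""
    return (
        "You are judging two texts that must cover these concepts: "
        + ", ".join(concepts) + "\n\n"
        + "Output A:\n" + output_a + "\n\n"
        + "Output B:\n" + output_b + "\n\n"
        + "Considering both concept coverage and how appropriately the "
          "concepts are used, answer exactly one of: 'A', 'B', or 'Neither'."
    )

def parse_choice(judge_reply):
    """Map the judge's raw reply onto one of the three chart categories.
    Here we assume slot A held the MCTS output and slot B the DFS output."""
    reply = judge_reply.strip().lower()
    if reply.startswith("a"):
        return "Ours (MCTS)"
    if reply.startswith("b"):
        return "Ours (DFS)"
    return "Neither is good"
```

In practice the slot order would be randomized per comparison to avoid position bias; the parser above assumes a fixed assignment purely for clarity.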
## Appendix H Game of 24
We additionally experiment with our method on the Game of 24, using the same test set provided by Yao et al. (2023a). Compared to Yao et al. (2023a)'s prompts, which include detailed few-shot examples for this task, ThoughtSculpt uses a much more general prompt. As illustrated in Table 9, ThoughtSculpt still outperforms ToT despite this setup, which might be expected to favor ToT.
| Method | Success % |
| --- | --- |
| CoT-SC (k=100) Yao et al. (2023a) | 9 |
| IO - Refine (k=10) Yao et al. (2023a) | 27 |
| ToT (b = 1) Yao et al. (2023a) | 45 |
| ToT (b = 5) Yao et al. (2023a) | 74 |
| ThoughtSculpt (MCTS) (b = 5) | 79 |
Table 9: Game of 24 results.
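Success on this task can be checked mechanically: a proposed expression must use each of the four given numbers exactly once and evaluate to 24. A minimal checker sketch (our own illustration, not the paper's evaluation code):

```python
import ast

def is_valid_24(expression, numbers):
    """Check a Game-of-24 answer: the expression must use exactly the
    given numbers (each once, in any order) and evaluate to 24."""
    tree = ast.parse(expression, mode="eval")
    # Collect the numeric literals actually used in the expression.
    used = sorted(
        node.value for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
    )
    if used != sorted(numbers):
        return False
    # Safe here because we only feed it arithmetic over numeric literals.
    value = eval(compile(tree, "<expr>", "eval"))
    return abs(value - 24) < 1e-6
```

For example, the solution format shown in the prompt below, `(1 + 2) * (2 * 4)`, passes this check for the inputs 1, 2, 2, 4.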
### H.1 Game of 24 Task prompts
````python
TASK_DESCRIPTION = """
Use the given four numbers and basic arithmetic operations (+ - * /) to obtain 24. You can use the numbers only once but you can use them in any order.
"""

SOLUTION_OUTPUT_FORMAT = """
# Think step by step first. Then, please output the solution in the following format (in a Python code block).
```python
(1 + 2) * (2 * 4)
```
# Your response.
"""

REVISE_SOLUTIONS = """
# Instruction
{instruction}

# Current Solution
{solution}

Calculate the result of the current solution. Do you think the solution is correct? If not, please provide feedback.

# Output format
```json
{{"calculation": "step by step calculation of the current solution", "result": int, "feedback": "Your feedback here", "correct": true/false}}
```
"""
````
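A caller of the revision prompt must recover the JSON object from the model's fenced reply before it can act on the feedback. A minimal sketch, with the field names taken from the prompt above but the function name and error handling our own:

```python
import json
import re

def parse_revision_feedback(llm_reply):
    """Extract the JSON object from a ```json-fenced model reply and
    return (correct, feedback), defaulting to a revision request if the
    reply cannot be parsed."""
    match = re.search(r"\{.*\}", llm_reply, re.DOTALL)
    if match is None:
        return False, "unparsable reply"
    data = json.loads(match.group(0))
    return bool(data.get("correct", False)), data.get("feedback", "")
```

A reply marked `"correct": true` would terminate revision for that node, while any other outcome feeds the `feedback` string back into the thought generator.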
## Appendix I Effectiveness of components
Our framework has two core components that use an LLM: the thought evaluator, which provides both numeric and language feedback for revision, and the thought generator, which generates subsequent nodes based on this feedback to improve outputs. Both are essential; removing either makes the method non-functional. We therefore conducted an additional ablation study by "weakening" these modules: we replaced GPT-4 with Llama3.1-8B-Instruct as the language model for either the thought evaluator or the thought generator in the Constrained Generation task. The results are summarized below:
| Method | Coverage % |
| --- | --- |
| ThoughtSculpt | 99.0 |
| ThoughtSculpt (weakened generator) | 98.5 |
| ThoughtSculpt (weakened evaluator) | 96.3 |
| ThoughtSculpt (weakened generator + evaluator) | 90.7 |

Table 10: Concept coverage when weakening ThoughtSculpt's components.
These results highlight the critical role of a strong evaluator in deep reasoning tasks, as it provides essential revision feedback that significantly enhances the generator’s effectiveness.
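The four ablation conditions amount to backing each component with a different model. A hypothetical configuration sketch (the class and model-name strings are illustrative, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class ThoughtSculptConfig:
    """Each LLM-backed component can be assigned its own model,
    letting the evaluator and generator be weakened independently."""
    generator_model: str = "gpt-4"
    evaluator_model: str = "gpt-4"

full = ThoughtSculptConfig()
weak_generator = ThoughtSculptConfig(generator_model="llama-3.1-8b-instruct")
weak_evaluator = ThoughtSculptConfig(evaluator_model="llama-3.1-8b-instruct")
weak_both = ThoughtSculptConfig(
    generator_model="llama-3.1-8b-instruct",
    evaluator_model="llama-3.1-8b-instruct",
)
```

Per Table 10, weakening the evaluator costs more coverage (96.3 vs. 98.5) than weakening the generator, consistent with the evaluator's feedback being the main driver of revision quality.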