# Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
**Authors**: Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, Ming Jin
Abstract
Current literature, aiming to surpass the “Chain-of-Thought” approach, often resorts to external modi operandi involving halting, modifying, and then resuming the generation process to boost Large Language Models’ (LLMs) reasoning capacities. Due to their myopic perspective, these methods escalate the number of query requests, leading to increased costs, memory, and computational overheads. Addressing this, we propose the Algorithm of Thoughts—a novel strategy that propels LLMs through algorithmic reasoning pathways. By employing algorithmic examples fully in-context, this overarching view of the whole process exploits the innate recurrence dynamics of LLMs, expanding their idea exploration with merely one or a few queries. Our technique outperforms earlier single-query methods and even more recent multi-query strategies that employ extensive tree-search algorithms, while using significantly fewer tokens. Intriguingly, our results suggest that instructing an LLM using an algorithm can lead to performance surpassing that of the algorithm itself, hinting at LLMs’ inherent ability to weave their intuition into optimized searches. We probe into the underpinnings of our method’s efficacy and its nuances in application. The code and related content can be found at algorithm-of-thoughts.github.io.
Decision-making, Large Language Models, Language Models, Machine Learning, Algorithmic Thinking
1 Introduction
<details>
<summary>2308.10379v3/x1.png Details</summary>

### Visual Description
The figure contrasts three prompting styles on the same game-of-24 instance. Each panel shows the model input (one in-context example followed by the test query ‘11 10 5 1’) and the resulting model output.

## Standard Prompting
### Model Input
```
User: 8 6 4 4
Assistant:
Answer: (4 + (8 - 6)) * 4 = 24.
User: 11 10 5 1
```
### Model Output
```
Assistant:
Answer: (11 - 1) * (10 - 5) = 24.
```
---
## Algorithm of Thoughts
### Model Input
```
User: 8 6 4 4
Assistant:
1. 4 - 4: (left: 8, 6, 0)
...
- 4 + 2: (left: 6, 4) 10, 2, 24 = 6 * 4 -> found it!
Answer: (4 + (8 - 6)) * 4 = 24.
User: 11 10 5 1
```
### Model Output
```
Assistant:
1. 11 - 1: (left: 10, 10, 5)
...
- 12 * 10: (left: 120, 5) 24 = 120 / 5 -> found it!
Answer: ((11 + 1) * 10) / 5 = 24.
```
---
## Chain-of-Thought
### Model Input
```
User: 8 6 4 4
Assistant:
8 - 6 = 2 (left: 4, 4, 2)
4 + 2 = 6 (left: 6, 4)
6 * 4 = 24
Answer: (4 + (8 - 6)) * 4 = 24.
User: 11 10 5 1
```
### Model Output
```
Assistant:
11 - 10 = 1 (left: 1, 1, 5)
1 * 1 = 1 (left: 1, 5)
5 * 1 = 5 (left: 5)
Answer: (((11 - 10) * 1) * 5) - 1 = 24.
```
</details>
Figure 1: Comparison between standard prompting, CoT, and AoT in the game of 24. While standard prompting aims for a direct answer, CoT sketches out the successive steps to the final solution. AoT’s in-context example, distinct from CoT, integrates the search process, highlighted by markers ‘1’,…, ‘3’ as “first operations” guiding subtree exploration for the problem set ‘8 6 4 4’. For clarity, only a single in-context example is displayed, with a focus on the third subtree exploration. AoT produces prospective search steps (e.g., the subtree exploration ‘5. $11+1$ ’) and evaluates potential subsequent steps to either progress towards a solution or retrace to another viable subtree.
Recent developments in large language models (Chowdhery et al., 2022; Thoppilan et al., 2022; Liu et al., 2023, inter alia) have spotlighted their efficacy in general problem solving (Huang & Chang, 2022; Suzgun et al., 2022), code generation (Chen et al., 2021; Austin et al., 2021), and instruction following (Ouyang et al., 2022; Bai et al., 2022). While early models relied on direct answer strategies (Brown et al., 2020), contemporary research has shifted towards linear reasoning paths (Wei et al., 2022b; Kojima et al., 2022; Zhang et al., 2022), breaking problems into sub-tasks for solution discovery, or towards harnessing external mechanisms that alter token generation by changing the context (Zhou et al., 2022a; Drozdov et al., 2022; Yao et al., 2023).
Analogous to human cognition (Sloman, 1996; Kahneman, 2011), early LLM strategies seemed to emulate the instantaneous System 1, characterized by its impulsive decision-making. In contrast, more recent methodologies like chain-of-thought (CoT) (Wei et al., 2022b) and least-to-most prompting (L2M) (Zhou et al., 2022a; Drozdov et al., 2022) reflect the analytical nature of System 2. Notably, integrating intermediary reasoning steps has yielded improvements in arithmetic reasoning tasks (Srivastava et al., 2022; Liang et al., 2022).
However, as tasks shift towards deeper planning and extensive thought exploration, these methods appear restrictive. Although CoT integrated with Self-Consistency (CoT-SC) (Wang et al., 2022) enlists multiple LLM outputs for a consensus, the lack of meticulous evaluation can result in model misdirection. The “Tree of Thoughts” (Yao et al., 2023; Long, 2023) emerges as a notable solution. While one LLM is dedicated to idea generation, another steps in to assess the merit of these ideas, following a halting-assessment-resuming cycle. This iterative process, based on a tree search, has shown marked effectiveness, especially in tasks with a breadth of continuations. We see this progression as akin to humans employing tools to circumvent working memory limitations, serving as an external augmentation for LLMs (Mialon et al., 2023; Sel et al., 2023; Gu et al., 2024a).
<details>
<summary>2308.10379v3/x2.png Details</summary>

### Visual Description
A comparative flowchart of four prompting strategies — Standard Prompting, Chain of Thoughts, Tree of Thoughts, and Algorithm of Thoughts — each flowing top-to-bottom from a blue “Input” oval to a green “Output” oval. Rectangles denote intermediate thoughts: green for promising paths, pink/red for unpromising ones. The branching structure grows progressively richer from the linear Standard Prompting to the hierarchical, backtracking exploration of Algorithm of Thoughts.
</details>
Figure 2: Illustration outlining various strategies for tackling reasoning problems with LLMs. Each box signifies a distinct thought, functioning as a unified string of words that forms an incremental pathway to reasoning. Green boxes indicate ideas deemed promising by the LLM, while red boxes represent less promising concepts.
On the flip side, this enhanced LLM approach is not without pitfalls. A prominent downside is the substantial surge in the number of queries and computational demands. Each query to online LLM APIs such as GPT-4—a focal point of our study—not only incurs a monetary expense (Chen et al., 2023) but also contributes to latency, a limitation especially critical in real-time applications. Cumulative delays from these queries can compromise solution efficiency. Infrastructure-wise, continuous interactions can stress systems, leading to potential bandwidth constraints and reduced model availability (Aminabadi et al., 2022). Moreover, the environmental implications cannot be ignored; incessant querying escalates the energy consumption of already power-hungry data centers, exacerbating the carbon footprint (Wu et al., 2022; Dhar, 2020; Khattar & Jin, 2023).
With this in mind, our goal was to dramatically reduce the query counts employed by contemporary multi-query reasoning methods while maintaining performance for tasks necessitating adept use of world knowledge, thereby steering a more responsible and proficient use of AI resources. Intriguingly, our aim has actually resulted in surpassing the performance of such techniques while requiring significantly fewer tokens for prompting and generation.
Reflecting on the evolution of LLMs from System 1 to System 2, an essential ingredient comes to light: algorithms (Sel et al., 2021; Al-Tawaha et al., 2023; Jin et al., 2023a; Gu et al., 2024b). Characterized by its methodical nature, the algorithmic perspective offers a path to keenly explore problem spaces, enact strategies, and formulate solutions (Al-Tawaha et al., 2021; Helie & Pizlo, 2022; Banerjee et al., 2022; Sel et al., 2022; Khattar et al., 2022). While much of the prevailing literature treats algorithms as external to LLMs (Lin et al., ), given LLMs’ inherent generative recurrence, can we channel this iterative logic to internalize an algorithm?
Drawing upon both the intricate nuances of human reasoning and the disciplined precision of algorithmic methodologies, our work aims to fuse these two elements to enhance reasoning capabilities within LLMs. Existing research underscores that humans, when navigating complex problems, instinctively draw upon past efforts, ensuring a comprehensive contemplation rather than a narrow focus (Monsell, 2003; Holyoak & Morrison, 2005; Baddeley, 2003). LLMs, with their generative span bounded only by token limits, appear poised to break through the barriers of human working memory. Spurred by this observation, we investigated if LLMs could mirror a similar layered exploration of ideas, referencing prior intermediate steps to sieve out infeasible options, all within their iterative generation cycle. And while humans excel with their intuitive insight, algorithms stand out with organized, systematic exploration. Current techniques, like CoT, often sidestep this synergistic potential, imposing undue pressure on LLMs for on-the-spot precision. By capitalizing on LLMs’ recursive capabilities, we emulate a hybrid human-algorithmic approach. This is achieved through our use of algorithmic examples that capture the essence of exploration, from initial candidates to validated solutions. Thus emerges our concept of the Algorithm of Thoughts (AoT), as illustrated in Figs. 1 and 2.
More broadly, our approach signifies a new paradigm of in-context learning. Instead of the traditional “supervised-learning” mold of [problem, solution] or [problem, successive steps to solution], we present a new structure that covers [problem, search process, solution]. Naturally, when instructing an LLM using an algorithm, the anticipation leans towards the LLM simply imitating the algorithm’s iterative thinking. However, what emerges as intriguing is the LLM’s ability to infuse its own “intuition” to achieve a search efficiency that even surpasses the algorithm itself (see Fig. 5).
In the subsequent sections, we first situate our work within the existing literature, followed by a discussion of our principal idea. We then present our experimental results and probe a series of hypotheses related to this emerging capability of LLMs before rounding off with a conclusion.
2 Related Work
Standard Prompting.
Also known as input-output prompting, it provides a few input-output examples of the task before getting an answer for the test sample from the language model (Brown et al., 2020). Although this method is very general and does not need any special prompting strategy, the performance is also worse compared to more advanced methods (Shao et al., 2023; Wei et al., 2022a; Lyu et al., 2023).
Chain-of-Thought.
In CoT, LLMs are presented with examples where a given question $x$ unfolds through a chain of intermediate reasoning pieces $c_{1},...,c_{n}$ to reach an answer $y$, represented as $x→ c_{1}→...→ c_{n}→ y$ (Wei et al., 2022b; Lyu et al., 2023). By mimicking the examples in the context, the LLM automatically divides the solution into simpler linear steps to arrive at the answer, improving performance across numerous reasoning benchmarks. Self-consistency (Wang et al., 2022) is a widely used decoding strategy that generates a variety of reasoning paths and chooses the final answer through a majority vote, though this necessitates additional generations. CoT can be further improved by integrating detailed algorithmic reasoning (Zhou et al., 2022b). We also utilize algorithmic examples in AoT; however, ours are intended to elicit the inherent heuristics of LLMs to lead the search rather than to follow a specified pseudocode, and they also apply to language tasks, e.g., creative writing. Contrary to CoT’s linear progression, our approach pivots towards the explorative aspect of LLMs. We reconceptualize the $c_{1},...,c_{n}$ sequence not merely as successive steps towards a solution, but as a dynamic, potentially mutable path that resembles an algorithmic search, allowing for exploration, recalibration, and non-linear progression.
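Concretely, the linear $x→ c_{1}→...→ c_{n}→ y$ structure shows up in how a CoT few-shot prompt is assembled. The sketch below (the helper name is ours, purely illustrative) builds such a prompt for the game of 24, reusing the in-context example shown in Fig. 1:

```python
# Minimal CoT-style few-shot prompt builder for the game of 24.
# The example text is taken from Fig. 1; the helper name is illustrative.
COT_EXAMPLE = """User: 8 6 4 4
Assistant:
8 - 6 = 2 (left: 4, 4, 2)
4 + 2 = 6 (left: 6, 4)
6 * 4 = 24
Answer: (4 + (8 - 6)) * 4 = 24."""

def cot_prompt(numbers: str) -> str:
    """Append the test query so the LLM continues the chain x -> c1 ... cn -> y."""
    return f"{COT_EXAMPLE}\nUser: {numbers}\nAssistant:"
```

The LLM is then expected to complete the trailing `Assistant:` turn with its own intermediate steps and final answer.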
Least-to-Most prompting (L2M).
Taking cues from educational psychology (Libby et al., 2008), L2M prompting directs the LLM to decompose the central problem into smaller subproblems. Each subproblem is tackled in sequence, with the outcome appended before progressing to the next (Zhou et al., 2022a; Drozdov et al., 2022). While this structured delineation is beneficial for broader generalization, it operates on the premise of finding a nearly perfect decomposition in a single attempt—ideal for problems with a clear-cut structure. Yet, when tasks intertwine with their decomposition complexities (like games of 24), this method’s inflexibility becomes apparent. Contrastingly, AoT not only underscores the active subproblem (as shown in Fig. 1), but also permits a more contemplative approach by entertaining various options for each subproblem, while maintaining efficacy even with minimal prompts.
Tree of Thoughts (ToT).
In the cases where each subproblem has multiple viable options to explore, the linear reasoning paths of CoT or L2M substantially limit coverage of the thought space. Considering the possible options for each subproblem, the decision tree can be explored by external tree-search mechanisms (e.g., BFS, DFS) (Yao et al., 2023; Jin et al., 2023b; Sel et al., 2024). The evaluation capabilities of LLMs can also direct the search by pruning hopeless nodes, increasing efficiency. However, ToT, due to its requirement for multiple queries to the LLM for a solution, demands significantly more computation than AoT. Additionally, it necessitates evaluating the potential of each search node in the in-context examples and writing specialized functions to extract information from model responses to maintain the tree structure externally. In stark contrast, AoT requires just a single prompt and no coding skills, greatly democratizing LLM use for complex problems.
3 Algorithm of Thoughts
Our strategy pivots on recognizing a core shortcoming of current in-context learning paradigms. CoT, while enhancing the coherency of thought linkages leading to solutions, occasionally falters, presenting incorrect intermediate steps (Zelikman et al., 2022; Turpin et al., 2023; Lanham et al., 2023). Faithful CoT (Lyu et al., 2023) seeks to amend this by eliciting symbolic chains of reasoning where the LLM’s output resembles task-specific pseudo-code, primed for deterministic execution, e.g., in Python. The intention is to use only the thought processes, not the inputs and outputs of each link, since the latter tend to be unreliable. But the occasional missteps of CoT may not necessarily be due to the LLM’s inability to compute correctly. The LLM, when confronted with questions that closely match the conditions of previous in-context examples, may favor echoing those outputs over generating the appropriate answers. To shed light on this phenomenon, we designed an experiment.
Querying text-davinci-003 for arithmetic tasks (e.g., ‘$11-2=$’), we prefixed them with multiple in-context equations converging to an identical output (e.g., ‘$15-5=10$, $8+2=10$’). Our results, presented in Fig. 3, reveal a steep decline in accuracy, suggesting that the mere presence of correct reasoning in the context might inadvertently compromise even basic arithmetic skills.
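The setup can be sketched as follows. The helper name and the all-subtraction equation pattern are ours for illustration (the paper's actual prompts mixed operations, e.g., ‘$15-5=10$, $8+2=10$’), and the LLM call itself is omitted:

```python
def make_biased_prompt(n_equations: int, shared_output: int = 10) -> str:
    """Build a query prefixed by n correct equations that all share one
    output, mimicking the setup behind Fig. 3 (illustrative pattern)."""
    lines = [f"{shared_output + k} - {k} = {shared_output}"
             for k in range(1, n_equations + 1)]
    lines.append("11 - 2 =")  # the correct completion is 9, not the repeated 10
    return "\n".join(lines)
```

Sweeping `n_equations` and measuring the probability the model assigns to the correct token ‘9’ yields the curve in Fig. 3.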
<details>
<summary>2308.10379v3/x3.png Details</summary>

### Visual Description
A line graph plotting the probability of the correct token (y-axis, 0.0–1.0) against the number of in-context equations (x-axis, 0.0–12.5). The green trend line starts near 0.95 with no in-context equations and declines steadily — with local fluctuations — to roughly 0.15 at 12.5 equations; a light-green shaded band marks the variability around the trend.
</details>
Figure 3: The probability of generating the correct token as we add more in-context examples that are correct but possess identical outputs.
To offset this bias, diversifying the outputs of examples might seem like a viable solution, but this could subtly skew the distribution of outputs. Merely adding unsuccessful trials, much like a random search, might inadvertently encourage the model to retry rather than truly solve. Capturing the true essence of algorithmic behavior, where both failed searches and subsequent recovering and learning from such attempts play a role, we incorporate in-context examples patterned after search algorithms, notably depth-first search (DFS) and breadth-first search (BFS). See Fig. 1 for an example.
This paper focuses on a broad class of tasks reminiscent of tree-search problems. These tasks necessitate breaking down the main problem, crafting feasible solutions for each segment, and making decisions on the paths to either pursue or forsake, with the option of reevaluating more promising segmentations. Rather than posing separate queries for every subset, we leverage the iterative capabilities of the LLM to address them in one unified generation sweep. By confining ourselves to one or two LLM interactions, this approach naturally incorporates insights from antecedent context candidates and tackles intricate issues requiring an in-depth exploration of the solution domain. We also give insights into how small or big those thoughts should be and what type of in-context examples should be given to the LLM to promote token efficiency. Subsequently, we outline key components of tree-search algorithms and their manifestation in our framework.
1. Dividing the search into steps.
Similar to creating step-by-step solutions in CoT or L2M, we also need to identify intermediate search layers. This is akin to creating examples for CoT, especially for tree-search problems, where the correct reasoning path resembles a CoT solution. The challenge lies in selecting the right chain from numerous candidates at each layer to reach the final answer. Thus, our focus will be more on generating the search process for in-context examples rather than how to solve each subproblem after selecting the next chain.
<details>
<summary>2308.10379v3/x4.png Details</summary>

### Visual Description
A text-completion diagram. The input context reads “The first five prime numbers:” and the model completes the list “2, 3, 5, 7, 11”. A call-out shows the probabilities for the first generated token: ‘2’ at 87.6% and ‘1’ at 12.3%, with an ellipsis for the remaining tokens.
</details>
Figure 4: An example highlighting the drawback of isolated sampling of sequenced ideas. Input is denoted in blue, with the text-davinci-003 providing the green completions.
2. Proposing Solutions to Subproblems.
A dominant approach in existing works involves direct sampling from LLM token output probabilities (Wang et al., 2022; Yao et al., 2023). Though effective for one-off answers (Kadavath et al., 2022), this method falls short in scenarios demanding a sequence of samples to be integrated or evaluated within subsequent prompts (Robinson & Wingate, 2022). To minimize model queries, we adopt an uninterrupted solution-creation process. Here, we directly and continuously generate solutions for the prevailing subproblem without any generation pauses.
The benefits are three-fold. First, with all generated solutions existing within a shared context, there is no need for individual model queries for each solution evaluation. Second, while it may seem counterintuitive, isolated token or token group probabilities might not always yield meaningful choices. A simple illustration is found in Fig. 4. When evaluated independently, the second-most probable token for our inaugural number is ‘ $1$ ’—not qualifying as prime. But, when generation remains unbroken, the derived sequence is correct. This incongruence points towards the restrictive nature of the Markov property in sequence modeling. Core to our perspective is the premise that for sequential tasks like algorithmic search, LLMs are more adept at generating entire sequences than intermittently pausing and re-initiating the token sampling process.
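The restrictive Markov-property point can be made concrete with a toy distribution (all probabilities below are made up, echoing the shape of Fig. 4): the token that is marginally most likely in isolation need not begin the most likely full sequence.

```python
# Toy sequence distribution (made-up numbers): compare the marginally most
# probable first token with the first token of the most probable sequence.
seq_probs = {
    ("1", "0"): 0.30,
    ("1", "1"): 0.05,
    ("1", "2"): 0.05,
    ("2", ","): 0.35,
    ("3", ","): 0.25,
}

# Marginal probability of each candidate first token.
first_token_marginal = {}
for (t0, _), p in seq_probs.items():
    first_token_marginal[t0] = first_token_marginal.get(t0, 0.0) + p

best_isolated = max(first_token_marginal, key=first_token_marginal.get)  # "1" (0.40)
best_sequence = max(seq_probs, key=seq_probs.get)                        # ("2", ",") (0.35)
```

Here isolated per-token selection picks ‘1’, yet the single most probable full sequence begins with ‘2’ — mirroring how unbroken generation recovers the correct prime sequence in Fig. 4.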
3. Evaluating the Promise of a Subproblem.
Existing techniques lean on additional prompting to discern the potential of tree nodes, aiding decisions regarding exploration direction. Our observations suggest that if the most promising routes are encapsulated within the in-context examples, LLMs inherently gravitate towards prioritizing those promising candidates. This diminishes the need for intricate prompt engineering and allows the incorporation of intricate heuristics, whether intuitive or knowledge-driven. Again, the absence of disjoint prompts in our approach allows for an immediate assessment of candidate viability in the same generation.
4. Backtracking to a More Promising Node.
The decision of which node to explore next (including retracing to a prior node) inherently depends on the selected tree-search algorithm. While previous studies (Yao et al., 2023) have employed external means, such as coded mechanisms for the search process, this restricts its broader appeal and entails additional customization. Our designs predominantly adopt a DFS approach supplemented by pruning. The aim is to maintain proximity between nodes sharing the same parent, thereby encouraging the LLM to prioritize local over distant features. Additionally, we present performance metrics for the AoT approach grounded in BFS. Our reliance on the model’s inherent capacity to glean insights from in-context examples obviates the necessity for additional mechanisms.
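The visit–prune–backtrack pattern that our DFS-based prompts exemplify can be sketched as code. To be clear, this externalized program is for illustration only — in AoT the trace is produced in-context by the LLM, not by such code — and all names below are ours:

```python
def dfs(state, successors, is_goal, promising, trace, depth=0):
    """Depth-first search with pruning. `trace` records visits, prunes, and
    backtracks -- the kind of transcript an AoT in-context example shows."""
    trace.append(("visit", depth, state))
    if is_goal(state):
        return state
    for child in successors(state):
        if not promising(child):                 # prune a hopeless branch
            trace.append(("prune", depth + 1, child))
            continue
        found = dfs(child, successors, is_goal, promising, trace, depth + 1)
        if found is not None:
            return found
    trace.append(("backtrack", depth, state))    # retrace to the parent node
    return None

# Toy instance: reach a subset sum of 14 from [3, 5, 9].
nums = [3, 5, 9]
def succ(s):
    i, total = s
    return [(i + 1, total), (i + 1, total + nums[i])] if i < len(nums) else []

trace = []
goal = dfs((0, 0), succ, lambda s: s[1] == 14, lambda s: s[1] <= 14, trace)
```

Keeping siblings adjacent in the emitted trace is exactly the locality that our prompt design encourages the LLM to exploit.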
Expressiveness of LLMs with AoT.
Recent works have investigated the expressivity of transformers with standard and CoT prompting (Chiang et al., 2023; Schuurmans, 2023; Merrill & Sabharwal, 2023; Feng et al., 2023). We provide the following theoretical result for AoT, which implies that AoT can tackle NP problems, extending beyond the P problems addressable by CoT.
**Corollary 3.1 (Informal)**
*Consider $\textsc{TIME}(a^{n})$ as the class of problems for which a Turing machine exists that operates within a time complexity of $\mathcal{O}(a^{n})$ for some $a \geq 1$. If a transformer can generate $a^{n}$ intermediate tokens to solve the problem when prompted by AoT, we have
$$
\textsc{TIME}(a^{n})\subseteq\textsc{AoT}(n), \tag{1}
$$
where $\textsc{AoT}(n)$ refers to the decoding steps by AoT when the input has $n$ tokens.*
The proof of the above corollary is given in the appendix.
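To make the corollary concrete, here is a sketched instance of ours (not stated in the paper). Deciding SAT on $n$ variables by exhaustive enumeration takes $\mathcal{O}(2^{n}\cdot\mathrm{poly}(n))\subseteq\mathcal{O}(3^{n})$ time, so taking $a=3$ the corollary yields

$$
\textsc{SAT}\in\textsc{TIME}(3^{n})\subseteq\textsc{AoT}(n),
$$

i.e., a transformer permitted $3^{n}$ intermediate decoding tokens under AoT prompting can, in principle, decide SAT.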
4 Experiments
We show that AoT surpasses the performance of other single-prompt methods (e.g., standard and CoT/-SC prompting) and even that of strategies utilizing external mechanisms, such as ToT, across the benchmarks we tested. We present the results for the creative writing task in the appendix. In addition, we show that AoT retains its advantage over standard prompting and CoT even after fine-tuning. This implies that the issue with LLMs is not simply a minor misalignment or a deficiency in domain expertise; rather, it underscores the necessity of AoT prompting, because the tasks we evaluated inherently demand a thorough exploration of solution paths, a requirement that goes beyond simple fine-tuning adjustments. To generate the in-context examples, we asked the authors to write down their search processes and randomly selected from that list; we then rewrote the selected examples in a simple, uniform structure to create our AoT prompts. More details on this process for each task are given in the AoT setup subsections.
4.1 Game of 24
The game of 24 is a mathematical card game in which players are given four numbers and must use addition, subtraction, multiplication, and division (each operation can be used more than once) to manipulate those numbers to reach a total of 24. For instance, for the numbers ‘ $8\>8\>5\>4$ ’, one solution could be ‘ $8*(5-(8/4))=24$ ’. At first glance, the game might appear straightforward. However, a cursory calculation suggests there are nearly 13,000 distinct expressions possible for any set of four numbers, making it a formidable challenge for present-day LLMs.
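Although the expression space is large, it is small enough to enumerate exhaustively; the following brute-force reference solver (ours, for illustration, not part of the paper's method) repeatedly replaces an ordered pair of numbers with the result of one operation until a single value remains:

```python
# A brute-force reference solver (ours, for illustration): combine two
# numbers at a time until one value remains, checking whether it is 24.
from itertools import permutations

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def solve24(nums, target=24, eps=1e-6):
    """Return one expression string evaluating to `target`, or None."""
    def search(vals, exprs):
        if len(vals) == 1:
            return exprs[0] if abs(vals[0] - target) < eps else None
        for i, j in permutations(range(len(vals)), 2):
            rest_v = [v for k, v in enumerate(vals) if k not in (i, j)]
            rest_e = [e for k, e in enumerate(exprs) if k not in (i, j)]
            for sym, op in OPS.items():
                if sym == "/" and abs(vals[j]) < eps:
                    continue  # skip division by zero
                found = search(rest_v + [op(vals[i], vals[j])],
                               rest_e + [f"({exprs[i]} {sym} {exprs[j]})"])
                if found:
                    return found
        return None
    return search([float(n) for n in nums], [str(n) for n in nums])
```

For example, `solve24([8, 8, 5, 4])` returns a valid expression, while unsolvable inputs such as `[1, 1, 1, 1]` return `None`.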
Task Setup.
Adhering to the setup detailed in (Yao et al., 2023), we use games from indices 901-1000, sourced from the 1362 games ranked by relative difficulty at 4nums.com. An attempt is considered successful if it is able to reach a total of 24 using the exact numbers provided and only the allowed operations.
Baselines.
Standard prompting and CoT are used in the 5-shot setting, with CoT integrating 3 steps for the operations. These methods are sampled 100 times, and the averaged success rates from these samples are reported. CoT-SC is also tested with 100 votes in our setup. For ToT, we use a breadth of 5.
AoT Setup.
We employ the same 5-shot setting as in the standard prompting and CoT baselines. Our in-context samples leverage a DFS-style search algorithm, the same version used when contrasting with traditional DFS in Fig. 5. During each subtree exploration, dubbed the ‘first step’ or ‘first operation’, we choose two numbers (illustrated by the selection of 8 and 6 in the third ‘first step’, i.e., the subtree labeled ‘3’, of Fig. 1) and a corresponding operation (e.g., $8-6$). This operation yields a new number, 2, leaving three numbers in total. A thorough combing of these three numbers culminates in 19 leaf nodes, all visible under the ‘3’ subtree in Fig. 1. To generate our in-context examples, we randomly selected games that do not appear at test time and asked the authors to write down the search steps they followed until arriving at the answers; these are exactly node-selection and node-expansion steps guided by the individuals' own heuristics. We then sampled randomly from these solutions and rewrote them in a simple, uniform structure. The exact prompts we use are given in the Prompts section under the ‘AoT (DFS)’ subsection in the appendix. We aim to assess two aspects: the LLM's ability to pinpoint promising first operations, which directly impacts the number of resolved leaf nodes, and its performance against a conventional DFS. As our method emphasizes sequential generation over trajectory sampling, we operate with a temperature setting of 0.
Results.
From Table 1, it is evident that standard prompting combined with CoT/-SC significantly lags behind tree-search methods when used with LLMs. The “I/O + Refine” result, showing a 27% success rate, is referenced from (Yao et al., 2023); this method involves iteratively asking the LLM (up to 10 iterations) to refine its answer if the initial one is incorrect. Meanwhile, ToT is limited to a maximum of 100 node visits, translating to several hundred LLM queries for each example. Remarkably, AoT achieves its results with just a single query! Despite reducing the number of requests by more than a factor of 100, AoT still outperforms ToT on this task. Furthermore, AoT is more efficient than ToT in terms of both the total number of prompt tokens given to the LLM and the completion tokens it generates.
| Method | Success | Queries | PTs | CTs |
| --- | --- | --- | --- | --- |
| I/O | $7.3\%$ | $1$ | $164$ | $18$ |
| CoT | $4.0\%$ | $1$ | $421$ | $46.2$ |
| CoT-SC | $9.0\%$ | $100$ | $42{,}100$ | $4{,}620$ |
| I/O + Refine | $27\%$ | $10$ | $458$ | $360$ |
| ToT $(b=5)$ | $69\%$ | $109.1$ | $13{,}900$ | $5{,}500$ |
| AoT (ours) | $\mathbf{71\%}$ | $1$ | $5{,}450$ | $998.4$ |
Table 1: Game of 24: success rates and the average number of LLM queries for each example. We give the average query count, prompt tokens (PT), and completion tokens generated by the LLM (CT).
Error Analysis.
Using a strictly LLM-centric approach—eschewing any external tooling or edits—we sought to categorize mistakes observed during the game of 24. This aids in highlighting areas for refinement when solely deploying LLMs. We’ve classified these errors into four distinct categories: 1) Out-of-token error: The LLM reaches its maximum token threshold without identifying a solution. 2) Expression misstep: The LLM has the correct logic or steps but fails when trying to express or formulate them into a coherent answer. 3) Non-finalization error: The LLM discovers the solution but continues its search without consolidating the finding. 4) Other errors: This umbrella term encompasses other mistakes like computational errors that result in overlooking the solution or furnishing incorrect answers. To exclusively showcase the AoT’s search capabilities, we also present the AoT + Manual Resolution version. Here, once the LLM pinpoints a solution, its final articulation is manually processed—a strategy also employed by the ToT method. As evidenced in Table 2, a notable 7% of mistakes stem from non-algorithmic factors like non-finalization and expression missteps. In fact, with manual resolution, AoT attains a 78% success rate, surpassing ToT. This underlines the potential for refining our prompt, especially in areas concerning recognizing and expressing successful problem resolutions. Additionally, the token limitation underscores the appeal of expanding the generative context window, which may further bolster LLMs’ recursive reasoning when engaged with algorithmic examples.
| Error type | Rate |
| --- | --- |
| Out-of-token error | $9\%$ |
| Expression misstep | $4\%$ |
| Non-finalization error | $3\%$ |
| Others | $13\%$ |
| Method | Success |
| ToT | $69\%$ |
| AoT | $71\%$ |
| AoT + Manual Resolution | $78\%$ |
Table 2: Game of 24: AoT error analysis.
4.2 Mini Crosswords
The $5 \times 5$ mini crossword is a compact word puzzle featuring a grid of 25 squares arranged in a $5$-by-$5$ configuration. Players are tasked with filling the grid based on provided clues for each word. Clues are given for words that run both across (horizontally) and down (vertically). Words intersect at certain letters, offering additional hints to complete the puzzle.
Task Setup.
Adhering to the setup outlined in (Yao et al., 2023), we draw our prompts from games 136, 141, 146, 151, and 156 out of the 156 games available on goobix.com. Our testing focuses on a set of 20 games, specifically games 1, 6, $...$ , 91, and 96.
Baselines.
As done in the game of 24, we benchmark our method against established techniques: standard prompting, CoT, and ToT. For standard prompting, we provide both the crosswords and their respective solutions as in-context examples. CoT augments this by prompting the retrieval of words for each of the ten clues—equally split between horizontal and vertical orientations. We directly extract the success rates of ToT from their paper for comparison.
AoT Setup.
We divide the process into two steps, each involving a query. Initially, we task the LLM with suggesting five potential words for each row and column. We then pinpoint the starting word candidates that have the highest compatibility with other words within the crossword framework. This preliminary phase mirrors a ’warm-up’ sequence in algorithm initialization. In the subsequent step, we exclusively leverage the LLM’s algorithmic reasoning prowess, starting with the pre-selected word. The method involves cyclically choosing a likely option for insertion, generating candidate words, and assessing their compatibility with the words already on the board. If no match is found, the process shifts focus to another promising candidate. Otherwise, the word is added to the crossword, and the search continues. The cycle concludes either when the board is fully populated or no more suitable words can be found, which may be due to either incorrect existing words or the absence of matching words. Notably, this entire process unfolds within a single-generation window. The algorithmic examples in our prompt (detailed in the Appendix) include three that achieve game completion and two that predominantly populate the crossword, filling 8 or 9 slots.
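The compatibility check at the heart of this cycle can be sketched as simple pattern matching. This is our illustration, not the exact logic in the prompt; the `'_'` symbol for empty squares is our assumption:

```python
# A small sketch (ours) of checking a candidate word against the
# letters already fixed on the board for a given row or column.
def fits(candidate, pattern):
    """`pattern` uses '_' for empty squares and letters for placed ones."""
    return len(candidate) == len(pattern) and all(
        p == "_" or p == c for c, p in zip(candidate, pattern)
    )

def first_fit(candidates, pattern):
    """First compatible candidate, or None, which triggers backtracking."""
    return next((w for w in candidates if fits(w, pattern)), None)
```

For instance, with placed letters forming the pattern `"_r_in"`, the candidate `"brain"` fits while `"crane"` does not; when no candidate fits, the search shifts to another promising option, as described above.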
Results.
Table 3 underscores AoT's proficiency in the mini crosswords task, showcasing a word success rate (the percentage of words correctly completed out of the total, a measure used in existing studies) that surpasses earlier methods reliant on various prompting techniques. It also outperforms ToT. An important observation is the sheer volume of queries ToT employs, exceeding AoT's by over a factor of 100. AoT also enjoys a 25x reduction in total tokens required compared to ToT, a benefit of having everything in-context.
| Method | Success | Queries | PTs | CTs |
| --- | --- | --- | --- | --- |
| I/O | $14\%$ | $1$ | $790.3$ | $30.5$ |
| CoT-SC | $15.6\%$ | $1$ | $1{,}400$ | $1{,}600$ |
| ToT | $46.5\%$ | $>200$ | $96{,}700$ | $21.8$k |
| AoT (ours) | $\mathbf{52\%}$ | $2$ | $3{,}800$ | $975.6$ |
Table 3: $5 \times 5$ mini crosswords: word success rates and the average number of LLM queries for each example. We give the average query count, prompt tokens (PT), and completion tokens generated by the LLM (CT).
Error Analysis.
To understand the prevalent mistakes made by AoT, we categorize the errors into four distinct categories. For each game, we focus on the first error the LLM produces while charting its reasoning path, given that an early error typically cascades into subsequent failures. 1) No preselections: the LLM fails to generate compatible words essential for the warm-start phase. Given a correctly preselected word, the second, recursive-reasoning phase can exhibit the following errors: 2) Expression misstep: the LLM mistakenly believes it has exhausted all choices and jumps to an answer prematurely. 3) Incorrect pattern extraction: the LLM wrongly extracts a pattern based on the current board layout. 4) Erroneous word placement: despite recognizing the correct pattern, the LLM selects a mismatched word or misses better-fitting alternatives. The crossword's complexity arises from outdated terms and esoteric references. The errors observed are predominantly misguided word placements, followed by pattern misinterpretations. The LLM also seems challenged in aligning letters at precise indices to create word structures, an obstacle circumvented by an external mechanism in the ToT framework.
| Error type | Rate |
| --- | --- |
| No preselections | $15.8\%$ |
| Expression misstep | $5.3\%$ |
| Incorrect pattern extraction | $26.3\%$ |
| Erroneous word placement | $52.6\%$ |

Table 4: Breakdown of errors in $5 \times 5$ mini crosswords with AoT. Numbers indicate the relative percentage of each error type among all errors.
4.3 Finetuning
To rule out the possibility that the prior experiments failed due to a lack of domain knowledge or misalignment with the task, even after few-shot prompting via standard prompting or CoT, we also finetuned GPT-3.5-Turbo using OpenAI's API on 900 examples with CoT and AoT. In Table 5, we can see that although GPT-3.5-Turbo had similar solution rates on the Game of 24 with CoT and AoT before finetuning, AoT finetuning improved the model by 60 percentage points, compared to 9 for CoT. This shows that fine-tuning alone cannot elicit implicit non-linear thinking, and LLMs still require explicit exploration of possible options to arrive at a solution. This is similar to chess grandmasters, who can find better moves than others even when they play without thinking deeply, yet must still deliberately explore the space of possibilities to find the truly great moves.
| Method | w/o finetuning | w/ finetuning |
| --- | --- | --- |
| CoT | 3% | 12% |
| AoT | 3% | 63% |
Table 5: AoT’s advantage continues even after finetuning. Finetuning results on the Game of 24 on 900 examples with CoT and AoT prompting.
5 Discussion
In this section, we delve into crucial aspects to consider when crafting prompts for AoT, using the game of 24 as our primary case study.
Can AoT surpass the DFS it is patterned after?
A core query of ours is to ascertain if the LLM has the capability to not only mirror but also outdo the efficiency of the algorithm introduced in-context. As evidenced in Fig. 5, AoT systematically navigates fewer nodes than its DFS counterpart. While DFS employs a uniform strategy when choosing the subsequent subtree to investigate, AoT’s LLM integrates its inherent heuristic. This amplification over the base algorithm exemplifies the advantages of LLM’s recursive reasoning capability.
<details>
<summary>2308.10379v3/x5.png Details</summary>

### Visual Description
A grouped bar chart comparing DFS (purple bars) and AoT (green bars): the x-axis gives the number of visited nodes (0 to 1000) and the y-axis the number of games (0 to 20). AoT's bars concentrate in the low-node range (roughly 0 to 400 nodes, peaking near 100 nodes at about 18 games) and vanish beyond 800 nodes, while DFS's bars sit in the mid-to-high range (200 to 1000 nodes, peaking near 600 nodes at about 8 games). AoT thus solves most games while visiting far fewer nodes than DFS.
</details>
Figure 5: Histogram showing the number of visited nodes for AoT and DFS in the Game of 24.
How does the search step count within the algorithmic example modulate AoT’s behavior?
We begin with the standard AoT prompt and modify the subtree explorations. In AoT (Short), each in-context example uses one or two steps to reach a solution, while AoT (Long) incorporates three to five extra subtree explorations. The impact on total search steps is illustrated in Fig. 6. We observe longer generations for AoT (Long) and shorter ones for AoT (Short) relative to the original AoT, suggesting that the search step count in the examples introduces an implicit bias on the LLM's search velocity. Notably, even when the examples include incorrect steps, it is essential that they emphasize the exploration of promising directions.
<details>
<summary>2308.10379v3/x6.png Details</summary>

### Visual Description
A line graph plotting the cumulative number of games solved (y-axis, 0 to 100) against the number of visited nodes (x-axis, 0 to 400) for three variants: AoT (Short) in blue, AoT in green, and AoT (Long) in red. AoT (Short) and standard AoT rise quickly in the low-node regime and plateau after roughly 300 nodes, whereas AoT (Long) lags early but accelerates past 300 nodes; all variants approach about 100 games by 400 nodes.
</details>
Figure 6: Comparison of AoT with versions prompted using shorter and longer in-context examples: cumulative number of games solved versus the number of visited nodes.
Can AoT be used for question-answering tasks?
To answer this question, we follow the same structure as the Creative Writing task (given in the appendix of our paper) to evaluate AoT and the baselines on the first 100 questions of the well-known GSM8K and StrategyQA benchmarks. Briefly, we implemented a zero-shot AoT prompt for StrategyQA and GSM8K that proposes 3 strategies and expands each with detail before selecting the best one. As seen in Table 6, the boost is slight, since GPT-4 with CoT is already competent on these tasks.
| Method | GSM8K | StrategyQA |
| --- | --- | --- |
| IO | 51% | 73% |
| CoT | 86% | 82% |
| ToT | 90% | 83% |
| AoT | 89% | 84% |
Table 6: Performance comparison of different methods on question-answer tasks using GSM8K and StrategyQA benchmarks. The AoT model shows competitive performance, especially when compared with the CoT and ToT methods.
Can AoT work as other dynamic programming methods?
We have also tested AoT and the baselines on the traditional dynamic programming problems Coin Change and Edit Distance, where DFS and BFS have explosive complexities. Another DP method, tabulation, solves these problems much more easily. Since ToT cannot take the form of tabulation and must use either DFS or BFS, we decided, to be fair, not to include its poor results. However, one can use a single leaf node with a CoT prompt to solve these problems, in which case ToT's performance can be considered the same as CoT's. Please refer to Table 7 for the detailed results.
| Method | Coin Change | Edit Distance |
| --- | --- | --- |
| I/O | 72% | 61% |
| CoT | 76% | 64% |
| AoT | 96% | 90% |
Table 7: Performance comparison of AoT with traditional dynamic programming methods on solving Coin Change and Edit Distance problems.
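The tabulation method named above can be sketched with the classic bottom-up Coin Change solution; this is our reference sketch of the DP technique, not part of the paper's prompts:

```python
# A tabulation sketch (ours, illustrating the classic DP method named
# above) for Coin Change: build the answer bottom-up instead of
# exploring a search tree, which is why DFS/BFS-style prompting is a
# poor fit for this problem.
def min_coins(coins, amount):
    """Fewest coins summing to `amount`, or -1 if impossible."""
    INF = float("inf")
    table = [0] + [INF] * amount          # table[a] = fewest coins for a
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and table[a - c] + 1 < table[a]:
                table[a] = table[a - c] + 1
    return table[amount] if table[amount] != INF else -1
```

For example, `min_coins([1, 2, 5], 11)` returns 3 (5 + 5 + 1), while `min_coins([2], 3)` returns -1.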
Can AoT help other SOTA LLMs?
We investigate AoT for other SOTA LLMs, Claude 3 and Gemini 1.5 Pro, to see if it provides a significant boost for them as well on the game of 24. We see that both Claude 3 and Gemini 1.5 Pro benefit significantly. We were unable to run Gemini 1.5 Pro on CoT-SC due to API access not being available yet. Please refer to Table 8 for detailed results.
| Method | GPT-4 | Claude 3 | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| IO | 7% | 6% | 6% |
| CoT-SC | 9% | 9% | - |
| AoT | 71% | 68% | 55% |
Table 8: Additional language model results for the AoT, CoT-SC, and IO methods across different models.
6 Conclusion
This paper presents the Algorithm of Thoughts, a pioneering prompting strategy to navigate reasoning pathways in LLMs using minimal queries. Our findings reveal that this method not only substantially surpasses prior single-query techniques but also outperforms external tree-search implementations. Such an approach augments the potential to streamline idea discovery in LLMs, balancing both cost and computational demands. Future work includes designing token-efficient algorithmic examples, developing adaptive mechanisms for “tunnel-vision” activation to expedite the search, and deepening the understanding of this fresh mode of in-context learning from theoretical angles.
7 Limitations
While AoT substantially cuts down on the number of queries relative to ToT, its resource demands exceed those of standard prompting and CoT, a consequence of its extensive exploration of ideas via token generation. Crafting token-efficient algorithmic examples is one direction of future research. It is also pertinent to highlight that we conducted our tests exclusively with GPT-4. Though more costly than other LLMs, GPT-4’s advanced capabilities appear pivotal for AoT’s optimal functioning; models of lesser caliber might not yield comparable performance boosts from AoT.
Acknowledgments
This work was supported in part by the Amazon Research and Virginia Tech Initiative for Efficient and Robust Machine Learning and the National Science Foundation (Grants #2331775 and #2312794).
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.
References
- Al-Tawaha et al. (2023) Al-Tawaha, A., Kaushik, H., Sel, B., Jia, R., and Jin, M. Decision-focused learning for inverse noncooperative games: Generalization bounds and convergence analysis. IFAC-PapersOnLine, 56(2):9336–9341, 2023.
- Al-Tawaha et al. (2021) Al-Tawaha, A. S., Aljanaideh, K., and Alshorman, A. A singular value thresholding algorithm for order estimation. In 2021 American Control Conference (ACC), pp. 4478–4483. IEEE, 2021.
- Aminabadi et al. (2022) Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE, 2022.
- Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Baddeley (2003) Baddeley, A. Working memory: looking back and looking forward. Nature reviews neuroscience, 4(10):829–839, 2003.
- Bai et al. (2022) Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Banerjee et al. (2022) Banerjee, S., Bringsjord, S., Giancola, M., and Govindarajulu, N. S. Qualitative mechanical problem-solving by artificial agents:: Further progress, under psychometric ai. In The International FLAIRS Conference Proceedings, volume 35, 2022.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2023) Chen, L., Zaharia, M., and Zou, J. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chiang et al. (2023) Chiang, D., Cholak, P., and Pillay, A. Tighter bounds on the expressivity of transformer encoders. arXiv preprint arXiv:2301.10743, 2023.
- Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Dhar (2020) Dhar, P. The carbon impact of artificial intelligence. Nat. Mach. Intell., 2(8):423–425, 2020.
- Drozdov et al. (2022) Drozdov, A., Schärli, N., Akyürek, E., Scales, N., Song, X., Chen, X., Bousquet, O., and Zhou, D. Compositional Semantic Parsing with Large Language Models. September 2022. URL https://openreview.net/forum?id=gJW8hSGBys8.
- Feng et al. (2023) Feng, G., Gu, Y., Zhang, B., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. arXiv preprint arXiv:2305.15408, 2023.
- Gu et al. (2024a) Gu, S., Sel, B., Ding, Y., Wang, L., Lin, Q., Jin, M., and Knoll, A. Balance reward and safety optimization for safe reinforcement learning: A perspective of gradient manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 21099–21106, 2024a.
- Gu et al. (2024b) Gu, S., Sel, B., Ding, Y., Wang, L., Lin, Q., Knoll, A., and Jin, M. Safe and balanced: A framework for constrained multi-objective reinforcement learning. arXiv preprint arXiv:2405.16390, 2024b.
- Helie & Pizlo (2022) Helie, S. and Pizlo, Z. When is psychology research useful in artificial intelligence? a case for reducing computational complexity in problem solving. Topics in Cognitive Science, 14(4):687–701, 2022.
- Holyoak & Morrison (2005) Holyoak, K. J. and Morrison, R. G. The Cambridge handbook of thinking and reasoning. Cambridge University Press, 2005.
- Huang & Chang (2022) Huang, J. and Chang, K. C.-C. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
- Jin et al. (2023a) Jin, M., Khattar, V., Kaushik, H., Sel, B., and Jia, R. On solution functions of optimization: Universal approximation and covering number bounds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8123–8131, 2023a.
- Jin et al. (2023b) Jin, M., Sel, B., Hardeep, F., and Yin, W. A human-on-the-loop optimization autoformalism approach for sustainability. arXiv preprint arXiv:2308.10380, 2023b.
- Kadavath et al. (2022) Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Kahneman (2011) Kahneman, D. Thinking, fast and slow. macmillan, 2011.
- Khattar & Jin (2023) Khattar, V. and Jin, M. Winning the citylearn challenge: adaptive optimization with evolutionary search under trajectory-based guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 14286–14294, 2023.
- Khattar et al. (2022) Khattar, V., Ding, Y., Sel, B., Lavaei, J., and Jin, M. A cmdp-within-online framework for meta-safe reinforcement learning. In The Eleventh International Conference on Learning Representations, 2022.
- Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Lanham et al. (2023) Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
- Liang et al. (2022) Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Libby et al. (2008) Libby, M. E., Weiss, J. S., Bancroft, S., and Ahearn, W. H. A comparison of most-to-least and least-to-most prompting on the acquisition of solitary play skills. Behavior analysis in practice, 1:37–43, 2008.
- Lin, T.-W., Khattar, V., Huang, Y., Hong, J., Jia, R., Liu, C.-C., Sangiovanni-Vincentelli, A., and Jin, M. Causalprompt: Enhancing llms with weakly supervised causal reasoning for robust performance in non-language tasks.
- Liu et al. (2023) Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z., et al. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852, 2023.
- Long (2023) Long, J. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
- Lyu et al. (2023) Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M., and Callison-Burch, C. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379, 2023.
- Merrill & Sabharwal (2023) Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.
- Mialon et al. (2023) Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
- Monsell (2003) Monsell, S. Task switching. Trends in cognitive sciences, 7(3):134–140, 2003.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Robinson & Wingate (2022) Robinson, J. and Wingate, D. Leveraging Large Language Models for Multiple Choice Question Answering. September 2022. URL https://openreview.net/forum?id=yKbprarjc5B.
- Schuurmans (2023) Schuurmans, D. Memory augmented large language models are computationally universal. arXiv preprint arXiv:2301.04589, 2023.
- Sel et al. (2021) Sel, A., Sel, B., and Kasnakoglu, C. Glsdc based parameter estimation algorithm for a pmsm model. Energies, 14(3):611, 2021.
- Sel et al. (2022) Sel, A., Sel, B., Coskun, U., and Kasnakoglu, C. Sos-based nonlinear observer design for simultaneous state and disturbance estimation designed for a pmsm model. Sustainability, 14(17):10650, 2022.
- Sel et al. (2023) Sel, B., Tawaha, A., Ding, Y., Jia, R., Ji, B., Lavaei, J., and Jin, M. Learning-to-learn to guide random search: Derivative-free meta blackbox optimization on manifold. In Learning for Dynamics and Control Conference, pp. 38–50. PMLR, 2023.
- Sel et al. (2024) Sel, B., Shanmugasundaram, P., Kachuee, M., Zhou, K., Jia, R., and Jin, M. Skin-in-the-game: Decision making via multi-stakeholder alignment in llms. arXiv preprint arXiv:2405.12933, 2024.
- Shao et al. (2023) Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., and Chen, W. Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models. June 2023. URL https://openreview.net/forum?id=RYD1UMgTdk.
- Sloman (1996) Sloman, S. A. The empirical case for two systems of reasoning. Psychological bulletin, 119(1):3, 1996.
- Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Suzgun et al. (2022) Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, October 2022. URL http://arxiv.org/abs/2210.09261. arXiv:2210.09261 [cs].
- Thoppilan et al. (2022) Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- Turpin et al. (2023) Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv preprint arXiv:2305.04388, 2023.
- Wang et al. (2022) Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. September 2022. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Wei et al. (2022a) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent Abilities of Large Language Models, October 2022a. URL http://arxiv.org/abs/2206.07682. arXiv:2206.07682 [cs].
- Wei et al. (2022b) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
- Wu et al. (2022) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al. Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.
- Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, May 2023. URL http://arxiv.org/abs/2305.10601. arXiv:2305.10601 [cs].
- Zelikman et al. (2022) Zelikman, E., Wu, Y., Mu, J., and Goodman, N. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- Zhang et al. (2022) Zhang, Z., Zhang, A., Li, M., and Smola, A. Automatic Chain of Thought Prompting in Large Language Models. September 2022. URL https://openreview.net/forum?id=5NTt8GFjUHkr.
- Zhou et al. (2022a) Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. V., and Chi, E. H. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. September 2022a. URL https://openreview.net/forum?id=WZH7099tgfM.
- Zhou et al. (2022b) Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., and Sedghi, H. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022b.
Appendix A Game of 24 - Additional Details
In order to avoid confusion in our analysis of AoT in the game of 24, we provide additional details on the terminology we use as well as its direct implications for the performance figures. An illustration of these terms is given in Fig. 7.
<details>
<summary>2308.10379v3/x7.png Details</summary>

### Visual Description
# Technical Document Extraction: Subtree Exploration Flowchart
## Diagram Overview
The image depicts a **subtree exploration process** for arithmetic operations on a sequence of numbers. The flowchart illustrates decision paths, visited nodes, and intermediate results during computation.
---
### Key Components
1. **Input**
- **Value**: `8 6 4 4`
- **Role**: Initial sequence of numbers for operations.
2. **Visited Nodes**
- **Purpose**: Tracks accumulated results from operations.
- **Structure**: A gray box with connections to all operation nodes.
- **Values**:
- `8, 6, 0` (from `4 - 4 = 0`)
- `4, 4, 2` (from `8 - 6 = 2`)
- `6, 4` (from `4 + 2 = 6`)
- `2, 1` (from `4 / 4 = 1`)
- `24` (from `6 * 4 = 24`)
- `10` (from `6 + 4 = 10`)
3. **Operations**
- **Categorized into three tiers**:
- **First Operations** (Yellow boxes)
- **Second Operations** (Purple boxes)
- **Third Operations** (Purple boxes with ellipsis)
---
### Operation Details
#### First Operations
1. **Subtraction**
- **Equation**: `4 - 4 = 0`
- **Left Values**: `8, 6, 0`
- **Connection**: Directly linked to input and visited nodes.
2. **Subtraction**
- **Equation**: `8 - 6 = 2`
- **Left Values**: `4, 4, 2`
- **Connection**: Branches into second operations.
#### Second Operations
1. **Addition**
- **Equation**: `4 + 2 = 6`
- **Left Values**: `6, 4`
- **Connection**: Branches into third operations.
2. **Division**
- **Equation**: `4 / 4 = 1`
- **Left Values**: `2, 1`
- **Connection**: Branches into third operations.
#### Third Operations
1. **Multiplication**
- **Equation**: `6 * 4 = 24`
- **Left Values**: `24`
- **Color**: Green box (distinct from other operations).
2. **Addition**
- **Equation**: `6 + 4 = 10`
- **Left Values**: `10`
- **Connection**: Truncated with ellipsis (`...`), indicating additional unshown operations.
---
### Flowchart Structure
- **Input** → **First Operations** → **Second Operations** → **Third Operations**
- **Visited Nodes** act as a central repository for all intermediate results.
- **Color Coding**:
- Yellow: First Operations
- Purple: Second/Third Operations
- Green: Special case (multiplication in third operations).
---
### Key Trends
1. **Operation Hierarchy**:
- Operations cascade from input to deeper subtree levels.
- Each operation reduces the sequence length (e.g., `8 6 4 4` → `4 4 2`).
2. **Visited Nodes Accumulation**:
- Results are appended to the visited nodes list, reflecting all explored paths.
3. **Termination**:
- The process terminates when no further valid operations can be performed (implied by ellipsis).
---
### Diagram Annotations
- **Dotted Box**: Encloses the subtree exploration area.
- **Arrows**: Indicate directional flow of operations and node updates.
- **Ellipsis (`...`)**: Signals continuation of operations beyond the diagram’s scope.
---
### Summary
This flowchart models a recursive exploration of arithmetic operations on a number sequence, tracking visited states and intermediate results. The use of color and hierarchical grouping clarifies the computational path and state transitions.
</details>
Figure 7: An illustration of terminologies we use for the game of 24. The yellow nodes represent the first operations and the states they lead to; the green node represents the node where we find the solution; all other nodes are represented by pink.
First operations / First iterations.
This refers to the first operation after choosing the first two numbers in the game of 24: either adding, subtracting, multiplying, or dividing them.
Subtree Exploration.
This denotes searching all or most of the nodes stemming from the same state, typically states with fewer than four numbers left.
Number of nodes visited.
This is the number of states that the method has visited in the game of 24. Each state is the set of numbers we are left with after performing operations on the numbers. For example, after the first operation we might be left with the numbers ‘ $8\>3\>1$ ’. This set of numbers represents a state, as does the state ‘ $8\>3$ ’ that we are left with after the further operation of ‘ $8*1=8$ ’.
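The notion of states and their successors can be made concrete with a small sketch (an illustrative helper, not the paper's evaluation code): a state is the multiset of remaining numbers, and one operation picks two of them and replaces them with the result of `+`, `-`, `*`, or `/`.

```python
def successor_states(state):
    """Enumerate the states reachable from `state` by one operation.

    A state is the tuple of numbers left; each operation picks an
    ordered pair (a, b) and replaces it with a+b, a-b, a*b, or a/b.
    """
    succs = []
    nums = list(state)
    n = len(nums)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a, b = nums[i], nums[j]
            rest = [nums[k] for k in range(n) if k not in (i, j)]
            results = [a + b, a - b, a * b]
            if b != 0:
                results.append(a / b)
            for r in results:
                # Sort so that states are order-independent, as in the text.
                succs.append(tuple(sorted(rest + [r], reverse=True)))
    return succs

# After the operation '8 * 1 = 8' on the state (8, 3, 1),
# one reachable state is (8, 3), matching the example above.
print((8, 3) in successor_states((8, 3, 1)))  # True
```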
Appendix B Creative Writing
We use the creative writing task, also used by Yao et al. (2023), where the LLM is provided with four arbitrary sentences. The objective is to craft a cohesive narrative divided into four paragraphs, with each paragraph culminating in one of the given sentences. This exercise not only fosters creativity but also emphasizes strategic deliberation.
B.1 Task Setup
Sentences are randomly sourced from randomwordgenerator.com, resulting in 100 distinct sets of inputs. Given the absence of predetermined correct answers, the primary focus lies in evaluating the coherence of the responses. We have noted that GPT-4 consistently aligns with these input guidelines. Evaluation is centered around assessing passage coherence using a GPT-4 zero-shot prompt, where each output is rated on a scale of 1 to 10. Each task response undergoes five such evaluations, with their scores being averaged subsequently.
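As an illustration of this scoring protocol, a minimal sketch of averaging five independent zero-shot ratings (the `judge` callable stands in for a GPT-4 chat call; the prompt wording and score parsing below are assumptions, not the exact prompt used in the paper):

```python
import re
import statistics

def parse_score(reply):
    """Extract a 1-10 rating from a judge model's free-form reply."""
    m = re.search(r"\b(10|[1-9])\b", reply)
    return int(m.group(1)) if m else None

def coherence_score(passage, judge, n_evals=5):
    """Average `n_evals` independent zero-shot ratings of `passage`.

    `judge` is any callable prompt -> reply; here it is a stand-in
    for a GPT-4 chat-completion call (hypothetical wrapper).
    """
    prompt = ("Rate the coherence of the following passage on a scale "
              "of 1 to 10. Reply with a single number.\n\n" + passage)
    scores = [parse_score(judge(prompt)) for _ in range(n_evals)]
    scores = [s for s in scores if s is not None]
    return statistics.mean(scores)

# With a stubbed judge that always answers "8", the average is 8.
print(coherence_score("Some story.", lambda p: "8"))  # 8
```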
B.2 Baselines
For this task, both standard and CoT prompts are employed without preliminary training. While the standard prompt directly guides the LLM to fashion a cohesive narrative based on stipulated parameters, the CoT prompt obliges the model to initially outline a succinct plan prior to drafting the narrative, serving as an intermediate cognitive bridge. For each task iteration, ten samples are generated using both the standard and CoT methods. Results of the ToT approach are presented without modification.
B.3 AoT Setup
Mirroring ToT’s methodology, the task is tackled in a zero-shot setting. Our prompt instructs the model to first formulate five distinct plans. Subsequent to this, the model selects the most promising among them to shape a narrative and then refines it for optimal coherence. The exact prompts used for this zero-shot approach will be provided in the subsequent section.
B.4 Results
As depicted in Fig. 8, AoT outperforms other single-query prompting techniques such as standard prompting and CoT. It also exhibits a marked improvement over ToT, although the difference is not statistically significant. Comprehensive scores, along with the average query count needed for each method, are consolidated in Table 9. Notably, AoT necessitates far fewer queries than ToT.
<details>
<summary>2308.10379v3/x8.png Details</summary>

### Visual Description
# Technical Document Extraction: Box Plot Analysis
## **Chart Type**
- Box plot (box-and-whisker plot) comparing four categorical groups.
---
## **Axis Labels and Markers**
- **X-axis (Categories):**
- Labels: `Standard`, `CoT`, `ToT`, `AoT`
- No numerical axis markers (categorical).
- **Y-axis (Values):**
- Title: `Value`
- Range: `0` to `10` (linear scale, increments of `2`).
---
## **Legend and Color Coding**
- **Legend:** Not explicitly labeled, but colors are distinct for each category:
- `Standard`: Light blue (`#ADD8E6`)
- `CoT`: Medium blue (`#87CEEB`)
- `ToT`: Dark blue (`#4682B4`)
- `AoT`: Navy blue (`#000080`)
- **Color Consistency:**
- Boxes and median lines match category colors.
- Outliers (diamonds) are black (`#000000`).
---
## **Data Points and Trends**
### **Key Observations:**
1. **Median Values (Central Line in Boxes):**
- `Standard`: ~6
- `CoT`: ~7
- `ToT`: ~7.5
- `AoT`: ~8
2. **Interquartile Range (IQR):**
- `Standard`: ~5–7 (IQR ≈ 2)
- `CoT`: ~6–8 (IQR ≈ 2)
- `ToT`: ~7–9 (IQR ≈ 2)
- `AoT`: ~7–9 (IQR ≈ 2)
3. **Whisker Range (Excluding Outliers):**
- `Standard`: ~3–9
- `CoT`: ~4–9
- `ToT`: ~5–10
- `AoT`: ~5–10
4. **Outliers (Diamond Markers):**
- `Standard`: 1 outlier at ~4
- `CoT`: 1 outlier at ~4
- `ToT`: 5 outliers (~4, 4.5, 5, 5.5, 6)
- `AoT`: 6 outliers (~4, 4.5, 5, 5.5, 6, 6.5)
5. **Extreme Values:**
- Maximum whisker values:
- `Standard`: 9
- `CoT`: 9
- `ToT`: 10
- `AoT`: 10
- Minimum whisker values:
- `Standard`: 3
- `CoT`: 4
- `ToT`: 5
- `AoT`: 5
---
## **Components of the Box Plot**
1. **Box:**
- Represents the interquartile range (IQR).
- Top edge: Third quartile (Q3).
- Bottom edge: First quartile (Q1).
- Median line: Horizontal line inside the box.
2. **Whiskers:**
- Extend from Q1 to the minimum non-outlier value.
- Extend from Q3 to the maximum non-outlier value.
3. **Outliers:**
- Marked as black diamonds (`•`).
- Defined as values outside 1.5×IQR from Q1/Q3.
---
## **Summary of Trends**
- **Increasing Central Tendency:**
Median values increase progressively from `Standard` to `AoT`.
- **Similar IQRs:**
All categories exhibit comparable variability (IQR ≈ 2).
- **Outlier Distribution:**
`AoT` has the highest number of outliers, suggesting greater data dispersion.
- **Extreme Values:**
`ToT` and `AoT` reach the maximum y-axis value (10), while `Standard` has the lowest minimum (3).
---
## **Data Table Reconstruction (Hypothetical)**
| Category | Median | Q1 | Q3 | Min (Whisker) | Max (Whisker) | Outliers (Count) |
|----------|--------|----|----|---------------|---------------|------------------|
| Standard | 6 | 5 | 7 | 3 | 9 | 1 |
| CoT | 7 | 6 | 8 | 4 | 9 | 1 |
| ToT | 7.5 | 7 | 9 | 5 | 10 | 5 |
| AoT | 8 | 7 | 9 | 5 | 10 | 6 |
---
## **Conclusion**
The box plot illustrates a clear trend of increasing central tendency (median) from `Standard` to `AoT`, with `AoT` showing the highest median and most outliers. All categories share similar variability (IQR), but `AoT` and `ToT` exhibit broader ranges and higher extreme values. Outliers are more frequent in `AoT`, indicating potential anomalies or variability in that group.
</details>
Figure 8: Comparison of the standard prompting, CoT, ToT and AoT on the creative writing task.
| Method | Score | Avg. Queries |
| --- | --- | --- |
| Standard Prompting | $6.19$ | $1$ |
| CoT | $6.93$ | $1$ |
| ToT | $7.56$ | $20$ |
| AoT | $\mathbf{7.58}$ | $1$ |

Table 9: Performance of the methods as evaluated by GPT-4, together with the average number of queries per example.
Appendix C CoT vs. Single Iteration AoT in the Game of 24
To demonstrate that the tree search mechanism is fundamentally distinct from the CoT prompting, even in scenarios where AoT’s in-context examples include only a single initial operation in the game of 24, we draw a comparison between AoT (Short) and CoT. In this setup, AoT (Short) determines the first operation and subsequently conducts a tree search on the remaining three numbers. Interestingly, AoT (Short) achieves a success rate of $48\%$ , while CoT lags significantly, securing only $4\%$ . These results underscore the notion that even a rudimentary search mechanism can lead to significant performance enhancements.
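For reference, the subtree that AoT (Short) must settle after the first operation is small enough for exhaustive search; below is a minimal brute-force sketch of that search (an illustration of the mechanism, not the LLM procedure itself, which emulates such a search in-context):

```python
def reach_24(nums, eps=1e-6):
    """Exhaustive DFS: can `nums` be combined into 24 using + - * /?

    Each step picks an ordered pair, replaces it with one of the four
    results, and recurses on the shorter list; with three numbers left,
    this settles the remainder of a game of 24 instance.
    """
    if len(nums) == 1:
        return abs(nums[0] - 24) < eps
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            a, b = nums[i], nums[j]
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:  # guard against division by zero
                candidates.append(a / b)
            if any(reach_24(rest + [c], eps) for c in candidates):
                return True
    return False

print(reach_24([8, 6, 4, 4]))   # True: e.g. (4 + (8 - 6)) * 4 = 24
print(reach_24([1, 1, 1, 1]))   # False: no combination reaches 24
```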
Appendix D Detailed Analysis on the Effect of the Length of the Prompts
In this section, we delve deeper into Fig. 6 by presenting histograms for the successful, unsuccessful, and all games of 24, considering the number of initial steps in AoT (Short), AoT, and AoT (Long). These are displayed in Figs. 9–11.
From these figures, it becomes evident that the length of the prompts, measured by the number of initial steps included in in-context examples, correlates with the length of their solutions to test examples. This trend is consistent across all three cases, suggesting that AoT’s strategy in determining the number of initial steps is influenced by its in-context examples.
Interestingly, when AoT is provided with a well-balanced set of initial steps that emphasize the most promising operations, it excels at solving the majority of games in earlier iterations. This indicates AoT’s capacity to prioritize swift problem-solving without sacrificing performance. This tendency is also observed in AoT (Long), albeit with a somewhat reduced success rate, as illustrated in Fig. 9.
<details>
<summary>2308.10379v3/x9.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Structure
- **Chart Type**: Stacked bar charts (3 separate charts)
- **Axes**:
- **X-axis**: "# of First Steps" (0–12, increments of 2)
- **Y-axis**: "# of Successful Games" (0–40, increments of 20)
- **Legend**:
- **Blue**: AoT (Short)
- **Green**: AoT
- **Red**: AoT (Long)
---
## Chart 1: AoT (Short) [Blue]
- **Data Points**:
- X=0: 40 successful games
- X=2–12: 0 successful games
- **Trend**: All successes occur at 0 first steps; no success beyond this threshold.
---
## Chart 2: AoT [Green]
- **Data Points**:
- X=0: 25 successful games
- X=2: 25 successful games
- X=4: 15 successful games
- X=6: 5 successful games
- X=8: 2 successful games
- X=10: 2 successful games
- X=12: 0 successful games
- **Trend**: Success decreases as first steps increase, with a gradual decline after X=4.
---
## Chart 3: AoT (Long) [Red]
- **Data Points**:
- X=0: 15 successful games
- X=2: 10 successful games
- X=4: 5 successful games
- X=6: 2 successful games
- X=8: 1 successful game
- X=10: 1 successful game
- X=12: 0 successful games
- **Trend**: Success declines more steeply than AoT, with minimal success beyond X=6.
---
## Cross-Reference Validation
- **Legend Colors**:
- Blue (AoT Short) matches Chart 1.
- Green (AoT) matches Chart 2.
- Red (AoT Long) matches Chart 3.
- **Axis Consistency**: All charts share identical axis labels and scales.
---
## Summary
- **AoT (Short)** achieves maximum success at 0 first steps with no success beyond this.
- **AoT** and **AoT (Long)** show diminishing returns as first steps increase, with AoT (Long) performing worse overall.
</details>
Figure 9: Histogram of the number of successful games with respect to the number of first steps for AoT (Short), AoT and AoT (Long).
<details>
<summary>2308.10379v3/x10.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Structure
- **Chart Type**: Stacked bar charts (3 separate charts)
- **Orientation**: Vertical bars
- **Axes**:
- **X-axis**: "# of First Steps" (0–12, increments of 2)
- **Y-axis**: "# of Unsuccessful Games" (0–40, increments of 20)
## Legend
| Color | Label |
|-------------|----------------|
| Blue | AoT (Short) |
| Green | AoT |
| Red | AoT (Long) |
## Data Points
### Chart 1: AoT (Short) [Blue]
- **X=0**: Y=40 (Single bar)
- **X=2–12**: No bars (Y=0)
### Chart 2: AoT [Green]
- **X=2**: Y≈5
- **X=4**: Y≈7
- **X=6**: Y≈3
- **X=10**: Y≈15 (Tallest bar)
- **X=0,8,12**: No bars
### Chart 3: AoT (Long) [Red]
- **X=6**: Y≈20
- **X=8**: Y≈22 (Tallest bar)
- **X=10**: Y≈15
- **X=12**: Y≈10
- **X=0,2,4**: No bars
## Key Trends
1. **AoT (Short)**:
- All unsuccessful games occur at **0 first steps**
- No activity beyond X=0
2. **AoT**:
- Unsuccessful games distributed across **2–10 first steps**
- Peak at **X=10** (Y=15)
3. **AoT (Long)**:
- Activity concentrated at **6–12 first steps**
- Peak at **X=8** (Y=22)
- Gradual decline from X=8 to X=12
## Observations
- **AoT (Long)** shows the most distributed pattern across first steps
- **AoT (Short)** exhibits a singular, high-value outlier at X=0
- **AoT** demonstrates moderate spread with a late-stage peak
</details>
Figure 10: Histogram of the number of unsuccessful games with respect to the number of first steps for AoT (Short), AoT and AoT (Long).
<details>
<summary>2308.10379v3/x11.png Details</summary>

### Visual Description
# Technical Document Analysis of Chart Image
## Chart Structure
The image contains three vertically stacked bar charts with shared x-axis and y-axis scales. Each chart represents a different data series with distinct color coding.
### Legend
- **Location**: Bottom right corner
- **Color-Coding**:
- Blue: AoT (Short)
- Green: AoT
- Red: AoT (Long)
## Axis Details
- **X-Axis**:
- Title: "# of First Steps"
- Range: 0 to 12
- Increment: 2
- Labels: 0, 2, 4, 6, 8, 10, 12
- **Y-Axis**:
- Title: "# of All Games"
- Range: 0 to 100
- Increment: 50
- Labels: 0, 50, 100
## Chart Analysis
### Chart 1: AoT (Short) [Blue]
- **Trend**: Single bar at x=0 with maximum height
- **Data Points**:
- [0, 100] (100% of y-axis)
### Chart 2: AoT [Green]
- **Trend**:
- Peaks at x=2 (35 units)
- Gradual decline until x=6 (10 units)
- Slight resurgence at x=12 (15 units)
- **Data Points**:
- [0, 30]
- [2, 35]
- [4, 20]
- [6, 10]
- [8, 5]
- [10, 5]
- [12, 15]
### Chart 3: AoT (Long) [Red]
- **Trend**:
- Initial decline from x=0 (20) to x=4 (10)
- Sharp peak at x=6 (25)
- Gradual decline to x=12 (5)
- **Data Points**:
- [0, 20]
- [2, 15]
- [4, 10]
- [6, 25]
- [8, 20]
- [10, 10]
- [12, 5]
## Key Observations
1. AoT (Short) shows 100% dominance at initial step 0
2. AoT demonstrates bimodal distribution with peaks at steps 2 and 12
3. AoT (Long) exhibits delayed peak at step 6 with highest magnitude
4. All series show decreasing frequency beyond step 6 except AoT (Long) at step 8
## Data Table Reconstruction
| Series | Step 0 | Step 2 | Step 4 | Step 6 | Step 8 | Step 10 | Step 12 |
|--------------|--------|--------|--------|--------|--------|---------|---------|
| AoT (Short) | 100 | - | - | - | - | - | - |
| AoT | 30 | 35 | 20 | 10 | 5 | 5 | 15 |
| AoT (Long) | 20 | 15 | 10 | 25 | 20 | 10 | 5 |
</details>
Figure 11: Histogram of the number of all games with respect to the number of first steps for AoT (Short), AoT and AoT (Long).
Appendix E Proof of Corollary 3.1
**Corollary E.1**
*Consider $\textsc{TIME}(a^{n})$ as the class of languages $L$ for which there exists a Turing machine operating within a time complexity of $O(a^{n})$ for some $a\geq 1$. For a transformer generating $O(a^{n})$ intermediate tokens with AoT, we have
$$
\textsc{TIME}(a^{n})\subseteq\textsc{AoT}(n), \tag{2}
$$
where $\textsc{AoT}(n)$ refers to the decoding steps by AoT when the input has $n$ tokens.*
*Proof.*
Since AoT prompting allows backtracking and continuing from other nodes, the number of intermediate tokens scales exponentially with the depth of the problem, with the base given by the number of possible actions at each node. The stated result then follows directly from Theorem 2 of Merrill & Sabharwal (2023), implying that AoT prompting is capable of solving NP problems. ∎
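The counting step can be made explicit with a sketch: assume a uniform branching factor $b$ and search depth $d$ (both are illustrative assumptions; neither is fixed by the corollary). A depth-first traversal with backtracking emits $O(1)$ tokens per visited node, so the total number of intermediate tokens is

$$
O\!\left(\sum_{k=0}^{d}b^{k}\right)=O\!\left(\frac{b^{d+1}-1}{b-1}\right)=O\big(b^{d}\big),
$$

which is $O(a^{n})$ for $a=b^{d/n}$ whenever the depth $d$ grows linearly with the input length $n$, matching the premise of the corollary.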
Appendix F Prompts
F.1 Game of 24
Below, we present the specific prompts employed for the various methods detailed in the experiments section. It is important to note that the terms “System”, “User”, and “Assistant” denote the roles within the OpenAI API when operating in chat-completion mode. The line breaks serve to show the transitions between user and assistant interactions within the API.
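As context for reading these listings, here is a minimal sketch of how such role-tagged turns map onto chat-completion messages (the helper below is hypothetical, not the code used in the paper):

```python
def to_messages(turns):
    """Convert (role, text) pairs from a prompt listing into the
    chat-completion message format expected by the OpenAI API."""
    role_map = {"System": "system", "User": "user", "Assistant": "assistant"}
    return [{"role": role_map[r], "content": t} for r, t in turns]

msgs = to_messages([
    ("System", "Use numbers and basic arithmetic operations (+ - * /) to obtain 24."),
    ("User", "8 6 4 4"),
])
print(msgs[0]["role"])  # system
```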
AoT (DFS)
⬇
System:
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. When considering the next steps, do not choose operations that will result in a negative or fractional number. In order to help with the calculations, the numbers in the parenthesis represent the numbers that are left after the operations and they are in descending order.
Another thing we do is when there are only two numbers left in the parenthesis, we check whether we can arrive at 24 only by using basic arithmetic operations (+ - * /). Some examples regarding this idea:
(21 2) no
since 21 + 2 = 23, 21 - 2 = 19, 21 * 2 = 42, 21 / 2 = 10.5, none of which is equal to 24.
(30 6) 30 - 6 = 24 yes
(8 3) 8 * 3 = 24 yes
(12 8) no
(48 2) 48 / 2 = 24 yes
Most importantly, do not give up, all the numbers that will be given has indeed a solution.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
14 8 8 2
Assistant:
Trying a promising first operation:
1. 8 + 2: (14, 10, 8)
- 14 + 10: (24, 8) 32, 16, 192, 3
- 14 - 10: (8, 4) 12, 8, 32, 2
- 14 * 10: (140, 8) 148, 132, 1120, fractional
- 14 / 10: (8, 1.4) fractional
- 14 + 8: (22, 10) 32, 12, 220, fractional
- 14 - 8: (10, 6) 16, 4, 60, fractional
- 14 * 8: (112, 10) 122, 102, 1120, fractional
- 14 / 8: fractional
- 10 + 8: (18, 14) 32, 4, 252, fractional
- 10 - 8: (14, 2) 16, 12, 28, 7
- 10 * 8: (80, 14) 94, 66, big, fractional
- 10 / 8: fractional
Trying another promising first operation:
2. 8 / 2: (14, 8, 4)
- 14 + 8: (22, 4) 26, 18, 88, fractional
- 14 - 8: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 / 2 = 4
Step 2:
14 - 8 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (14 - 8) * 4 = (14 - 8) * (8 / 2) = 24.
answer: (14 - 8) * (8 / 2) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 5 5 5
Assistant:
Trying a promising first operation:
1. 9 - 5: (5, 5, 4)
- 5 + 5: (10, 4) 14, 6, 40, fractional
- 5 - 5: (4, 0) 4, 4, 0, undefined
- 5 * 5: (25, 4) 29, 21, 100, fractional
- 5 / 5: (4, 1) 5, 3, 4, 4
- 5 + 4: (9, 5) 14, 4, 45, fractional
- 5 - 4: (5, 1) 6, 4, 5, 0.2
- 5 * 4: (20, 5) 25, 15, 100, fractional
- 5 / 4: fractional
Trying another promising first operation:
2. 5 + 5: (10, 9, 5)
- 10 + 9: (19, 5) 24 = 19 + 5 -> found it!
Backtracking the solution:
Step 1:
5 + 5 = 10
Step 2:
10 + 9 = 19
Step 3:
19 + 5 = 24
Considering these steps: 24 = 19 + 5 = (10 + 9) + 5 = ((5 + 5) + 9) + 5 = 24.
answer: ((5 + 5) + 9) + 5 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 6 4 4
Assistant:
Trying a promising first operation:
1. 8 - 6: (4, 4, 2)
- 4 + 4: (8, 2) 10, 6, 16, 4
- 4 - 4: (2, 0) 2, 2, 0, undefined
- 4 * 4: (16, 2) 18, 14, 32, 8
- 4 / 4: (2, 1) 3, 1, 2, 2
- 4 + 2: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 - 6 = 2
Step 2:
4 + 2 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (4 + 2) * 4 = (4 + (8 - 6)) * 4 = 24.
answer: (4 + (8 - 6)) * 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 10 9 4
Assistant:
Trying a promising first operation:
1. 13 - 4: (10, 9, 9)
- 10 + 9: (19, 9) 28, 10, 171, fractional
- 10 - 9: (9, 1) 10, 8, 9, 9
- 10 * 9: (90, 9) 99, 81, 810, fractional
- 10 / 9: fractional
- 9 + 9: (18, 10) 28, 8, 180, fractional
- 9 - 9: (10, 0) 10, 10, 0, undefined
- 9 * 9: (81, 10) 91, 71, 810, fractional
- 9 / 9: (10, 1) 11, 9, 10, 10
Trying another promising first operation:
2. 10 - 4: (13, 9, 6)
- 13 + 9: (22, 6) 28, 16, 132, fractional
- 13 - 9: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
10 - 4 = 6
Step 2:
13 - 9 = 4
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = 6 * (13 - 9) = (10 - 4) * (13 - 9) = 24.
answer: (10 - 4) * (13 - 9) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 8 5 4
Assistant:
Trying a promising first operation:
1. 8 - 4: (8, 5, 4)
- 8 + 5: (13, 4) 17, 9, 52, fractional
- 8 - 5: (4, 3) 7, 1, 12, fractional
- 8 * 5: (40, 4) 44, 36, 160, fractional
- 8 / 5: fractional
- 8 + 4: (12, 5) 17, 7, 60, fractional
- 8 - 4: (5, 4) 9, 1, 20, fractional
- 8 * 4: (32, 5) 37, 27, 160, fractional
- 8 / 4: (5, 2) 7, 3, 10, 2.5
- 5 + 4: (9, 8) 17, 1, 72, fractional
- 5 - 4: (8, 1) 9, 7, 8, 8
- 5 * 4: (20, 8) 28, 12, 160, fractional
- 5 / 4: fractional
Trying another promising first operation:
2. 8 / 4: (8, 5, 2)
- 8 + 5: (13, 2) 15, 11, 26, fractional
- 8 - 5: (3, 2) 5, 1, 6, 1.5
- 8 * 5: (40, 2) 42, 38, 80, 20
- 8 / 5: fractional
- 8 + 2: (10, 5) 15, 5, 50, 2
- 8 - 2: (6, 5) 11, 1, 30, fractional
- 8 * 2: (16, 5) 21, 11, 80, fractional
- 8 / 2: (5, 4) 9, 1, 20, fractional
- 5 + 2: (7, 8) 15, 1, 56, fractional
- 5 - 2: (8, 3) 11, 5, 24 = 8 * 3 -> found it!
Backtracking the solution:
Step 1:
8 / 4 = 2
Step 2:
5 - 2 = 3
Step 3:
8 * 3 = 24
Considering these steps: 24 = 8 * 3 = 8 * (5 - 2) = 8 * (5 - (8 / 4)) = 24.
answer: 8 * (5 - (8 / 4)) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 11 1 1
Assistant:
Trying a promising first operation:
1. 11 - 1: (11, 10, 1)
- 11 + 10: (21, 1) 22, 20, 21, 21
- 11 - 10: (1, 1) 2, 0, 1, 1
- 11 * 10: (110, 1) 111, 109, 110, 110
- 11 / 10: fractional
- 11 + 1: (12, 10) 22, 2, 120, 1.2
- 11 - 1: (10, 10) 20, 0, 100, 1
- 11 * 1: (11, 10) 21, 1, 110, 1.1
- 11 / 1: (11, 10) 21, 1, 110, fractional
- 10 + 1: (11, 11) 22, 0, 121, 1
- 10 - 1: (11, 9) 20, 2, 99, 1.1
- 10 * 1: (11, 10) 21, 1, 110, 1.1
- 10 / 1: (11, 10) 21, 1, 110, 1.1
Trying another promising first operation:
2. 11 + 1: (12, 11, 1)
- 12 + 11: (23, 1) 24 = 23 + 1 -> found it!
Backtracking the solution:
Step 1:
11 + 1 = 12
Step 2:
12 + 11 = 23
Step 3:
23 + 1 = 24
Considering these steps: 24 = 23 + 1 = (12 + 11) + 1 = ((11 + 1) + 11) + 1 = 24.
answer: ((11 + 1) + 11) + 1 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 7 4 1
Assistant:
Trying a promising first operation:
1. 7 * 4: (28, 11, 1)
- 28 + 11: (39, 1) 40, 38, 39, 39
- 28 - 11: (17, 1) 18, 16, 17, 17
- 28 * 11: (308, 1) 309, 307, 308, 308
- 28 / 11: fractional
- 28 + 1: (29, 11) 40, 18, 319, fractional
- 28 - 1: (27, 11) 38, 16, 297, fractional
- 28 * 1: (28, 11) 39, 17, 308, fractional
- 28 / 1: (28, 11) 39, 17, 308, fractional
- 11 + 1: (29, 28) 57, 1, 812, fractional
- 11 - 1: (28, 10) 38, 18, 280, fractional
- 11 * 1: (28, 11) 39, 17, 308, fractional
- 11 / 1: (28, 11) 39, 17, 308, fractional
Trying another promising first operation:
2. 7 + 1: (11 8 4)
- 11 + 8: (19, 4) 23, 15, 76, fractional
- 11 - 8: (4, 3) 7, 1, 12, fractional
- 11 * 8: (88, 4) 92, 84, 352, fractional
- 11 / 8: fractional
- 11 + 4: (15, 8) 23, 7, 120, fractional
- 11 - 4: (7, 8) 15, -1, 56, fractional
- 11 * 4: (44, 8) 52, 36, 352, fractional
- 11 / 4: fractional
- 8 + 4: (12, 11) 23, -1, 132, fractional
- 8 - 4: (11, 4) 15, 7, 44, fractional
- 8 * 4: (32, 11) 43, 21, 352, fractional
- 8 / 4: (11, 2) 13, 9, 22, fractional
Trying another promising first operation:
3. 4 + 1: (11 7 5)
- 11 + 7: (18, 5) 23, 13, 90, fractional
- 11 - 7: (5, 4) 9, 1, 20, fractional
- 11 * 7: (77, 5) 82, 72, 385, fractional
- 11 / 7: fractional
- 11 + 5: (16, 7) 23, 9, 112, fractional
- 11 - 5: (7, 6) 13, 1, 42, fractional
- 11 * 5: (55, 7) 62, 48, 385, fractional
- 11 / 5: fractional
- 7 + 5: (12, 11) 23, 1, 132, fractional
- 7 - 5: (11, 2) 13, 9, 22, fractional
- 7 * 5: (35, 11) 46, 24 = 35 - 11 -> found it!
Backtracking the solution:
Step 1:
4 + 1 = 5
Step 2:
7 * 5 = 35
Step 3:
35 - 11 = 24
Considering these steps: 24 = 35 - 11 = (7 * 5) - 11 = (7 * (4 + 1)) - 11 = 24.
answer: (7 * (4 + 1)) - 11 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 5 4 3
Assistant:
Trying a promising first operation:
1. 5 * 4: (20, 11, 3)
- 20 + 11: (31, 3) 34, 28, 93, fractional
- 20 - 11: (9, 3) 12, 6, 27, 3
- 20 * 11: (220, 3) 223, 217, 660, fractional
- 20 / 11: fractional
- 20 + 3: (23, 11) 34, 12, 253, fractional
- 20 - 3: (17, 11) 28, 6, 187, fractional
- 20 * 3: (60, 11) 71, 49, 660, fractional
- 20 / 3: fractional
- 11 + 3: (14, 20) 34, -6, 280, fractional
- 11 - 3: (8, 20) 28, -12, 160, fractional
- 11 * 3: (33, 20) 53, 13, 660, fractional
- 11 / 3: fractional
Trying another promising first operation:
2. 11 - 3: (8, 5, 4)
- 8 + 5: (13, 4) 17, 9, 52, fractional
- 8 - 5: (4, 3) 7, 1, 12, fractional
- 8 * 5: (40, 4) 44, 36, 160, fractional
- 8 / 5: fractional
- 8 + 4: (12, 5) 17, 7, 60, fractional
- 8 - 4: (5, 4) 9, 1, 20, fractional
- 8 * 4: (32, 5) 37, 27, 160, fractional
- 8 / 4: (5, 2) 7, 3, 10, 2.5
- 5 + 4: (9, 8) 17, 1, 72, fractional
- 5 - 4: (8, 1) 9, 7, 8, 8
- 5 * 4: (20, 8) 28, 12, 160, fractional
- 5 / 4: fractional
Trying another promising first operation:
3. 11 * 3: (33, 5, 4)
- 33 + 5: (38, 4) 42, 34, 152, fractional
- 33 - 5: (28, 4) 32, 24 = 28 - 4 -> found it!
Backtracking the solution:
Step 1:
11 * 3 = 33
Step 2:
33 - 5 = 28
Step 3:
28 - 4 = 24
Considering these steps: 24 = 28 - 4 = (33 - 5) - 4 = ((11 * 3) - 5) - 4 = 24.
answer: ((11 * 3) - 5) - 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 12 5 2
Assistant:
Trying a promising first operation:
1. 12 + 2: (14, 13, 5)
- 14 + 13: (27, 5) 32, 22, 135, fractional
- 14 - 13: (5, 1) 6, 4, 5, 5
- 14 * 13: (182, 5) 187, 177, 910, fractional
- 14 / 13: fractional
- 14 + 5: (19, 13) 32, 6, 247, fractional
- 14 - 5: (13, 9) 22, 4, 117, fractional
- 14 * 5: (70, 13) 83, 57, 910, fractional
- 14 / 5: fractional
- 13 + 5: (18, 14) 32, 4, 252, fractional
- 13 - 5: (14, 8) 22, 6, 112, fractional
- 13 * 5: (65, 14) 79, 51, 910, fractional
- 13 / 5: fractional
Trying another promising first operation:
2. 13 + 5: (18, 12, 2)
- 18 + 12: (30, 2) 32, 28, 60, fractional
- 18 - 12: (6, 2) 8, 4, 12, 3
- 18 * 12: (216, 2) 218, 214, 432, fractional
- 18 / 12: fractional
- 18 + 2: (20, 12) 32, 8, 240, fractional
- 18 - 2: (16, 12) 28, 4, 192, fractional
- 18 * 2: (36, 12) 48, 24 = 36 - 12 -> found it!
Backtracking the solution:
Step 1:
13 + 5 = 18
Step 2:
18 * 2 = 36
Step 3:
36 - 12 = 24
Considering these steps: 24 = 36 - 12 = (18 * 2) - 12 = ((13 + 5) * 2) - 12 = 24.
answer: ((13 + 5) * 2) - 12 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 8 2 1
Assistant:
Trying a promising first operation:
1. 9 - 1: (8, 8, 2)
- 8 + 8: (16, 2) 18, 14, 32, 8
- 8 - 8: (2, 0) 2, 2, 0, undefined
- 8 * 8: (64, 2) 66, 62, 128, 32
- 8 / 8: (2, 1) 3, 1, 2, 2
- 8 + 2: (10, 8) 18, 2, 80, 4
- 8 - 2: (6, 8) 14, -2, 48, fractional
- 8 * 2: (16, 8) 24 = 16 + 8 -> found it!
Backtracking the solution:
Step 1:
9 - 1 = 8
Step 2:
8 * 2 = 16
Step 3:
16 + 8 = 24
Considering these steps: 24 = 16 + 8 = (8 * 2) + 8 = ((9 - 1) * 2) + 8 = 24.
answer: ((9 - 1) * 2) + 8 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
F.1.1 AoT (Long)
System:
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. When considering the next steps, do not choose operations that will result in a negative or fractional number. In order to help with the calculations, the numbers in the parentheses represent the numbers that are left after the operations, and they are in descending order.
Another thing we do: when there are only two numbers left in the parentheses, we check whether we can arrive at 24 using only basic arithmetic operations (+ - * /). Some examples regarding this idea:
(21 2) no
since 21 + 2 = 23, 21 - 2 = 19, 21 * 2 = 42, 21 / 2 = 10.5, none of which is equal to 24.
(30 6) 30 - 6 = 24 yes
(8 3) 8 * 3 = 24 yes
(12 8) no
(48 2) 48 / 2 = 24 yes
Most importantly, do not give up; all the numbers that will be given do indeed have a solution.
User:
14 8 8 2
Assistant:
Trying a promising first operation:
1. 8 + 2: (14, 10, 8)
- 14 + 10: (24, 8) 32, 16, 192, 3
- 14 - 10: (8, 4) 12, 8, 32, 2
- 14 * 10: (140, 8) 148, 132, 1120, fractional
- 14 / 10: (8, 1.4) fractional
- 14 + 8: (22, 10) 32, 12, 220, fractional
- 14 - 8: (10, 6) 16, 4, 60, fractional
- 14 * 8: (112, 10) 122, 102, 1120, fractional
- 14 / 8: fractional
- 10 + 8: (18, 14) 32, 4, 252, fractional
- 10 - 8: (14, 2) 16, 12, 28, 7
- 10 * 8: (80, 14) 94, 66, big, fractional
- 10 / 8: fractional
Trying another promising first operation:
2. 14 + 8: (22, 8, 2)
- 22 + 8: (30, 2) 32, 28, 60, 15
- 22 - 8: (14, 2) 16, 12, 28, 7
- 22 * 8: (176, 2) 178, 174, 352, 88
- 22 / 8: (2.75, 2) fractional
- 22 + 2: (24, 8) 32, 16, 192, 3
- 22 - 2: (20, 8) 28, 12, 160, fractional
- 22 * 2: (44, 8) 52, 36, 352, fractional
- 22 / 2: (11, 8) 19, 3, 88, fractional
- 8 + 2: (22, 10) 32, 12, 220, fractional
- 8 - 2: (22, 6) 28, 16, 132, fractional
- 8 * 2: (22, 16) 38, 6, 352, fractional
- 8 / 2: (22, 4) 26, 18, 88, fractional
Trying another promising first operation:
3. 14 + 2: (16, 8, 8)
- 16 + 8: (24, 8) 32, 16, 192, 3
- 16 - 8: (8, 8) 16, 0, 64, 1
- 16 * 8: (128, 8) 136, 120, 1024, 16
- 16 / 8: (8, 2) 10, 6, 16, 4
- 8 + 8: (16, 16) 32, 0, 256, 1
- 8 - 8: (16, 0) 16, 16, 0, undefined
- 8 * 8: (64, 16) 80, 48, 1024, 4
- 8 / 8: (16, 1) 17, 15, 16, 16
Trying another promising first operation:
4. 8 - 2: (14, 8, 6)
- 14 + 8: (22, 14) 36, 8, 308, fractional
- 14 - 8: (6, 6) 12, 0, 36, 1
- 14 * 8: (112, 6) 118, 106, 672, fractional
- 14 / 8: (6, 1.75) fractional
- 14 + 6: (20, 8) 22, 12, 160, fractional
- 14 - 6: (8, 8) 16, 0, 64, 1
- 14 * 6: (84, 8) 92, 76, 672, fractional
- 14 / 6: (8, 2.3) fractional
- 8 + 6: (14, 14) 28, 0, 196, 1
- 8 - 6: (14, 2) 16, 12, 28, 7
- 8 * 6: (48, 14) 62, 34, 672, fractional
- 8 / 6: (14, 1.3) fractional
Trying another promising first operation:
5. 8 * 2: (16, 14, 8)
- 16 + 14: (30, 8) 38, 22, 240, fractional
- 16 - 14: (8, 2) 10, 6, 16, 4
- 16 * 14: (224, 8) 232, 216, 1792, 28
- 16 / 14: (8, 1.1) fractional
- 16 + 8: (24, 14) 38, 10, 336, fractional
- 16 - 8: (14, 8) 22, 6, 112, fractional
- 16 * 8: (128, 14) 142, 112, 1792, fractional
- 16 / 8: (14, 2) 16, 12, 28, 7
- 14 + 8: (22, 16) 38, 6, 352, fractional
- 14 - 8: (16, 6) 22, 10, 96, fractional
- 14 * 8: (112, 16) 128, 96, 1792, 7
- 14 / 8: (16, 1.7) fractional
Trying another promising first operation:
6. 14 * 2: (28, 8, 8)
- 28 + 8: (36, 8) 44, 28, 288, fractional
- 28 - 8: (20, 8) 28, 12, 160, fractional
- 28 * 8: (224, 8) 232, 216, 1792, 28
- 28 / 8: (8, 3.5) fractional, fractional, 28, fractional
- 8 + 8: (16, 16) 32, 0, 256, 1
- 8 - 8: (16, 0) 16, 16, 0, undefined
- 8 * 8: (64, 16) 80, 48, 1024, 4
- 8 / 8: (16, 1) 17, 15, 16, 16
Trying another promising first operation:
7. 8 / 2: (14, 8, 4)
- 14 + 8: (22, 4) 26, 18, 88, fractional
- 14 - 8: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 / 2 = 4
Step 2:
14 - 8 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (14 - 8) * 4 = (14 - 8) * (8 / 2) = 24.
answer: (14 - 8) * (8 / 2) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 5 5 5
Assistant:
Trying a promising first operation:
1. 9 - 5: (5, 5, 4)
- 5 + 5: (10, 4) 14, 6, 40, fractional
- 5 - 5: (4, 0) 4, 4, 0, undefined
- 5 * 5: (25, 4) 29, 21, 100, fractional
- 5 / 5: (4, 1) 5, 3, 4, 4
- 5 + 4: (9, 5) 14, 4, 45, fractional
- 5 - 4: (5, 1) 6, 4, 5, 0.2
- 5 * 4: (20, 5) 25, 15, 100, fractional
- 5 / 4: fractional
Trying another promising first operation:
2. 5 * 5: (25, 9, 5)
- 25 + 9: (34, 5) 39, 29, 170, fractional
- 25 - 9: (16, 5) 21, 11, 80, fractional
- 25 * 9: (225, 5) 230, 220, 1125, 45
- 25 / 9: (5, 2.7) fractional
- 25 + 5: (30, 9) 39, 21, 270, fractional
- 25 - 5: (20, 9) 29, 11, 180, fractional
- 25 * 5: (75, 9) 84, 66, 675, fractional
- 25 / 5: (9, 5) 14, 4, 45, fractional
- 9 + 5: (25, 14) 39, 11, 350, fractional
- 9 - 5: (25, 4) 29, 21, 100, fractional
- 9 * 5: (45, 25) 70, 20, 1125, fractional
- 9 / 5: (25, 1.8) fractional, fractional, 45, fractional
Trying another promising first operation:
3. 5 - 5: (9, 5, 0)
- 9 + 5: (14, 0) 14, 14, 0, undefined
- 9 - 5: (4, 0) 4, 4, 0, undefined
- 9 * 5: (45, 0) 45, 45, 0, undefined
- 9 / 5: fractional
- 9 + 0: (9, 5) 14, 4, 45, fractional
- 9 - 0: (9, 5) 14, 4, 45, fractional
- 9 * 0: (5, 0) 5, 5, 0, undefined
- 9 / 0: undefined
- 5 + 0: (9, 5) 14, 4, 45, fractional
- 5 - 0: (9, 5) 14, 4, 45, fractional
- 5 * 0: (9, 0) 9, 9, 0, undefined
- 5 / 0: undefined
Trying another promising first operation:
4. 5 / 5: (9, 5, 1)
- 9 + 5: (14, 1) 15, 13, 14, 14
- 9 - 5: (4, 1) 5, 3, 4, 4
- 9 * 5: (45, 1) 46, 44, 45, 45
- 9 / 5: fractional
- 9 + 1: (10, 5) 15, 5, 50, 2
- 9 - 1: (8, 5) 13, 3, 40, fractional
- 9 * 1: (9, 5) 14, 4, 45, fractional
- 9 / 1: (9, 5) 14, 4, 45, fractional
- 5 + 1: (9, 6) 15, 3, 54, fractional
- 5 - 1: (9, 4) 13, 5, 36, fractional
- 5 * 1: (9, 5) 14, 4, 45, fractional
- 5 / 1: (9, 5) 14, 4, 45, fractional
Trying another promising first operation:
5. 9 * 5: (45, 5, 5)
- 45 + 5: (50, 5) 55, 45, 250, 10
- 45 - 5: (40, 5) 45, 35, 200, 8
- 45 * 5: (225, 5) 230, 220, 1125, 45
- 45 / 5: (9, 5) 14, 4, 45, fractional
- 5 + 5: (10, 4) 14, 6, 40, fractional
- 5 - 5: (4, 0) 4, 4, 0, undefined
- 5 * 5: (25, 4) 29, 21, 100, fractional
- 5 / 5: (4, 1) 5, 3, 4, 4
Trying another promising first operation:
6. 5 + 5: (10, 9, 5)
- 10 + 9: (19, 5) 24 = 19 + 5 -> found it!
Backtracking the solution:
Step 1:
5 + 5 = 10
Step 2:
10 + 9 = 19
Step 3:
19 + 5 = 24
Considering these steps: 24 = 19 + 5 = (10 + 9) + 5 = ((5 + 5) + 9) + 5 = 24.
answer: ((5 + 5) + 9) + 5 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 6 4 4
Assistant:
Trying a promising first operation:
1. 8 * 6: (48, 4, 4)
- 48 + 4: (52, 4) 56, 48, 208, 13
- 48 - 4: (44, 4) 48, 40, 176, 11
- 48 * 4: (192, 4) 196, 188, 768, 48
- 48 / 4: (12, 4) 16, 8, 48, 3
- 4 + 4: (48, 8) 56, 40, 384, 6
- 4 - 4: (48, 0) 48, 48, 0, undefined
- 4 * 4: (48, 16) 64, 32, 768, 3
- 4 / 4: (48, 1) 49, 47, 48, 48
Trying another promising first operation:
2. 4 - 4: (8, 6, 0)
- 8 + 6: (14, 0) 14, 14, 0, undefined
- 8 - 6: (2, 0) 2, 2, 0, undefined
- 8 * 6: (48, 0) 48, 48, 0, undefined
- 8 / 6: (1.3, 0) fractional
- 8 + 0: (8, 6) 14, 2, 48, fractional
- 8 - 0: (8, 6) 14, 2, 48, fractional
- 8 * 0: (6, 0) 6, 6, 0, undefined
- 8 / 0: undefined
- 6 + 0: (8, 6) 14, 2, 48, fractional
- 6 - 0: (8, 6) 14, 2, 48, fractional
- 6 * 0: (8, 0) 8, 8, 0, undefined
- 6 / 0: undefined
Trying another promising first operation:
3. 4 / 4: (8, 6, 1)
- 8 + 6: (14, 1) 15, 13, 14, 14
- 8 - 6: (2, 1) 3, 1, 2, 2
- 8 * 6: (48, 1) 49, 47, 48, 48
- 8 / 6: (1.3, 1) fractional
- 8 + 1: (9, 6) 15, 3, 54, fractional
- 8 - 1: (7, 6) 13, 1, 42, fractional
- 8 * 1: (8, 6) 14, 2, 48, fractional
- 8 / 1: (8, 6) 14, 2, 48, fractional
- 6 + 1: (8, 7) 15, 1, 56, fractional
- 6 - 1: (8, 5) 13, 3, 40, fractional
- 6 * 1: (8, 6) 14, 2, 48, fractional
- 6 / 1: (8, 1) 9, 7, 8, 8
Trying another promising first operation:
4. 8 - 6: (4, 4, 2)
- 4 + 4: (8, 2) 10, 6, 16, 4
- 4 - 4: (2, 0) 2, 2, 0, undefined
- 4 * 4: (16, 2) 18, 14, 32, 8
- 4 / 4: (2, 1) 3, 1, 2, 2
- 4 + 2: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 - 6 = 2
Step 2:
4 + 2 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (4 + 2) * 4 = (4 + (8 - 6)) * 4 = 24.
answer: (4 + (8 - 6)) * 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 10 9 4
Assistant:
Trying a promising first operation:
1. 13 - 4: (10, 9, 9)
- 10 + 9: (19, 9) 28, 10, 171, fractional
- 10 - 9: (9, 1) 10, 8, 9, 9
- 10 * 9: (90, 9) 99, 81, 810, fractional
- 10 / 9: fractional
- 9 + 9: (18, 10) 28, 8, 180, fractional
- 9 - 9: (10, 0) 10, 10, 0, undefined
- 9 * 9: (81, 10) 91, 71, 810, fractional
- 9 / 9: (10, 1) 11, 9, 10, 10
Trying another promising first operation:
2. 13 / 10: (9, 4, 1.3)
- 9 + 4: (13, 1.3) fractional, fractional, fractional, 10
- 9 - 4: (5, 1.3) fractional
- 9 * 4: (36, 1.3) fractional
- 9 / 4: (2.3, 1.3) fractional, 1, fractional, fractional
- 9 + 1.3: (10.3, 4) fractional
- 9 - 1.3: (7.7, 4) fractional
- 9 * 1.3: (11.7, 4) fractional
- 9 / 1.3: (6.9, 4) fractional
- 4 + 1.3: (9, 5.3) fractional
- 4 - 1.3: (9, 2.7) fractional
- 4 * 1.3: (9, 5.2) fractional
- 4 / 1.3: (9, 3.1) fractional
Trying another promising first operation:
3. 9 / 4: (13, 10, 2.3)
- 13 + 10: (23, 2.3) fractional, fractional, fractional, 10
- 13 - 10: (3, 2.3) fractional
- 13 * 10: (130, 2.3) fractional
- 13 / 10: (2.3, 1.3) fractional, 1, fractional, fractional
- 13 + 2.3: (15.3, 10) fractional, fractional, 153, fractional
- 13 - 2.3: (11.7, 10) fractional, fractional, 117, fractional
- 13 * 2.3: (29.9, 10) fractional, fractional, 299, fractional
- 13 / 2.3: (10, 5.6) fractional, fractional, 560, fractional
- 10 + 2.3: (13, 12.3) fractional
- 10 - 2.3: (13, 7.7) fractional
- 10 * 2.3: (23, 13) 36, 10, 299, fractional
- 10 / 2.3: (13, 4.3) fractional
Trying another promising first operation:
4. 13 / 4: (10, 9, 3.3)
- 10 + 9: (19, 3.3) fractional
- 10 - 9: (3.3, 1) fractional
- 10 * 9: (90, 3.3) fractional
- 10 / 9: (3.3, 1.1) fractional, fractional, fractional, 3
- 10 + 3.3: (13.3, 9) fractional
- 10 - 3.3: (9, 6.7) fractional
- 10 * 3.3: (33, 9) 42, 24, 297, fractional
- 10 / 3.3: (3.1, 9) fractional
- 9 + 3.3: (12.3, 10) fractional, fractional, 123, fractional
- 9 - 3.3: (10, 5.7) fractional, fractional, 57, fractional
- 9 * 3.3: (29.7, 10) fractional, fractional, 297, fractional
- 9 / 3.3: (10, 2.7) fractional, fractional, 27, fractional
Trying another promising first operation:
5. 13 / 9: (10, 9, 1.4)
- 10 + 9: (19, 1.4) fractional
- 10 - 9: (1.4, 1) fractional
- 10 * 9: (90, 1.4) fractional, fractional, 130, fractional
- 10 / 9: (1.1, 1.4) fractional
- 10 + 1.4: (11.4, 9) fractional
- 10 - 1.4: (9, 8.6) fractional
- 10 * 1.4: (14, 9) 23, 5, 126, fractional
- 10 / 1.4: (9, 7.1) fractional
- 9 + 1.4: (10.4, 10) fractional, fractional, 104, fractional
- 9 - 1.4: (10, 7.6) fractional, fractional, 76, fractional
- 9 * 1.4: (12.6, 10) fractional, fractional, 126, fractional
- 9 / 1.4: (10, 6.4) fractional, fractional, 64, fractional
Trying another promising first operation:
6. 10 - 4: (13, 9, 6)
- 13 + 9: (22, 6) 28, 16, 132, fractional
- 13 - 9: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
10 - 4 = 6
Step 2:
13 - 9 = 4
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = 6 * (13 - 9) = (10 - 4) * (13 - 9) = 24.
answer: (10 - 4) * (13 - 9) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
F.1.2 AoT (Random)
System:
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. In each step, you are only allowed to choose two of the remaining numbers to obtain a new number.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
4 4 6 8.
Assistant:
Trying a promising first operation:
(4 + 4) * 6 - 8 = 40.
Trying another promising first operation:
(8 - 6) * 4 + 4 = 12.
Trying another promising first operation:
(6 - 4) * (4 + 8) = 24.
answer: (6 - 4) * (4 + 8) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
2 9 10 12.
Assistant:
Trying a promising first operation:
((12 - 10) / 2) + 9 = 10.
Trying another promising first operation:
(10 + 9) + (12 / 2) = 25.
Trying another promising first operation:
(12 * 2) * (10 - 9) = 24.
answer: (12 * 2) * (10 - 9) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
4 9 10 13.
Assistant:
Trying a promising first operation:
((13 - 10) / 4) + 9 = 10.
Trying another promising first operation:
4 * (9 - (13 - 10)) = 24.
answer: 4 * (9 - (13 - 10)) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
1 4 8 8.
Assistant:
Trying a promising first operation:
(8 + 1) + 4 * 8 = 41.
Trying another promising first operation:
(1 + 8 / 4) * 8 = 24.
answer: (1 + 8 / 4) * 8 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
5 5 5 9.
Assistant:
Trying a promising first operation:
((5 + 5) + 5) + 9 = 24.
answer: ((5 + 5) + 5) + 9 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
F.1.3 AoT (BFS)
System:
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. When considering the next steps, do not choose operations that will result in a negative or fractional number. In order to help with the calculations, the numbers in the parentheses represent the numbers that are left after the operations, and they are in descending order.
Another thing we do: when there are only two numbers left in the parentheses, we check whether we can arrive at 24 using only basic arithmetic operations (+ - * /). Some examples regarding this idea:
(21, 2): 23, 19, 42, fractional
(30, 6): 36, 24 = 30 - 6 -> found it!
(8, 3): 11, 5, 24 = 8 * 3 yes
(12, 8): 20, 4, 72, fractional
In the second step, when there are three numbers left, we choose the most promising operations so that when we are left with two numbers, we will be able to get to 24.
So, when we reach 24 in the third step, that means we have found the solution, and we directly backtrack to write the answer.
In each step, we want to choose the best operations so that the numbers left will have a good chance to reach 24.
User:
14 8 8 2
Assistant:
Let's consider the most promising possible first steps:
1. 8 + 8: (16, 14, 2)
2. 14 - 8: (8, 6, 2)
3. 14 + 2: (16, 8, 8)
4. 8 / 2: (14, 8, 4)
Let's consider the most promising second steps:
1. (16, 14, 2)
- 16 - 14: (2, 2)
- 16 / 2: (8, 14)
- 14 * 2: (28, 16)
- 14 / 2: (7, 16)
2. (8, 6, 2)
- 8 - 6: (2, 2)
- 8 / 2: (4, 6)
- 6 * 2: (12, 8)
- 6 / 2: (3, 8)
3. (16, 8, 8)
- 16 - 8: (8, 8)
- 16 / 8: (2, 8)
- 8 * 8: (64, 16)
- 8 / 8: (1, 16)
4. (14, 8, 4)
- 14 - 8: (6, 4)
- 14 / 4: (3.5, 8)
- 8 * 4: (32, 14)
- 8 / 4: (2, 14)
Let's consider the most promising third steps:
1. 8 + 8 = 16
- 16 - 14: (2, 2): 4, 0, 4, 1
- 16 / 2: (8, 14): 22, 6, 112, fractional
- 14 * 2: (28, 16): 44, 12, 448, fractional
- 14 / 2: (7, 16): 23, 9, 112, fractional
2. 14 - 8 = 6
- 8 - 6: (2, 2): 4, 0, 4, 1
- 8 / 2: (6, 4): 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
14 - 8 = 6
Step 2:
8 / 2 = 4
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (14 - 8) * (8 / 2) = 24.
answer: (14 - 8) * (8 / 2) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 5 5 5
Assistant:
Let's consider the most promising first steps:
1. 9 - 5: (5, 5, 4)
2. 5 + 5: (9, 10, 5)
3. 9 + 5: (14, 5, 5)
4. 5 - 5: (9, 5, 0)
Let's consider the most promising second steps:
1. (5, 5, 4)
- 5 + 5: (10, 4)
- 5 * 4: (20, 5)
- 5 - 4: (5, 1)
- 5 / 4: (5, 1.25)
2. (9, 10, 5)
- 10 + 9: (19, 5)
- 10 - 5: (9, 5)
- 10 * 5: (50, 9)
- 10 / 5: (9, 2)
3. (14, 5, 5)
- 14 - 5: (9, 5)
- 14 + 5: (19, 5)
- 5 * 5: (25, 14)
- 5 / 5: (14, 1)
4. (9, 5, 0)
- 9 + 5: (14, 0)
- 9 - 5: (4, 0)
- 5 * 0: (9, 0)
- 5 / 0: (9, infinity)
Let's consider the most promising third steps:
1. 9 - 5 = 4
- 5 + 5: (10, 4): 14, 6, 40, fractional
- 5 * 4: (20, 5): 25, 15, 100, 4
- 5 - 4: (5, 1): 6, 4, 5, 5
- 5 / 4: (5, 1.25): 6.25, 3.75, 6.25, 4
2. 5 + 5 = 10
- 10 + 9: (19, 5): 24 = 19 + 5 -> found it!
Backtracking the solution:
Step 1:
5 + 5 = 10
Step 2:
10 + 9 = 19
Step 3:
19 + 5 = 24
Considering these steps: 24 = 19 + 5 = (10 + 9) + 5 = ((5 + 5) + 9) + 5 = 24.
answer: ((5 + 5) + 9) + 5 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 6 4 4
Assistant:
Let's consider the most promising first steps:
1. 8 * 6: (48, 4, 4)
2. 8 + 4: (12, 6, 4)
3. 8 - 6: (4, 4, 2)
4. 6 - 4: (8, 4, 2)
Let's consider the most promising second steps:
1. (48, 4, 4)
- 4 * 4: (48, 16)
- 48 / 4: (12, 4)
- 4 + 4: (48, 8)
- 48 - 4: (44, 4)
2. (12, 6, 4)
- 12 + 6: (18, 4)
- 6 * 4: (24, 12)
- 6 + 4: (12, 10)
- 12 / 4: (6, 3)
3. (4, 4, 2)
- 4 * 4: (16, 2)
- 4 + 2: (6, 4)
- 4 + 4: (8, 2)
- 4 * 2: (8, 4)
4. (8, 4, 2)
- 8 * 4: (32, 2)
- 4 * 2: (8, 8)
- 8 + 4: (12, 2)
- 8 / 4: (4, 2)
Let's consider the most promising third steps:
1. 8 * 6 = 48
- 4 * 4: (48, 16): 64, 32, big, 3
- 48 / 4: (12, 4): 16, 8, 48, 3
- 4 + 4: (48, 8): 56, 40, big, 6
- 48 - 4: (44, 4): 48, 40, big, 11
2. 8 + 4 = 12
- 12 + 6: (18, 4): 22, 14, 72, fractional
- 6 * 4: (24, 12): 36, 12, 288, 2
- 6 + 4: (12, 10): 22, 2, 120, fractional
- 12 / 4: (6, 3): 9, 3, 18, 2
3. 8 - 6 = 2
- 4 * 4: (16, 2): 19, 14, 32, 8
- 4 + 2: (6, 4): 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 - 6 = 2
Step 2:
4 + 2 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (4 + 2) * 4 = (4 + (8 - 6)) * 4 = 24.
answer: (4 + (8 - 6)) * 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 10 9 4
Assistant:
Let's consider the most promising first steps:
1. 13 - 4: (10, 9, 9)
2. 10 - 4: (13, 9, 6)
3. 13 + 9: (22, 10, 4)
4. 10 - 9: (13, 4, 1)
Let's consider the most promising second steps:
1. (10, 9, 9)
- 10 + 9: (19, 9)
- 10 - 9: (9, 1)
- 9 + 9: (18, 10)
- 9 / 9: (10, 1)
2. (13, 9, 6)
- 9 + 6: (15, 13)
- 9 * 6: (54, 13)
- 13 - 9: (6, 4)
- 13 - 6: (9, 7)
3. (22, 10, 4)
- 22 - 10: (12, 4)
- 22 - 4: (18, 10)
- 10 * 4: (40, 22)
- 10 / 4: (22, 5.5)
4. (13, 4, 1)
- 13 - 4: (9, 1)
- 13 * 4: (52, 1)
- 4 - 1: (13, 3)
- 13 - 1: (12, 4)
Let's consider the most promising third steps:
1. 13 - 4 = 9
- 10 + 9: (19, 9): 28, 10, 171, fractional
- 10 - 9: (9, 1): 10, 8, 9, 9
- 9 + 9: (18, 10): 28, 8, 180, fractional
- 9 / 9: (10, 1): 11, 9, 10, 10
2. 10 - 4 = 6
- 9 + 6: (15, 13): 28, 2, 195, fractional
- 9 * 6: (54, 13): 67, 41, 702, fractional
- 13 - 9: (6, 4): 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
13 - 9 = 4
Step 2:
10 - 4 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (10 - 4) * (13 - 9) = 24.
answer: (10 - 4) * (13 - 9) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 8 5 4
Assistant:
Let's consider the most promising first steps:
1. 5 + 4: (9, 8, 8)
2. 8 / 4: (8, 5, 2)
3. 8 - 5: (8, 4, 3)
4. 8 / 8: (5, 4, 1)
Let's consider the most promising second steps:
1. (9, 8, 8)
- 9 + 8: (17, 8)
- 8 + 8: (16, 9)
- 9 * 8: (72, 8)
- 8 / 8: (9, 1)
2. (8, 5, 2)
- 5 - 2: (8, 3)
- 8 - 2: (6, 5)
- 8 + 5: (13, 2)
- 5 * 2: (10, 8)
3. (8, 4, 3)
- 8 - 4: (4, 3)
- 8 - 3: (5, 4)
- 4 * 3: (12, 8)
- 4 + 3: (8, 7)
4. (5, 4, 1)
- 5 + 4: (9, 1)
- 5 - 4: (1, 1)
- 4 - 1: (5, 3)
- 5 + 1: (6, 4)
Let's consider the most promising third steps:
1. 5 + 4 = 9
- 9 + 8: (17, 8): 25, 9, 136, fractional
- 8 + 8: (16, 9): 25, 7, 144, fractional
- 9 * 8: (72, 8): 80, 64, 576, 9
- 8 / 8: (9, 1): 10, 8, 9, 9
2. 8 / 4 = 2
- 5 - 2: (8, 3): 11, 5, 24 = 8 * 3 -> found it!
Backtracking the solution:
Step 1:
8 / 4 = 2
Step 2:
5 - 2 = 3
Step 3:
8 * 3 = 24
Considering these steps: 24 = 8 * 3 = 8 * (5 - 2) = 8 * (5 - (8 / 4)) = 24.
answer: 8 * (5 - (8 / 4)) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 11 1 1
Assistant:
Let's consider the most promising first steps:
1. 11 - 1: (11, 10, 1)
2. 11 + 1: (12, 11, 1)
3. 11 + 11: (22, 1, 1)
4. 1 + 1: (11, 11, 2)
Let's consider the most promising second steps:
1. (11, 10, 1)
- 11 + 10: (21, 1)
- 10 + 1: (11, 11)
- 10 / 1: (11, 10)
- 11 / 1: (11, 10)
2. (12, 11, 1)
- 12 + 11: (23, 1)
- 11 + 1: (12, 11)
- 11 - 1: (12, 10)
- 12 + 1: (13, 11)
3. (22, 1, 1)
- 22 + 1: (23, 1)
- 1 + 1: (22, 2)
- 22 / 1: (22, 1)
- 1 / 1: (22, 1)
4. (11, 11, 2)
- 11 + 11: (22, 2)
- 11 * 2: (22, 11)
- 11 + 2: (13, 11)
- 2 * 11: (22, 11)
Let's consider the most promising third steps:
1. 11 - 1 = 10
- 11 + 10: (21, 1): 22, 20, 21, 20
- 10 + 1: (11, 11): 22, 0, 121, 1
- 10 / 1: (11, 10): 21, 1, 110, fractional
- 11 / 1: (11, 10): 21, 1, 110, fractional
2. 11 + 1 = 12
- 12 + 11: (23, 1): 24 = 23 + 1 -> found it!
Backtracking the solution:
Step 1:
11 + 1 = 12
Step 2:
12 + 11 = 23
Step 3:
23 + 1 = 24
Considering these steps: 24 = 23 + 1 = (12 + 11) + 1 = ((11 + 1) + 11) + 1 = 24.
answer: ((11 + 1) + 11) + 1 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 7 4 1
Assistant:
Let's consider the most promising first steps:
1. 7 * 4: (28, 11, 1)
2. 7 + 1: (11, 8, 4)
3. 4 + 1: (11, 7, 5)
4. 11 - 4: (7, 3, 1)
Let's consider the most promising second steps:
1. (28, 11, 1)
- 28 - 11: (17, 1)
- 28 - 1: (27, 11)
- 11 + 1: (29, 28)
- 11 - 1: (28, 10)
2. (11, 8, 4)
- 11 + 8: (19, 4)
- 8 + 4: (12, 11)
- 11 - 8: (4, 3)
- 8 - 4: (11, 4)
3. (11, 7, 5)
- 11 - 5: (7, 6)
- 7 - 5: (11, 2)
- 7 * 5: (35, 11)
- 11 + 5: (16, 7)
4. (7, 3, 1)
- 7 - 3: (4, 1)
- 7 * 3: (21, 1)
- 3 + 1: (7, 4)
- 7 - 1: (6, 3)
Let's consider the most promising third steps:
1. 7 * 4 = 28
- 28 - 11: (17, 1): 18, 16, 17, 17
- 28 - 1: (27, 11): 38, 16, 297, 2.45
- 11 + 1: (29, 28): 57, 1, 812, 1.03
- 11 - 1: (28, 10): 38, 18, 280, 2.8
2. 7 + 1 = 8
- 11 + 8: (19, 4): 23, 15, 76, 4.75
- 8 + 4: (12, 11): 23, 7, 132, 3
- 11 - 8: (4, 3): 7, 1, 12, 1.33
- 8 - 4: (11, 4): 15, 7, 44, 2.75
3. 4 + 1 = 5
- 11 - 5: (7, 6): 13, 1, 42, 1.17
- 7 - 5: (11, 2): 13, 9, 22, 5.5
- 7 * 5: (35, 11): 46, 24 = 35 - 11 -> found it!
Backtracking the solution:
Step 1:
4 + 1 = 5
Step 2:
7 * 5 = 35
Step 3:
35 - 11 = 24
Considering these steps: 24 = 35 - 11 = (7 * 5) - 11 = (7 * (4 + 1)) - 11 = 24.
answer: (7 * (4 + 1)) - 11 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 5 4 3
Assistant:
Let's consider the most promising first steps to reach 24:
1. 11 * 3: (33, 5, 4)
2. 5 + 4: (11, 9, 3)
3. 11 - 4: (7, 5, 3)
4. 4 + 3: (11, 7, 5)
Let's consider the most promising second steps to reach 24:
1. (33, 5, 4)
- 33 - 5: (28, 4)
- 33 - 4: (29, 5)
- 5 + 4: (33, 9)
- 5 * 4: (20, 33)
2. (11, 9, 3)
- 11 * 3: (33, 9)
- 11 + 3: (14, 9)
- 9 + 3: (12, 11)
- 9 / 3: (11, 3)
3. (7, 5, 3)
- 7 - 3: (4, 5)
- 7 + 5: (12, 3)
- 5 + 3: (8, 7)
- 5 * 3: (7, 6)
4. (11, 7, 5)
- 11 + 7: (18, 5)
- 11 - 5: (6, 7)
- 7 + 5: (12, 11)
- 7 - 5: (2, 11)
Let's consider the most promising third steps to reach 24:
1. 11 * 3 = 33
- 33 - 5: (28, 4): 32, 24 = 28 - 4 -> found it!
Backtracking the solution:
Step 1:
11 * 3 = 33
Step 2:
33 - 5 = 28
Step 3:
28 - 4 = 24
Considering these steps: 24 = 28 - 4 = (33 - 5) - 4 = ((11 * 3) - 5) - 4 = 24.
answer: ((11 * 3) - 5) - 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 12 5 2
Assistant:
Let's consider the most promising first steps to reach 24:
1. 12 / 2: (13, 6, 5)
2. 13 + 5: (18, 12, 2)
3. 12 - 5: (13, 7, 2)
4. 5 - 2: (13, 12, 3)
Let's consider the most promising second steps to reach 24:
1. (13, 6, 5)
- 13 + 6: (19, 5)
- 13 - 5: (8, 6)
- 6 * 5: (30, 13)
- 6 - 5: (13, 1)
2. (18, 12, 2)
- 18 * 2: (36, 12)
- 12 / 2: (18, 6)
- 18 - 2: (16, 12)
- 12 * 2: (24, 18)
3. (13, 7, 2)
- 13 + 7: (20, 2)
- 13 - 2: (11, 7)
- 7 * 2: (14, 13)
- 7 - 2: (13, 5)
4. (13, 12, 3)
- 13 * 3: (39, 12)
- 13 - 3: (10, 12)
- 12 * 3: (36, 13)
- 12 / 3: (13, 4)
Let's consider the most promising third steps to reach 24:
1. 12 / 2 = 6
- 13 + 6: (19, 5): 24 = 19 + 5 -> found it!
Backtracking the solution:
Step 1:
12 / 2 = 6
Step 2:
13 + 6 = 19
Step 3:
19 + 5 = 24
Considering these steps: 24 = 19 + 5 = (13 + 6) + 5 = (13 + (12 / 2)) + 5 = 24.
answer: (13 + (12 / 2)) + 5 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 8 2 1
Assistant:
Let's consider the most promising first steps to reach 24:
1. 9 - 1: (8, 8, 2)
2. 8 + 2: (10, 9, 1)
3. 9 + 1: (10, 8, 2)
4. 2 + 1: (9, 8, 3)
Let's consider the most promising second steps to reach 24:
1. (8, 8, 2)
- 8 + 8: (16, 2)
- 8 * 2: (16, 8)
- 8 - 2: (6, 8)
- 8 + 8: (16, 2)
2. (10, 9, 1)
- 10 + 9: (19, 1)
- 10 - 1: (9, 9)
- 9 + 1: (10, 10)
- 9 / 1: (8, 10)
3. (10, 8, 2)
- 10 + 8: (18, 2)
- 10 * 2: (8, 8)
- 8 + 2: (10, 10)
- 8 / 2: (6, 10)
4. (9, 8, 3)
- 9 + 8: (17, 3)
- 9 / 3: (6, 8)
- 8 + 3: (11, 9)
- 8 * 3: (24, 9)
Let's consider the most promising third steps to reach 24:
1. 9 - 1 = 8
- 8 + 8: (16, 2): 18, 14, 32, 8
- 8 * 2: (16, 8): 24 = 16 + 8 -> found it!
Backtracking the solution:
Step 1:
9 - 1 = 8
Step 2:
8 * 2 = 16
Step 3:
16 + 8 = 24
Considering these steps: 24 = 16 + 8 = (8 * 2) + 8 = ((9 - 1) * 2) + 8 = 24.
answer: ((9 - 1) * 2) + 8 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
F.2 AoT (Short)
System:
Use numbers and basic arithmetic operations (+ - * /) to obtain 24. When considering the next steps, do not choose operations that will result in a negative or fractional number. In order to help with the calculations, the numbers in the parentheses represent the numbers that are left after the operations, and they are in descending order.
Another thing we do: when there are only two numbers left in the parentheses, we check whether we can arrive at 24 using only basic arithmetic operations (+ - * /). Some examples regarding this idea:
(21 2) no
since 21 + 2 = 23, 21 - 2 = 19, 21 * 2 = 42, 21 / 2 = 10.5, none of which is equal to 24.
(30 6) 30 - 6 = 24 yes
(8 3) 8 * 3 = 24 yes
(12 8) no
(48 2) 48 / 2 = 24 yes
Most importantly, do not give up; all the numbers that will be given do indeed have a solution.
User:
14 8 8 2
Assistant:
Trying a promising first operation:
1. 8 / 2: (14, 8, 4)
- 14 + 8: (22, 4) 26, 18, 88, fractional
- 14 - 8: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 / 2 = 4
Step 2:
14 - 8 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (14 - 8) * 4 = (14 - 8) * (8 / 2) = 24.
answer: (14 - 8) * (8 / 2) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 5 5 5
Assistant:
Trying a promising first operation:
1. 5 + 5: (10, 9, 5)
- 10 + 9: (19, 5) 24 = 19 + 5 -> found it!
Backtracking the solution:
Step 1:
5 + 5 = 10
Step 2:
10 + 9 = 19
Step 3:
19 + 5 = 24
Considering these steps: 24 = 19 + 5 = (10 + 9) + 5 = ((5 + 5) + 9) + 5 = 24.
answer: ((5 + 5) + 9) + 5 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 6 4 4
Assistant:
Trying a promising first operation:
1. 8 - 6: (4, 4, 2)
- 4 + 4: (8, 2) 10, 6, 16, 4
- 4 - 4: (2, 0) 2, 2, 0, undefined
- 4 * 4: (16, 2) 18, 14, 32, 8
- 4 / 4: (2, 1) 3, 1, 2, 2
- 4 + 2: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
8 - 6 = 2
Step 2:
4 + 2 = 6
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = (4 + 2) * 4 = (4 + (8 - 6)) * 4 = 24.
answer: (4 + (8 - 6)) * 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 10 9 4
Assistant:
Trying a promising first operation:
1. 10 - 4: (13, 9, 6)
- 13 + 9: (22, 6) 28, 16, 132, fractional
- 13 - 9: (6, 4) 10, 2, 24 = 6 * 4 -> found it!
Backtracking the solution:
Step 1:
10 - 4 = 6
Step 2:
13 - 9 = 4
Step 3:
6 * 4 = 24
Considering these steps: 24 = 6 * 4 = 6 * (13 - 9) = (10 - 4) * (13 - 9) = 24.
answer: (10 - 4) * (13 - 9) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
8 8 5 4
Assistant:
Trying a promising first operation:
1. 8 / 4: (8, 5, 2)
- 8 + 5: (13, 2) 15, 11, 26, fractional
- 8 - 5: (3, 2) 5, 1, 6, 1.5
- 8 * 5: (40, 2) 42, 38, 80, 20
- 8 / 5: fractional
- 8 + 2: (10, 5) 15, 5, 50, 2
- 8 - 2: (6, 5) 11, 1, 30, fractional
- 8 * 2: (16, 5) 21, 11, 80, fractional
- 8 / 2: (5, 4) 9, 1, 20, fractional
- 5 + 2: (7, 8) 15, 1, 56, fractional
- 5 - 2: (8, 3) 11, 5, 24 = 8 * 3 -> found it!
Backtracking the solution:
Step 1:
8 / 4 = 2
Step 2:
5 - 2 = 3
Step 3:
8 * 3 = 24
Considering these steps: 24 = 8 * 3 = 8 * (5 - 2) = 8 * (5 - (8 / 4)) = 24.
answer: 8 * (5 - (8 / 4)) = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 11 1 1
Assistant:
Trying a promising first operation:
1. 11 + 1: (12, 11, 1)
- 12 + 11: (23, 1) 24 = 23 + 1 -> found it!
Backtracking the solution:
Step 1:
11 + 1 = 12
Step 2:
12 + 11 = 23
Step 3:
23 + 1 = 24
Considering these steps: 24 = 23 + 1 = (12 + 11) + 1 = ((11 + 1) + 11) + 1 = 24.
answer: ((11 + 1) + 11) + 1 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 7 4 1
Assistant:
Trying a promising first operation:
1. 4 + 1: (11, 7, 5)
- 11 + 7: (18, 5) 23, 13, 90, fractional
- 11 - 7: (5, 4) 9, 1, 20, fractional
- 11 * 7: (77, 5) 82, 72, 385, fractional
- 11 / 7: fractional
- 11 + 5: (16, 7) 23, 9, 112, fractional
- 11 - 5: (7, 6) 13, 1, 42, fractional
- 11 * 5: (55, 7) 62, 48, 385, fractional
- 11 / 5: fractional
- 7 + 5: (12, 11) 23, 1, 132, fractional
- 7 - 5: (11, 2) 13, 9, 22, fractional
- 7 * 5: (35, 11) 46, 24 = 35 - 11 -> found it!
Backtracking the solution:
Step 1:
4 + 1 = 5
Step 2:
7 * 5 = 35
Step 3:
35 - 11 = 24
Considering these steps: 24 = 35 - 11 = (7 * 5) - 11 = (7 * (4 + 1)) - 11 = 24.
answer: (7 * (4 + 1)) - 11 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
11 5 4 3
Assistant:
Trying a promising first operation:
1. 11 * 3: (33, 5, 4)
- 33 + 5: (38, 4) 42, 34, 152, fractional
- 33 - 5: (28, 4) 32, 24 = 28 - 4 -> found it!
Backtracking the solution:
Step 1:
11 * 3 = 33
Step 2:
33 - 5 = 28
Step 3:
28 - 4 = 24
Considering these steps: 24 = 28 - 4 = (33 - 5) - 4 = ((11 * 3) - 5) - 4 = 24.
answer: ((11 * 3) - 5) - 4 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
13 12 5 2
Assistant:
Trying a promising first operation:
1. 13 + 5: (18, 12, 2)
- 18 + 12: (30, 2) 32, 28, 60, 15
- 18 - 12: (6, 2) 8, 4, 12, 3
- 18 * 12: (216, 2) 218, 214, 432, 108
- 18 / 12: fractional
- 18 + 2: (20, 12) 32, 8, 240, fractional
- 18 - 2: (16, 12) 28, 4, 192, fractional
- 18 * 2: (36, 12) 48, 24 = 36 - 12 -> found it!
Backtracking the solution:
Step 1:
13 + 5 = 18
Step 2:
18 * 2 = 36
Step 3:
36 - 12 = 24
Considering these steps: 24 = 36 - 12 = (18 * 2) - 12 = ((13 + 5) * 2) - 12 = 24.
answer: ((13 + 5) * 2) - 12 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
9 8 2 1
Assistant:
Trying a promising first operation:
1. 9 - 1: (8, 8, 2)
- 8 + 8: (16, 2) 18, 14, 32, 8
- 8 - 8: (2, 0) 2, 2, 0, undefined
- 8 * 8: (64, 2) 66, 62, 128, 32
- 8 / 8: (2, 1) 3, 1, 2, 2
- 8 + 2: (10, 8) 18, 2, 80, fractional
- 8 - 2: (6, 8) 14, -2, 48, fractional
- 8 * 2: (16, 8) 24 = 16 + 8 -> found it!
Backtracking the solution:
Step 1:
9 - 1 = 8
Step 2:
8 * 2 = 16
Step 3:
16 + 8 = 24
Considering these steps: 24 = 16 + 8 = (8 * 2) + 8 = ((9 - 1) * 2) + 8 = 24.
answer: ((9 - 1) * 2) + 8 = 24.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
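The exemplars above trace a depth-first search: try a first operation, enumerate follow-up operations on the remaining numbers, and backtrack the arithmetic once 24 is reached. As a rough illustration only (this sketch is ours, not part of the prompt), that search can be written as:

```python
from itertools import permutations

def solve24(items):
    """DFS over pairwise operations, mirroring the search the exemplars trace.

    `items` is a list of (value, expression) pairs; returns an expression
    string that evaluates to 24, or None if no solution exists.
    """
    if len(items) == 1:
        value, expr = items[0]
        return expr if abs(value - 24) < 1e-6 else None
    # Pick an ordered pair of numbers, combine them, and recurse.
    for i, j in permutations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [(a + b, f"({ea} + {eb})"),
                      (a - b, f"({ea} - {eb})"),
                      (a * b, f"({ea} * {eb})")]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        for value, expr in candidates:
            found = solve24(rest + [(value, expr)])
            if found is not None:
                return found
    return None

print(solve24([(13, "13"), (10, "10"), (9, "9"), (4, "4")]))
```

Unlike this exhaustive search, the prompt only demonstrates the search's shape; the LLM is free to prioritize "promising" first operations, which is where the intuition-guided pruning discussed in the main text enters.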
F.3 5×5 Mini Crosswords Prompts
In the following, we give the exact prompts used for the methods in the experiments section, along with a selection of the generations from the LLM. In the prompts, the keywords “System”, “User”, and “Assistant” denote the roles in the OpenAI API in chat completion mode. The line breaks show the turns between the user and assistant in the API.
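For concreteness, such a multi-turn prompt maps onto the chat-completion message format roughly as follows. This is a sketch of ours with placeholder exemplar strings, not the exact prompts:

```python
# Sketch: packing few-shot (user, assistant) exemplar turns into the
# chat-completion `messages` list. The exemplar strings are placeholders.
exemplars = [
    ("h1. A lunar valley\n...", "There are two horizontal and two vertical words. ..."),
]
test_input = "h1. One who saws\n..."  # the new puzzle to solve

messages = []
for user_turn, assistant_turn in exemplars:
    messages.append({"role": "user", "content": user_turn})
    messages.append({"role": "assistant", "content": assistant_turn})
messages.append({"role": "user", "content": test_input})
```

The final user turn carries the unsolved instance, so the model continues in the style of the preceding assistant turns.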
F.3.1 AoT
User:
h1. A lunar valley
h2. A fatty oil
h3. To entice
h4. To lower; to reduce
h5. A solitary person
v1. According to the roster
v2. Another name for Port-Francqui
v3. An illicit lover; a European lake
v4. To lisp
v5. To come in
The words I already found are:
h1. rille
h3. tempt
v2. ilebo
v5. enter
Assistant:
There are two horizontal and two vertical words. Since they are the same amount, let's consider the next unfound horizontal word, which is h2. Since we are considering the second horizontal word, we need to look at the second letters of the vertical words we found:
- What is the second letter of ilebo: l
- What is the second letter of enter: n
The word we are looking for fits the clue "A fatty oil" and needs to have the letters l and n in its (v2, two) second and (v5, five) fifth letters: _ l _ _ n. The candidates are: oleic, olein, olive.
- oleic (o l e i c, _ l _ _ n) doesn't fit
- olein (o l e i n, _ l _ _ n) fits
- olive (o l i v e, _ l _ _ n) doesn't fit
We add the word olein for h2. The current words are:
h1. rille
h2. olein
h3. tempt
v2. ilebo
v5. enter
Now, let's consider the next unfound vertical word, which is v1. Since we are considering the first vertical word, we need to look at the first letters of the horizontal words we found:
- What is the first letter of rille: r
- What is the first letter of olein: o
- What is the first letter of tempt: t
The word we are looking for fits the clue "According to the roster" and needs to have the letters r, o, and t in its (h1, one) first, (h2, two) second, and (h3, three) third letters: r o t _ _. The candidates are: none
Now, let's consider the next unfound vertical word, which is v3. Since we are considering the third vertical word, we need to look at the third letters of the horizontal words we found:
- What is the third letter of rille: l
- What is the third letter of olein: e
- What is the third letter of tempt: m
The word we are looking for fits the clue "An illicit lover; a European lake" and needs to have the letters l, e, and m in its (h1, one) first, (h2, two) second, and (h3, three) third letters: l e m _ _. The candidates are: leman.
- leman (l e m a n, l e m _ _) fits
We add the word leman for v3. The current words are:
h1. rille
h2. olein
h3. tempt
v2. ilebo
v3. leman
v5. enter
Now, let's consider the next unfound horizontal word, which is h4. Since we are considering the fourth horizontal word, we need to look at the fourth letters of the vertical words we found:
- What is the fourth letter of ilebo: b
- What is the fourth letter of leman: a
- What is the fourth letter of enter: e
The word we are looking for fits the clue "To lower; to reduce" and needs to have the letters b, a, and e in its (v2, two) second, (v3, three) third, and (v5, five) fifth letters: _ b a _ e. The candidates are: abase, abate.
- abase (a b a s e, _ b a _ e) fits
- abate (a b a t e, _ b a _ e) fits
Since both fit, let's consider another word.
Now, let's consider the last unfound vertical word, which is v4. Since we are considering the fourth vertical word, we need to look at the fourth letters of the horizontal words we found:
- What is the fourth letter of rille: l
- What is the fourth letter of olein: i
- What is the fourth letter of tempt: p
The word we are looking for fits the clue "To lisp" and needs to have the letters l, i, and p in its (h1, one) first, (h2, two) second, and (h3, three) third letters: l i p _ _. The candidates are: lisps.
- lisps (l i s p s, l i p _ _) doesn't fit
We have considered all of the words; for h4 we have two options, abase and abate, so let's choose one of them.
answer:
h1. rille
h2. olein
h3. tempt
h4. abase
v2. ilebo
v3. leman
v5. enter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
h1. One who saws
h2. A fungus genus
h3. An assessor
h4. Pasture land
h5. Receiving by the ear
v1. To swell; to increase
v2. The Brazilian macaw; an Australian bird
v3. A Timorese island
v4. Excessive fluid accumulation
v5. Dewy; roscid
The words I already found are:
h1. sawer
h3. rater
v1. surge
v4. edema
Assistant:
There are two horizontal and two vertical words. Since they are the same amount, let's consider the next unfound horizontal word, which is h2. Since we are considering the second horizontal word, we need to look at the second letters of the vertical words we found:
- What is the second letter of surge: u
- What is the second letter of edema: d
The word we are looking for fits the clue "A fungus genus" and needs to have the letters u and d in its (v1, one) first and (v4, four) fourth letters: u _ _ d _. The candidates are: uredo.
- uredo (u r e d o, u _ _ d _) fits
We add the word uredo for h2. The current words are:
h1. sawer
h2. uredo
h3. rater
v1. surge
v4. edema
Now, let's consider the next unfound vertical word, which is v2. Since we are considering the second vertical word, we need to look at the second letters of the horizontal words we found:
- What is the second letter of sawer: a
- What is the second letter of uredo: r
- What is the second letter of rater: a
The word we are looking for fits the clue "The Brazilian macaw; an Australian bird" and needs to have the letters a, r, and a in its (h1, one) first, (h2, two) second, and (h3, three) third letters: a r a _ _. The candidates are: arara.
- arara (a r a r a, a r a _ _) fits
We add the word arara for v2. The current words are:
h1. sawer
h2. uredo
h3. rater
v1. surge
v2. arara
v4. edema
Now, let's consider the next unfound horizontal word, which is h4. Since we are considering the fourth horizontal word, we need to look at the fourth letters of the vertical words we found:
- What is the fourth letter of surge: g
- What is the fourth letter of arara: r
- What is the fourth letter of edema: m
The word we are looking for fits the clue "Pasture land" and needs to have the letters g, r, and m in its (v1, one) first, (v2, two) second, and (v4, four) fourth letters: g r _ m _. The candidates are: grama.
- grama (g r a m a, g r _ m _) fits
We add the word grama for h4. The current words are:
h1. sawer
h2. uredo
h3. rater
h4. grama
v1. surge
v2. arara
v4. edema
Now, let's consider the next unfound vertical word, which is v3. Since we are considering the third vertical word, we need to look at the third letters of the horizontal words we found:
- What is the third letter of sawer: w
- What is the third letter of uredo: e
- What is the third letter of rater: t
- What is the third letter of grama: a
The word we are looking for fits the clue "A Timorese island" and needs to have the letters w, e, t, and a in its (h1, one) first, (h2, two) second, (h3, three) third, and (h4, four) fourth letters: w e t a _. The candidates are: wetar.
- wetar (w e t a r, w e t a _) fits
We add the word wetar for v3. The current words are:
h1. sawer
h2. uredo
h3. rater
h4. grama
v1. surge
v2. arara
v3. wetar
v4. edema
Now, let's consider the last unfound horizontal word, which is h5. Since we are considering the fifth horizontal word, we need to look at the fifth letters of the vertical words we found:
- What is the fifth letter of surge: e
- What is the fifth letter of arara: a
- What is the fifth letter of wetar: r
- What is the fifth letter of edema: a
The word we are looking for fits the clue "Receiving by the ear" and needs to have the letters e, a, r, and a in its (v1, one) first, (v2, two) second, (v3, three) third, and (v4, four) fourth letters: e a r a _. The candidates are: earal.
- earal (e a r a l, e a r a _) fits
We add the word earal for h5. The current words are:
h1. sawer
h2. uredo
h3. rater
h4. grama
h5. earal
v1. surge
v2. arara
v3. wetar
v4. edema
Since we found all horizontal words:
answer:
h1. sawer
h2. uredo
h3. rater
h4. grama
h5. earal
v1. surge
v2. arara
v3. wetar
v4. edema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User:
h1. Dandruff; scum; the bull-trout
h2. One who greets; to vacillate; a British river
h3. A Turkish written decree
h4. Mignon; petty; little
h5. A bishop's permission for a priest to leave a diocese
v1. To steal; to brush across
v2. A sedge (a primitive three-sided grass)
v3. Grape jam
v4. A flatworm larva
v5. Ore refuse; to prepare material for glass by heat
The words I already found are:
h2. waver
h5. exeat
v1. swipe
v2. carex
Assistant:
Let's consider the next unfound horizontal word, which is h1. Since we are considering the first horizontal word, we need to look at the first letters of the vertical words we found:
- What is the first letter of swipe: s
- What is the first letter of carex: c
The word we are looking for fits the clue "Dandruff; scum; the bull-trout" and needs to have the letters s and c in its (v1, one) first and (v2, two) second letters: s c _ _ _. The candidates are: scurf.
- scurf (s c u r f, s c _ _ _) fits
We add the word scurf for h1. The current words are:
h1. scurf
h2. waver
h5. exeat
v1. swipe
v2. carex
Now, let's consider the next unfound vertical word, which is v3. Since we are considering the third vertical word, we need to look at the third letters of the horizontal words we found:
- What is the third letter of scurf: u
- What is the third letter of waver: v
- What is the third letter of exeat: e
The word we are looking for fits the clue "Grape jam" and needs to have the letters u, v, and e in its (h1, one) first, (h2, two) second, and (h5, five) fifth letters: u v _ _ e. The candidates are: uvetta.
- uvetta (u v e t t a, u v _ _ e) doesn't fit
Let's consider the next unfound vertical word, which is v4. Since we are considering the fourth vertical word, we need to look at the fourth letters of the horizontal words we found:
- What is the fourth letter of scurf: r
- What is the fourth letter of waver: e
- What is the fourth letter of exeat: a
The word we are looking for fits the clue "A flatworm larva" and needs to have the letters r, e, and a in its (h1, one) first, (h2, two) second, and (h5, five) fifth letters: r e _ _ a. The candidates are: redia.
- redia (r e d i a, r e _ _ a) fits
We add the word redia for v4. The current words are:
h1. scurf
h2. waver
h5. exeat
v1. swipe
v2. carex
v4. redia
Now, let's consider the next unfound horizontal word, which is h3. Since we are considering the third horizontal word, we need to look at the third letters of the vertical words we found:
- What is the third letter of swipe: i
- What is the third letter of carex: r
- What is the third letter of redia: d
The word we are looking for fits the clue "A Turkish written decree" and needs to have the letters i, r, and d in its (v1, one) first, (v2, two) second, and (v4, four) fourth letters: i r _ d _. The candidates are: irade.
- irade (i r a d e, i r _ d _) fits
We add the word irade for h3. The current words are:
h1. scurf
h2. waver
h3. irade
h5. exeat
v1. swipe
v2. carex
v4. redia
Now, let's consider the last unfound vertical word, which is v5. Since we are considering the fifth vertical word, we need to look at the fifth letters of the horizontal words we found:
- What is the fifth letter of scurf: f
- What is the fifth letter of waver: r
- What is the fifth letter of irade: e
- What is the fifth letter of exeat: t
The word we are looking for fits the clue "Ore refuse; to prepare material for glass by heat" and needs to have the letters f, r, e, and t in its (h1, one) first, (h2, two) second, (h3, three) third, and (h5, five) fifth letters: f r e _ t. The candidates are: frett.
- frett (f r e t t, f r e _ t) fits
We add the word frett for v5. The current words are:
h1. scurf
h2. waver
h3. irade
h5. exeat
v1. swipe
v2. carex
v4. redia
v5. frett
Since we found all horizontal words:
answer:
h1. scurf
h2. waver
h3. irade
h5. exeat
v1. swipe
v2. carex
v4. redia
v5. frett
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
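The recurring "fits / doesn't fit" checks in these exemplars reduce to matching a candidate word against a partial letter pattern such as "l e m _ _". A minimal sketch of that check (ours, for illustration only):

```python
def fits(word, pattern):
    """Check a candidate against a partial pattern like 'l e m _ _'.

    '_' matches any letter; the word must have exactly as many letters
    as the pattern has slots.
    """
    slots = pattern.split()
    if len(word) != len(slots):
        return False
    return all(s == "_" or s == c for s, c in zip(slots, word))

print(fits("leman", "l e m _ _"))  # True: the v3 check in the first exemplar
print(fits("lisps", "l i p _ _"))  # False: the third slot requires 'p'
```

The length check also rules out over-long candidates such as "uvetta" against a five-slot pattern, matching the exemplar's rejection.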
F.3.2 Propose Words
User:
h1. A lunar valley
h2. A fatty oil
h3. To entice
h4. To lower; to reduce
h5. A solitary person
v1. According to the roster
v2. Another name for Port-Francqui
v3. An illicit lover; a European lake
v4. To lisp
v5. To come in
Assistant:
All possible 5-letter words for each clue; for each clue we give 10 options:
h1. RILLE, VALLIS, CRATER, BASINS, PLAINS
h2. OLEIC, OLEIN, GREASE, BUTTER, ANIMAL
h3. LURE, TEMPT, SEDUCE, ENTICE, DRAWIN
h4. ABASE, LOWER, REDUCE, LESSEN, DEBASE
h5. HERMIT, ALONE, LONER, SINGLE, SOLITARY
v1. ONTAP, LISTED, ROSTER, SCHEDULE, PLANNED
v2. ILEBO, PORTF, CONGO, AFRICA, COLONY
v3. LOVER, AMOUR, GENEVA, LEMAN, ZURICH
v4. SLUR, LISPS, STUTTER, MUMBLE, STAMMER
v5. ENTER, ARRIVE, COMEIN, APPEAR, SHOWUP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
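Note that the model's proposals are not all valid 5-letter words (e.g. LURE, SCHEDULE, SOLITARY above), so a downstream step would typically parse each line and keep only length-5 candidates. A small sketch of ours, not part of the method:

```python
def parse_candidates(line):
    """Parse a line like 'h2. OLEIC, OLEIN, ...' into (slot, 5-letter words)."""
    slot, _, rest = line.partition(". ")
    words = [w.strip().lower() for w in rest.split(",")]
    # Keep only alphabetic candidates of exactly five letters.
    return slot, [w for w in words if len(w) == 5 and w.isalpha()]

slot, words = parse_candidates("h2. OLEIC, OLEIN, GREASE, BUTTER, ANIMAL")
print(slot, words)  # h2 ['oleic', 'olein']
```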
F.4 Creative Writing
F.4.1 AoT
Write a coherent passage of 4 short paragraphs. The end sentence of each paragraph must be:
{0}
Firstly, make five different plans for a coherent passage, then write. Your output should be of the following format:
Plan 1:
Your plan here.
Plan 2:
Your plan here.
Plan 3:
Your plan here.
Plan 4:
Your plan here.
Plan 5:
Your plan here.
Secondly, given an instruction and several plans, decide which choice is most promising. Analyze each choice in detail, then conclude in the last line "The best choice is {{s}}", where s is the integer id of the choice.
Thirdly, write the passage according to that chosen plan in the most coherent way. Add "Passage:" before writing the passage under it.
Passage:
Your passage here.
Finally, refine the passage in the most coherent way, but you still have to end each paragraph with the given sentences as before.
Final Passage:
Final passage here.
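The {0} slot above is filled with the four required end sentences before the prompt is issued. Note the doubled braces in {{s}}: with Python's str.format, they survive as a literal {s} while {0} is substituted. A sketch with invented placeholder sentences, not items from the benchmark:

```python
# Sketch of filling the template; the end sentences are placeholders.
template = ("Write a coherent passage of 4 short paragraphs. "
            "The end sentence of each paragraph must be:\n{0}\n"
            'Conclude in the last line "The best choice is {{s}}".')
end_sentences = ["Sentence one.", "Sentence two.",
                 "Sentence three.", "Sentence four."]
prompt = template.format("\n".join(end_sentences))
print(prompt)
```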
F.4.2 Score Prompt
Analyze the following passage, then at the last line conclude "Thus the coherency score is {{s}}", where s is an integer from 1 to 10.
{0}
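The scorer's reply is then parsed for the integer s in its final line. A minimal, hedged sketch (the exact parsing used in the paper is not shown here):

```python
import re

def parse_score(text):
    """Extract s from '... Thus the coherency score is {s}'; None if absent."""
    m = re.search(r"coherency score is \{?(\d+)\}?", text)
    return int(m.group(1)) if m else None

print(parse_score("The passage flows well.\nThus the coherency score is 7"))  # 7
```

Allowing optional braces around the digits makes the parser tolerant of the model echoing the template's {s} placeholder literally.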