# Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
## Abstract
We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning approach for enhancing the accuracy, efficiency and robustness of large language models (LLMs). Specifically, we propose a meta-buffer to store a series of informative high-level thoughts, namely thought-templates, distilled from the problem-solving processes across various tasks. Then, for each problem, we retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures to conduct efficient reasoning. To guarantee scalability and stability, we further propose a buffer-manager to dynamically update the meta-buffer, thus enhancing its capacity as more tasks are solved. We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Further analysis demonstrates the superior generalization ability and robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average. Notably, we find that our Llama3-8B + BoT has the potential to surpass the Llama3-70B model. Our project is available at https://github.com/YangLing0818/buffer-of-thought-llm
## 1 Introduction
A series of Large Language Models (LLMs) [1, 2, 3, 4, 5] like GPT-4 [3], PaLM [2] and LLaMA [6, 7] have showcased impressive performance on various reasoning tasks. In addition to scaling up the model size to improve reasoning performance, there are more effective prompting methods that further enhance the functionality and performance of LLMs. We divide these methods into two categories: (i) single-query reasoning: these methods [8, 9, 10] usually focus on prompt engineering and their reasoning process can be finished within a single query, such as CoT [8], which appends the input query with "Let's think step by step" to produce rationales that increase reasoning accuracy, and Few-shot Prompting [11, 12, 9, 13], which provides task-relevant exemplars to assist answer generation; (ii) multi-query reasoning: these methods [14, 15] focus on leveraging multiple LLM queries to elicit different plausible reasoning paths, thus decomposing a complex problem into a series of simpler sub-problems, such as Least-to-Most [16], ToT [14] and GoT [17].
However, both kinds of methods face limitations: (1) single-query reasoning usually requires prior assumptions or relevant exemplars of the reasoning process, which makes it impractical to design them manually task by task, and thus lacks universality and generalization; (2) due to the recursive expansion of reasoning paths, multi-query reasoning is usually computationally intensive when searching for a unique intrinsic structure underlying the reasoning process of each specific task; (3) both single-query and multi-query reasoning processes are limited by their designed exemplars and reasoning structures, and they neglect to derive general, high-level guidelines or thoughts from previously completed tasks, which are informative for improving efficiency and accuracy when solving similar problems.
<details>
<summary>x1.png Details</summary>

### Visual Description
## [Diagram]: LLM Reasoning Approaches (Single-Query, Multi-Query, Buffer of Thoughts)
### Overview
The image is a comparative diagram of three Large Language Model (LLM) reasoning workflows: **Single-Query**, **Multi-Query**, and **Buffer of Thoughts (BoT)**. It visually contrasts their components, processes, and performance (accuracy/efficiency) using icons, arrows, and labels.
### Components/Sections (Top to Bottom)
The diagram is divided into three horizontal sections (separated by a dashed line), each representing a distinct LLM reasoning paradigm:
#### 1. Top Section: Single-Query Approach
- **Input**: "Input query" (with a plus icon) → fed to a gray rectangle labeled "LLM".
- **Process**: "Single-Query" with "Manual Prompt for Specific Task (e.g., CoT, Few-shot Prompting)" → leads to a yellow rectangle labeled "Reasoning".
- **Output**: Gray oval labeled "Output".
- **Metric**: "Accuracy" with a downward arrow and a sad face (indicating low accuracy).
#### 2. Middle Section: Multi-Query Approach
- **Input**: "Input query" (plus icon) → fed to a gray rectangle labeled "LLM".
- **Process**: "Multi-Query" with "Pre-defined Query Structure (e.g., ToT, GoT)" → leads to a yellow rectangle labeled "Thought Expansion", then to "Reasoning" (yellow rectangle). An "N-hop Iteration" loop (curved arrows) connects "Thought Expansion" and "Reasoning".
- **Output**: Gray oval labeled "Output".
- **Metric**: "Efficiency" with a downward arrow and a sad face (indicating low efficiency).
#### 3. Bottom Section: Buffer of Thoughts (BoT) Approach
- **Input**: "Input query" (plus icon) → fed to a gray rectangle labeled "LLM".
- **Process**:
- "Problem Distiller" (blue rectangle, lightbulb icon) → "Instantiated Reasoning" (blue rectangle) → "Output" (gray oval).
- "Meta Buffer" (blue rectangle, cloud/computer icon) stores "High-level Thoughts" (arrow from Problem Distiller to Meta Buffer) and enables "Thought Retrieval" (arrow from Meta Buffer to Problem Distiller, group icon).
- "Thought Distillation and Update" (dashed arrow from Output to Meta Buffer) refines stored thoughts.
- **Metric**: "Accuracy & Efficiency" with an upward arrow and a happy face (indicating improved accuracy and efficiency).
- **Label**: "Buffer of Thoughts (BoT)" at the bottom left.
### Detailed Analysis
- **Single-Query**: Relies on a single, manual prompt (e.g., Chain-of-Thought, few-shot) for reasoning. Simple but yields low accuracy (sad face, down arrow).
- **Multi-Query**: Uses a pre-defined query structure (e.g., Tree-of-Thought, Graph-of-Thought) with iterative "Thought Expansion" and "Reasoning" (N-hop loop). Improves reasoning but reduces efficiency (sad face, down arrow).
- **BoT**: Introduces a "Problem Distiller" to process queries, "Instantiated Reasoning" for output, and a "Meta Buffer" to store/reuse "High-level Thoughts". Thoughts are retrieved, distilled, and updated, balancing accuracy and efficiency (happy face, up arrow).
### Key Observations
- **Performance Contrast**: BoT is the only approach with a happy face (improved accuracy/efficiency), while Single-Query (low accuracy) and Multi-Query (low efficiency) have sad faces.
- **Component Complexity**: BoT adds unique components (Problem Distiller, Meta Buffer) and processes (Thought Retrieval, Distillation/Update) absent in the other two.
- **Trade-Off Resolution**: BoT addresses the accuracy-efficiency trade-off by reusing/refining thoughts via a meta-buffer, unlike the single/iterative approaches.
### Interpretation
The diagram illustrates the evolution of LLM reasoning:
- **Single-Query** is basic but ineffective (low accuracy).
- **Multi-Query** improves reasoning via iteration but sacrifices efficiency.
- **BoT** optimizes both by leveraging a "Meta Buffer" to store and refine high-level thoughts, mimicking human-like reasoning (reusing/refining processes). This suggests that a buffer of pre-processed thoughts (and distillation) can enhance LLM performance, addressing the limitations of simpler/iterative approaches. The BoT model likely aims to balance accuracy (quality) and efficiency (speed) by reusing and updating thought processes, making it a more robust solution for complex reasoning tasks.
</details>
Figure 1: Comparison between single-query [8, 11], multi-query [14, 17], and our BoT methods.
To address these limitations, we propose Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning framework aimed at enhancing the reasoning accuracy, efficiency and robustness of LLMs across various tasks. Specifically, we design a meta-buffer, a lightweight library housing a series of universal high-level thoughts (thought-templates), which are distilled from different problem-solving processes and can be shared across tasks. Then, for each problem, we retrieve a relevant thought-template and instantiate it with a specific reasoning structure for efficient thought-augmented reasoning. To guarantee the scalability and stability of our BoT, we further propose a buffer-manager to dynamically update the meta-buffer, which effectively enhances its capacity as more tasks are solved.
Our method has three critical advantages: (i) Accuracy Improvement: with shared thought-templates, we can adaptively instantiate high-level thoughts to address different tasks, eliminating the need to build reasoning structures from scratch and thereby improving reasoning accuracy. (ii) Reasoning Efficiency: our thought-augmented reasoning directly leverages informative historical reasoning structures without complex multi-query processes, thus improving reasoning efficiency. (iii) Model Robustness: the procedure from thought retrieval to thought instantiation mirrors the human thought process, enabling LLMs to address similar problems in a consistent way and thus significantly enhancing model robustness. Our empirical studies demonstrate that Buffer of Thoughts significantly improves accuracy, efficiency, and robustness across a diverse array of tasks. Here, we summarize our contributions as follows:
1. We propose a novel thought-augmented reasoning framework Buffer of Thoughts (BoT) for improving the accuracy, efficiency and robustness of LLM-based reasoning.
1. We propose meta-buffer to store informative high-level thoughts distilled from different problems, and adaptively instantiate each thought-template to address each specific task.
1. We design buffer-manager to distill thought-templates from various solutions, which continually improves the capacity of the meta-buffer as more tasks are solved.
1. We conduct extensive experiments on 10 challenging reasoning-intensive tasks. Our BoT achieves significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One, while requiring only 12% of the cost of multi-query prompting methods on average.
## 2 Related Work and Discussions
Retrieval-Augmented Language Models
The retrieval-augmented (Large) Language Model is introduced as a solution to mitigate hallucination and enhance the output quality of language models [18, 19, 20, 21, 22]. When presented with an input question, the retrieval-augmented LLM first queries an external database containing billions of tokens [23] to retrieve a subset of the text corpus that helps generate the final answer. Notably, the retrieval-augmented LLM achieves superior question-answering performance using fewer parameters compared to conventional LLMs [19], and it has found application across various downstream tasks [24, 25, 26], including multi-modal generation [24, 22, 23, 25] and biomedical applications [26, 27]. In this paper, we construct a novel category of retrieval database, termed meta-buffer, which contains a series of high-level thoughts rather than specific instances, aiming to universally address various tasks for LLM-based reasoning.
Prompt-based Reasoning with Large Language Models
Prompting techniques have significantly enhanced the arithmetic and commonsense reasoning capabilities of LLMs. Chain-of-Thought (CoT) prompting [8] and its variants [28, 29, 30], such as Least-to-Most [16], Decomposed Prompting [31], and Auto-CoT [13], prompt LLMs to break down complex questions into simpler subtasks and systematically solve them before summarizing a final answer. Numerous studies [32, 33, 34, 35, 36, 37] have demonstrated the effectiveness of these prompting methods across a wide range of tasks and benchmarks. Innovations like Tree-of-Thought [14] and Graph-of-Thought [17] have further advanced this field by exploring dynamic, non-linear reasoning pathways to expand the heuristic capabilities of LLMs [38, 39]. However, they suffer from increased resource demands and greater time complexity, depend on manual prompt crafting, and are often tailored to specific task types. Recent meta-prompting methods [15, 40] use the same task-agnostic form of prompting for various tasks and recursively guide a single LLM to adaptively address different input queries. Nevertheless, such a long meta prompt may require a considerable context window, and these methods fail to leverage historical informative guidelines or thoughts from potentially similar tasks.
Analogical Reasoning
Analogical reasoning is a useful technique for natural language reasoning [41, 42, 43, 44, 45]. Recent works demonstrate that LLMs can perform analogical reasoning just like humans [46, 47, 12, 48, 49]. For example, Analogical Prompting [12] and Thought Propagation [48] prompt LLMs to self-generate a set of analogous problems, and then utilize the results of the analogous problems to produce a solution for the input problem. However, the specific solutions for self-explored problems may introduce additional noise and cause error accumulation. The recent Thought-Retriever [49] uses the intermediate thoughts generated when solving past user queries to address analogous ones, but it only focuses on textual comprehension/generation rather than general reasoning problems. Thus, a more high-level and general analogical approach for LLM complex reasoning is still lacking.
## 3 Buffer of Thoughts
Overview of Buffer of Thoughts
In this section, we introduce our Buffer of Thoughts in detail and illustrate the core thought-augmented reasoning process in Figure 2. Given a specific task, we utilize our problem-distiller (Section 3.1) to extract critical task-specific information along with relevant constraints. Based on the distilled information, we search the meta-buffer (Section 3.2), which contains a series of high-level thoughts (thought-templates), and retrieve the most relevant thought-template for the task. Subsequently, we instantiate the retrieved thought-template with more task-specific reasoning structures and conduct the reasoning process. Finally, we employ a buffer-manager (Section 3.3) to summarize the whole problem-solving process and distill high-level thoughts that increase the capacity of the meta-buffer.
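The overall pipeline can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' released code: `call_llm` stands in for any LLM API, the substring match stands in for the embedding retrieval of Section 3.2, and the redundancy check of Section 3.3 is omitted.

```python
def bot_solve(task, meta_buffer, call_llm):
    """One BoT pass (hypothetical sketch).

    call_llm(role, text) -> str stands in for any LLM API;
    meta_buffer is a list of (template, description) pairs.
    """
    # 1. Problem distiller (Section 3.1): extract key information and constraints.
    x_d = call_llm("distill", task)

    # 2. Template retrieval (Section 3.2): a naive substring match stands in
    #    for the embedding-similarity search of Eq. (2).
    template = next((t for t, desc in meta_buffer if desc in x_d), None)

    # 3. Instantiated reasoning (Eq. (3)), with a generic fallback template
    #    when the task is new.
    if template is None:
        template = "reason step by step from the distilled information"
    solution = call_llm("instantiate", template + " | " + x_d)

    # 4. Buffer manager (Section 3.3): distill a new thought-template and
    #    store it (the redundancy check of Eq. (5) is omitted here).
    new_template = call_llm("distill-template", x_d + " | " + solution)
    meta_buffer.append((new_template, new_template[:40]))
    return solution
```

Each numbered step corresponds to one component detailed in the following subsections.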
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram Type: Problem-Solving Method Comparison (Math Problem: Shirt Profit Optimization)
### Overview
The image illustrates multiple problem-solving approaches (Chain-of-Thought, Plan-and-Solve, Buffer of Thoughts, Instantiated Reasoning) for a math problem about optimizing shirt sales profit. The problem involves a mall selling shirts with initial sales (20/day) and profit (40 yuan/piece), where a 1 yuan price cut increases sales by 2/day, aiming for 1200 yuan daily profit. The diagram compares **incorrect** (Chain-of-Thought, Plan-and-Solve) and **correct** (Instantiated Reasoning) solutions, with a “Buffer of Thoughts” framework for structured problem-solving.
### Components/Axes (Sections)
- **Input Problem (Top)**: Text describing the shirt sales problem.
- **Chain-of-Thought (Left Top)**: Step-by-step (incorrect) solution with a red “X.”
- **Plan-and-Solve (Left Bottom)**: Step-by-step (incorrect) solution with a red “X.”
- **Buffer of Thoughts (Middle Top)**: Thought templates (T₁ for quadratic equations, Tₙ for general problem-solving) and a “Meta Buffer” (functions for processing elements).
- **Problem Distillation & Thought Retrieval (Middle)**: Links the “Buffer of Thoughts” to “Instantiated Reasoning.”
- **Instantiated Reasoning (Right)**: Correct solution with variable definitions, equation setup, and solution (marked with a red check).
### Detailed Analysis
#### 1. Input Problem Text
*“A certain shopping mall sells a batch of branded shirts, with an average daily sales of 20 pieces and a profit of 40 yuan per piece. In order to expand sales, increase profits, and reduce inventory as soon as possible, the mall has decided to take appropriate price reduction measures. After investigation, it was found that for every 1 yuan decrease in the price of this shirt, an average of 2 more shirts are sold per day. If the mall wants to make an average profit of 1200 yuan per day, how much price should each shirt be reduced?”*
#### 2. Chain-of-Thought (Incorrect Solution)
- **Step 1**: Current daily profit = \( 20 \times 40 = 800 \) yuan.
- **Step 2**: Additional daily profit from more shirts = \( 2 \times 40 = 80 \) yuan (incomplete, ignores price reduction’s impact on profit per shirt).
- **Step 5**: Price reduction needed = \( 320 / 2 = 160 \) yuan (flawed logic, marked with a red “X”).
#### 3. Plan-and-Solve (Incorrect Solution)
- **Steps**: Determine current profit → Set up equation → Solve.
- **Equation**: \( 1200 = (20 \times 40) + (2p \times 40) - (20 \times p) - (2p \times p) \) → Simplifies to \( p^2 - 30p - 200 = 0 \) → \( p \approx 35.6155 \) (incorrect, marked with a red “X”).
#### 4. Buffer of Thoughts (Structured Framework)
- **Thought Template \( T_1 \)** (Quadratic Equations):
- Step 1: Calculate the discriminant \( D = b^2 - 4ac \).
- Step 2: Determine the nature of roots ( \( D \geq 0 \): real roots; \( D = 0 \): repeated root; \( D < 0 \): complex roots).
- Step 3: Compute roots using \( x = \frac{-b \pm \sqrt{D}}{2a} \).
- **Meta Buffer**: Functions for general problem-solving:
- `process_element(element)`: Define how to process individual elements (e.g., filter/transform).
- `combine_elements(element1, element2)`: Define how to combine elements (e.g., sum numbers, concatenate strings).
- `check_condition(accumulated_result)`: Define the condition the result must meet.
- `solve_problem(input_list)`: Define the problem-solving logic.
#### 5. Instantiated Reasoning (Correct Solution)
- **Variables**:
- \( p = 40 \) (original profit per shirt, yuan).
- \( x \) = price reduction per shirt (yuan).
- \( q = 20 \) (original daily sales, pieces).
- \( r = 2 \) (sales increase per 1 yuan price cut, pieces/day).
- **Equation**: Profit = (Profit per shirt − Reduction) × (Quantity + Sales Increase)
\( (40 - x)(20 + 2x) = 1200 \)
- **Simplification**: \( x^2 - 30x + 200 = 0 \)
- **Discriminant**: \( D = (-30)^2 - 4(1)(200) = 100 \)
- **Roots**: \( x = \frac{30 \pm \sqrt{100}}{2} = \frac{30 \pm 10}{2} \) → \( x = 20 \) or \( x = 10 \).
- **Choice**: \( x = 20 \) (to reduce inventory faster, marked with a red check).
### Key Observations
- **Incorrect Approaches**: Chain-of-Thought and Plan-and-Solve fail to account for the price reduction’s impact on *both* profit per shirt and quantity sold, leading to flawed logic.
- **Correct Approach**: Instantiated Reasoning uses variable definitions, a correct profit equation, and solves the quadratic equation properly.
- **Buffer of Thoughts**: Provides a structured framework (templates, meta-functions) to guide problem-solving, emphasizing quadratic equation methods.
### Interpretation
The diagram highlights the importance of **systematic problem-solving** (via the “Buffer of Thoughts”) and correct variable/equation setup. The incorrect methods ignore the price reduction’s dual impact (on profit per shirt and sales volume), leading to wrong answers. The correct approach (Instantiated Reasoning) uses the quadratic formula correctly, choosing \( x = 20 \) to align with the goal of reducing inventory (higher sales volume). This demonstrates how structured thinking (templates, meta-functions) improves accuracy in optimization problems with multiple variables.
</details>
Figure 2: Illustration of different reasoning processes. Buffer of Thoughts enables large language models to tackle complex reasoning tasks through our thought-augmented reasoning process. The thought-template is marked in orange and the instantiated thought in blue.
### 3.1 Problem Distiller
Most complex tasks contain implicit constraints, complex object relationships, and intricate variables and parameters within their contexts. Consequently, during the reasoning stage, LLMs need to overcome three main challenges: extracting vital information, recognizing potential constraints, and performing accurate reasoning. These challenges impose a significant burden on a single LLM. Therefore, we separate the extraction and comprehension stages of task information from the final reasoning stage by prepending a problem distiller to the reasoning process. More concretely, we design a meta prompt $\phi$ to first distill and formalize the task information. The distilled task information can be denoted as:
$$
x_{d}=LLM(\phi(x)), \tag{1}
$$
where $x$ is the task statement. Due to the page limit, we put the detailed meta prompt for problem-distiller in Section A.2.
Problem Condensation and Translation
We use the problem distiller to extract key elements from input tasks, focusing on: (1) essential parameters and variables for problem-solving; (2) the objectives of the input tasks and their corresponding constraints. We then reorganize this distilled information into a clear, comprehensible format for the subsequent reasoning stage, and translate the specific problems into high-level concepts and structures. This translation decomposes complex real-world problems, such as intricate mathematical application scenarios, into simpler multi-step calculations, making the later retrieval of high-level thoughts easier.
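The distillation step of Eq. (1) amounts to a single prompted LLM call. A minimal sketch follows; the prompt wording here is illustrative only (the authors' actual meta prompt $\phi$ is given in Section A.2), and `call_llm` is a hypothetical stand-in for any LLM API.

```python
def problem_distiller(x, call_llm):
    """Sketch of Eq. (1): x_d = LLM(phi(x)).

    The prompt below is an illustrative paraphrase of the two focuses named
    in the text (parameters/variables, objectives/constraints) plus the
    high-level translation step; it is not the paper's exact meta prompt.
    """
    phi = (
        "Distill the following problem. Extract:\n"
        "1. Essential parameters and variables for problem-solving;\n"
        "2. The objectives and their corresponding constraints;\n"
        "3. A high-level restatement as simpler multi-step operations.\n\n"
        "Problem: " + x
    )
    return call_llm(phi)
```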
### 3.2 Thought-Augmented Reasoning with Meta Buffer
Motivation
Humans often summarize and induce higher-level guidelines when solving problems, and then apply them to relevant problems. Motivated by this, we propose meta-buffer, a lightweight library that contains a series of high-level thoughts (thought-templates) for addressing various types of problems. Unlike traditional methods [11, 46, 12, 36, 9] that require specific instructions or exemplars, our high-level thought-templates can be adaptively instantiated when solving different problems, thereby enhancing LLMs with superior precision and flexibility.
Thought Template
As a kind of high-level guideline, each thought-template is stored in the meta-buffer and is obtained from various problem-solving processes by our buffer-manager. The details of acquiring thought-templates are introduced in Section 3.3. Since our BoT aims to provide a general reasoning approach for various tasks, we classify the thought-templates into six categories: Text Comprehension, Creative Language Generation, Common Sense Reasoning, Mathematical Reasoning, Code Programming and Application Scheduling. We provide some example thought-templates in Section A.1. This classification facilitates template retrieval for finding the most suitable solutions to different problems. We denote a thought-template, its description and its category as $(T_{i},D_{T_{i}},C_{k})$, where $i$ is the index of the template, and $k\in\mathbb{Z^{+}}$ with $1\leq k\leq 6$ indexes one of the six categories.
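A meta-buffer entry $(T_{i},D_{T_{i}},C_{k})$ can be represented as a small record type. This is our own illustrative data structure, assuming only the triple and the six category names stated above:

```python
from dataclasses import dataclass

# The six template categories named in the text.
CATEGORIES = (
    "Text Comprehension", "Creative Language Generation",
    "Common Sense Reasoning", "Mathematical Reasoning",
    "Code Programming", "Application Scheduling",
)

@dataclass(frozen=True)
class ThoughtTemplate:
    """One (T_i, D_{T_i}, C_k) entry of the meta-buffer."""
    template: str     # T_i: the high-level solution guideline
    description: str  # D_{T_i}: matched against distilled problems at retrieval
    category: str     # C_k: one of the six categories above

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
```

Storing the description separately from the template body is what makes the embedding-based retrieval of the next paragraph possible.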
Template Retrieval
For each task, our BoT retrieves the thought-template $T_{j}$ that is most similar to the distilled problem $x_{d}$, by calculating the embedding similarity between each description $D_{T_{i}}$ and $x_{d}$. The retrieval process can be formulated as:
$$
j=\text{argmax}_{i}\,\text{Sim}(f(x_{d}),f(D_{T_{i}})),\quad
\text{where}\quad\max_{1\leq i\leq N}\text{Sim}(f(x_{d}),f(D_{T_{i}}))\geq\delta, \tag{2}
$$
Here, $N$ is the size of the meta-buffer, $f(\cdot)$ is a standard text-embedding model, and $T_{j}$ denotes the retrieved thought-template. We set a threshold $\delta$ ($0.5\sim 0.7$ is recommended) to determine whether the current task is new: if $\max_{1\leq i\leq N}\text{Sim}(f(x_{d}),f(D_{T_{i}}))<\delta$, we identify the task $x$ as a new task.
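Eq. (2) with its threshold test is a standard maximum-cosine-similarity lookup. A self-contained sketch, assuming the embeddings $f(x_{d})$ and $f(D_{T_{i}})$ are already computed by some text-embedding model:

```python
import math

def retrieve_template(xd_emb, desc_embs, delta=0.6):
    """Eq. (2) as code: return the index j of the template description most
    similar to the distilled problem embedding, or None when the maximum
    similarity falls below delta (the task is then treated as new).
    delta in [0.5, 0.7] is the recommended range."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    sims = [cos(xd_emb, e) for e in desc_embs]
    j = max(range(len(sims)), key=sims.__getitem__)
    return j if sims[j] >= delta else None
```

In practice the descriptions' embeddings would be precomputed once and cached, so each retrieval is a single pass over the meta-buffer.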
Instantiated Reasoning
For each specific task, we discuss two situations for the instantiated reasoning, depending on whether the current task is new. The first situation is that we successfully retrieve a thought-template $T_{j}$ for the task. In this case, as presented in Figure 2, our thought-augmented reasoning is adaptively instantiated into a suitable reasoning structure with our designed instantiation prompt (in Section A.3). For example, in a Checkmate-in-One problem, we instantiate the template of updating the chess-board state to solve the problem step by step. We then conduct instantiated reasoning for task $x$ using the distilled information $x_{d}$ and the retrieved template $T_{j}$, producing its solution $S_{x}$ as:
$$
S_{x}=LLM_{\text{instantiation}}(x_{d},T_{j}), \tag{3}
$$
where $LLM_{\text{instantiation}}$ denotes the instantiated reasoner implemented with an LLM.
In the second situation, the task is identified as a new task. To enable proper instantiated reasoning, we prepare three general, coarse-grained thought-templates. Based on the distilled task information $x_{d}$, our BoT automatically assigns a suitable thought-template to the reasoning process. The detailed pre-defined thought-templates are included in Section A.3.
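The two situations can be combined into one dispatch, sketched below. The three generic template strings are placeholders of our own (the real coarse-grained templates are in Section A.3), and always picking the first one is a simplification of BoT's automatic assignment based on $x_{d}$:

```python
# Placeholder names for the three coarse-grained fallback templates of
# Section A.3; the actual templates are given there.
GENERIC_TEMPLATES = (
    "analyze the problem step by step",
    "decompose the problem into simpler sub-problems",
    "write and check code that solves the problem",
)

def instantiated_reasoning(x_d, template, call_llm):
    """Sketch of Eq. (3): S_x = LLM_instantiation(x_d, T_j).

    `template` is the retrieved T_j, or None for a new task, in which case a
    generic template is substituted (here: always the first, for simplicity).
    `call_llm` is a hypothetical stand-in for any LLM API.
    """
    if template is None:
        template = GENERIC_TEMPLATES[0]
    prompt = ("Instantiate the following thought-template for the problem.\n"
              f"Template: {template}\nProblem: {x_d}")
    return call_llm(prompt)
```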
### 3.3 Buffer Manager
We propose buffer-manager to summarize the high-level guidelines and thoughts gained from each problem-solving process. It generalizes each specific solution to more problems by storing the critical distilled knowledge in the form of thought-templates within the meta-buffer. In contrast to methods that temporarily generate exemplars or instructions for each problem, our buffer-manager ensures lasting improvements in accuracy, efficiency, and robustness for LLM-based reasoning.
Template Distillation
To extract a general thought-template, we propose a three-step approach: (1) Core task summarization: identify and describe the basic type and core challenges of the problem; (2) Solution steps description: summarize the general steps for solving the problem; (3) General answering template: based on the above analysis, propose a solution template or approach that can be widely applied to similar problems. Additionally, to boost the generalization ability and stability of template distillation, we carefully design two types of in-context examples of how to generate a thought-template: in-task and cross-task examples. Cross-task means we choose a template distilled from one task to tackle problems of other tasks, such as addressing a mathematical problem with a code-related thought-template. The new template distilled from input task $x$ can be denoted as:
$$
T_{new}=LLM_{\text{distill}}(x_{d},S_{x}), \tag{4}
$$
where $LLM_{\text{distill}}$ is the LLM-based template distiller initialized with the following prompt:
**Prompt for Template Distillation**

> User: [Problem Description] + [Solution Steps or Code]
>
> To extract and summarize the high-level paradigms and general approaches for solving such problems, please follow these steps in your response:
> 1. Core task summarization: Identify and describe the basic type and core challenges of the problem, such as classifying it as a mathematical problem (e.g., solving a quadratic equation), a data structure problem (e.g., array sorting), an algorithm problem (e.g., search algorithms), etc. And analyze the most efficient way to solve the problem.
> 2. Solution Steps Description: Outline the general solution steps, including how to define the problem, determine variables, list key equations or constraints, choose appropriate solving strategies and methods, and how to verify the correctness of the results.
> 3. General Answer Template: Based on the above analysis, propose a template or approach that can be widely applied to this type of problem, including possible variables, functions, class definitions, etc. If it is a programming problem, provide a set of base classes and interfaces that can be used to construct solutions to specific problems.
>
> Please ensure that your response is highly concise and structured, so that specific solutions can be transformed into generalizable methods.
>
> [Optional] Here are some exemplars of the thought-template: (Choose cross-task or in-task exemplars based on the analysis of the Core task summarization.)
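Programmatically, Eq. (4) is again a single prompted LLM call. A minimal sketch, where the prompt body is an abridged paraphrase of the distillation prompt above and `exemplars` would hold the optional in-task or cross-task examples:

```python
def distill_template(x_d, solution, call_llm, exemplars=""):
    """Sketch of Eq. (4): T_new = LLM_distill(x_d, S_x).

    The prompt is an abridged paraphrase of the paper's distillation prompt;
    `call_llm` stands in for any LLM API.
    """
    prompt = (
        f"{x_d}\n{solution}\n"
        "Extract the high-level paradigm for solving such problems:\n"
        "1. Core task summarization\n"
        "2. Solution steps description\n"
        "3. General answer template\n"
        "Be highly concise and structured." + exemplars
    )
    return call_llm(prompt)
```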
Dynamic Update of Meta-Buffer
After template distillation, we need to consider whether the distilled template should be added to the meta-buffer. If we initialize an empty meta-buffer or encounter a problem without a proper thought-template, the distilled thought-template is directly stored in the meta-buffer. If we solve a problem with a retrieved thought-template, new insights may arise during the instantiation of that thought-template. Therefore, to avoid redundancy in the meta-buffer while retaining newly-generated informative thoughts, we calculate the similarity between the embedding vectors of $D_{T_{new}}$ and $\{D_{T_{i}}\}_{i=1}^{N}$ and update the meta-buffer under the following rule:
$$
\text{Max}(\text{Sim}(f(D_{T_{new}}),\{f(D_{T_{i}})\}_{i=1}^{N}))<\delta. \tag{5}
$$
Otherwise, the meta-buffer already possesses the necessary knowledge to solve this task and does not need to be updated. Our dynamic update strategy effectively reduces the computational burden of template retrieval while keeping the meta-buffer lightweight. We further conduct an ablation study of this strategy in Section 6.
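The update rule of Eq. (5) can be written as a small predicate over description embeddings, mirroring the retrieval similarity above:

```python
import math

def should_store(new_emb, existing_embs, delta=0.6):
    """Eq. (5) as a predicate: store T_new only when its description
    embedding is not within delta of any stored template description, so
    the meta-buffer stays compact. An empty meta-buffer always stores."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    if not existing_embs:
        return True
    return max(cos(new_emb, e) for e in existing_embs) < delta
```

Note that the same threshold $\delta$ serves double duty: below it, a task is "new" at retrieval time, and a distilled template is "novel enough" at update time.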
## 4 Experiments
Datasets and Tasks
To evaluate the efficacy of our proposed Buffer of Thoughts and compare with previous methods, we consider a diverse set of tasks and datasets that require varying degrees of mathematical and algorithmic reasoning, domain-specific knowledge, and literary creativity: (a). The Game of 24 from ToT [14], where the objective is to form an arithmetic expression that equals 24 using each of four given numbers exactly once; (b). Three BIG-Bench Hard (BBH) [35] tasks: Geometric Shapes, Multi-Step Arithmetic Two, and Word Sorting; (c). Three reasoning tasks directly obtained from the BIG-Bench suite [50]: Checkmate-in-One, Penguins —where the task is to answer questions about penguins’ attributes based on a given table and additional natural language information, and DateUnderstanding —a task that involves inferring dates from natural language descriptions, performing arithmetic operations on dates, and utilizing global knowledge such as the number of days in February; (d). Python Programming Puzzles (P3) [51, 52], a collection of challenging programming puzzles written in Python with varying difficulty levels; (e). Multilingual Grade School Math (MGSM) [33], a multilingual version of the GSM8K dataset [53] featuring translations of a subset of examples into ten typologically diverse languages, including Bengali, Japanese, and Swahili; (f). Shakespearean Sonnet Writing from meta-prompting [15], a novel task where the goal is to write a sonnet following the strict rhyme scheme "ABAB CDCD EFEF GG" and incorporating three provided words verbatim.
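To make the Game of 24 task concrete, the sketch below is a brute-force reference solver for the task itself (an illustration of the benchmark, not of BoT); it uses exact rational arithmetic to avoid floating-point error when division is involved:

```python
import itertools
from fractions import Fraction

def solves_24(nums):
    """Return an arithmetic expression over the four given numbers that
    evaluates to 24 (each number used exactly once), or None if impossible.
    Brute-force search over all pairings and operators; illustration only."""
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b if b else None}

    def search(items):  # items: list of (value, expression-string) pairs
        if len(items) == 1:
            return items[0][1] if items[0][0] == 24 else None
        # Pick an ordered pair, combine it, and recurse on the reduced list.
        for i, j in itertools.permutations(range(len(items)), 2):
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            for sym, fn in ops.items():
                v = fn(items[i][0], items[j][0])
                if v is None:  # division by zero
                    continue
                found = search(rest + [(v, f"({items[i][1]} {sym} {items[j][1]})")])
                if found:
                    return found
        return None

    return search([(Fraction(n), str(n)) for n in nums])
```

A solver like this is how Game of 24 answers are typically verified; the reported accuracy is the fraction of instances for which a method produces a valid such expression.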
Implementation and Baselines
For fair comparison with previous methods, we use GPT-4 as the base model of our BoT in both the main experiments and the ablation study (Section 6). We also use Llama3-8B and Llama3-70B in our analysis, run on an NVIDIA A100-PCIE-40GB GPU. We compare our Buffer of Thoughts with the following prompting methods: 1. Standard Prompting: our most basic baseline, where an LLM is asked to generate a response directly from the input query, without any guiding input-output examples or additional instructions beyond the task description included in the query.
2. Single-query Method: This includes Zero-shot CoT [8] and PAL [10], which use the LLM to analyze natural language problems and generate intermediate reasoning steps. We also include Expert Prompting [9], which creates an expert identity tailored to the specific context of the input query, and then integrates this expert profile into the input to generate a well-informed response.
3. Multi-query Method: This includes ToT [14] and GoT [17], which enable LLMs to make deliberate decisions by considering multiple reasoning paths and self-evaluating choices to determine the next course of action. These methods also allow for looking ahead or backtracking when necessary to make global decisions. Additionally, we include Meta Prompting [15], which employs an effective scaffolding technique designed to enhance the functionality of LLMs.
Table 1: Comparing BoT with previous methods across various tasks. We denote the best score in blue, and the second-best score in green. Our BoT significantly outperforms other methods on all tasks, especially on general reasoning problems.
| Task | Standard | CoT | Expert-Prompting | PAL | ToT | GoT | Meta-Prompting | BoT (ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Game of 24 | 3.0 | 11.0 | 3.0 | 64.0 | 74.0 | 73.2 | 67.0 | 82.4 |
| MGSM (avg) | 84.4 | 85.5 | 85.0 | 72.0 | 86.4 | 87.0 | 84.8 | 89.2 |
| Multi-Step Arithmetic | 84.0 | 83.2 | 83.2 | 87.4 | 88.2 | 89.2 | 90.0 | 99.8 |
| Word Sorting | 80.4 | 83.6 | 85.2 | 93.2 | 96.4 | 98.4 | 99.6 | 100.0 |
| Python Puzzles | 31.1 | 36.3 | 33.8 | 47.3 | 43.5 | 41.9 | 45.8 | 52.4 |
| Geometric Shapes | 52.6 | 69.2 | 55.2 | 51.2 | 56.8 | 54.2 | 78.2 | 93.6 |
| Checkmate-in-One | 36.4 | 32.8 | 39.6 | 10.8 | 49.2 | 51.4 | 57.2 | 86.4 |
| Date Understanding | 68.4 | 69.6 | 68.4 | 76.2 | 78.6 | 77.4 | 79.2 | 88.2 |
| Penguins | 71.1 | 73.6 | 75.8 | 93.3 | 84.2 | 85.4 | 88.6 | 94.7 |
| Sonnet Writing | 62.0 | 71.2 | 74.0 | 36.2 | 68.4 | 62.8 | 79.6 | 80.0 |
### 4.1 BoT Achieves Better Accuracy, Efficiency and Robustness
Reasoning Accuracy
As shown in Table 1, our BoT consistently outperforms all previous prompting methods across multiple kinds of challenging benchmarks, particularly on complicated reasoning tasks such as Game of 24 and Checkmate-in-One. Taking GPT-4 as a baseline, our method achieves an astonishing 79.4% accuracy improvement on Game of 24, and even compared to ToT, which performs well on this task, we achieve a further 8.4% accuracy improvement. Moreover, compared to the recent Meta-prompting method [15], we see significant accuracy improvements: 23% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Existing methods need complex, iterative, and heuristic search strategies to address these problems on a case-by-case basis. Conversely, our BoT leverages the historical insights and informative guidelines of thought-templates, and adaptively instantiates a more optimal reasoning structure for addressing these complex problems.
<details>
<summary>x3.png Details</summary>

Grouped bar chart "Comparison of the inference time" (y-axis: logarithmic time in seconds). Game of 24: Expert 4.64 s, PAL 5.5 s, ToT 8.73 s, Meta-prompting 8.47 s, Ours 5.17 s. MGSM: Expert 4.16 s, PAL 4.81 s, ToT 8.34 s, Meta-prompting 8.04 s, Ours 5 s. Checkmate-in-One: Expert 4.81 s, PAL 5.21 s, ToT 9.03 s, Meta-prompting 8.43 s, Ours 6.39 s. Our method stays close to single-query latency, while ToT and Meta-prompting are consistently the slowest.
</details>
Figure 3: Comparison of logarithmic inference time between our Buffer of Thoughts and GPT4 [3], GPT4+CoT [8], Expert-prompting [9], PAL [10], ToT [14] across different benchmarks.
<details>
<summary>x4.png Details</summary>

Grouped bar chart "Success rate" (y-axis: average accuracy %). Game of 24: GPT4 27, Expert 36, PAL 61, ToT 71, Ours 98. MGSM: GPT4 85, Expert 87, PAL 76, ToT 84, Ours 96.8. Checkmate-in-One: GPT4 48.2, Expert 53.4, PAL 36.4, ToT 78.4, Ours 93.4. Average: GPT4 67.13, Expert 71.82, PAL 70.12, ToT 84.57, Ours 95.15. Ours achieves the highest success rate in every category.
</details>
Figure 4: Comparison of reasoning robustness between our Buffer of Thoughts and GPT4 [3], GPT4+CoT [8], Expert-prompting [9], PAL [10], ToT [14] across different benchmarks.
Reasoning Efficiency
In addition to significant improvements in accuracy, as a multi-query method, our BoT achieves reasoning time comparable to single-query methods across various tasks, and considerably lower than conventional multi-query methods like ToT [14], as shown in Figure 3. For example, in Game of 24, both single-query and multi-query methods necessitate iterative and heuristic searches to identify feasible solutions, a process that is particularly time-consuming and inefficient for multi-query methods, which involve multi-query search and backtracking phases. In contrast, our BoT directly retrieves a thought-template in code format and instantiates a program that traverses combinations of numbers and operators, eliminating the need to build the reasoning structure from scratch. This allows the problem to be solved with just one query after invoking the problem-distiller, significantly reducing the time required for complex reasoning. Notably, our BoT requires only 12% of the cost of multi-query methods (e.g., tree of thoughts and meta-prompting) on average.
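As an illustration of what such a code-format template can instantiate, the following is a minimal sketch (our own illustration, not the exact template stored in the meta-buffer) of a program that traverses orderings, operators, and parenthesizations for Game of 24:

```python
from itertools import permutations, product

def solve_24(nums, target=24, eps=1e-6):
    """Exhaustively search number orderings, operator choices, and
    parenthesizations for an expression over `nums` equal to `target`."""
    ops = ["+", "-", "*", "/"]
    # The five parenthesizations of four operands; permuting the numbers
    # covers the symmetric variants.
    patterns = [
        "(({a} {x} {b}) {y} {c}) {z} {d}",
        "({a} {x} {b}) {y} ({c} {z} {d})",
        "({a} {x} ({b} {y} {c})) {z} {d}",
        "{a} {x} (({b} {y} {c}) {z} {d})",
        "{a} {x} ({b} {y} ({c} {z} {d}))",
    ]
    for a, b, c, d in permutations(nums):
        for x, y, z in product(ops, repeat=3):
            for pat in patterns:
                expr = pat.format(a=a, b=b, c=c, d=d, x=x, y=y, z=z)
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr
                except ZeroDivisionError:
                    continue
    return None

print(solve_24([4, 9, 10, 13]))  # prints one expression that evaluates to 24
```

Because the template is a fixed program rather than a search over LLM calls, each new Game of 24 instance costs only the one query that fills in the numbers.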
Reasoning Robustness
To better evaluate our BoT, we devise a new evaluation metric, success rate, to assess reasoning robustness. We randomly sample 1000 examples from the various benchmarks as a test subset and evaluate different methods on this subset. As shown in Figure 4, we repeat this evaluation 10 times and take the average accuracy as each method's success rate on a benchmark. Compared with other methods, our BoT consistently maintains a higher success rate across various tasks, surpassing the second-best method by 10% in average success rate. We attribute this robustness to the strong generalization ability of our distilled thought-templates across different tasks: by offering high-level thoughts from suitable thought-templates, the stability of our method across tasks is greatly enhanced.
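The protocol can be summarized as a small evaluation loop. In the sketch below, `solve` stands for any (possibly stochastic) method under test that returns whether a single example was answered correctly; the callable and sampling details are our reading of the setup, not the paper's exact harness:

```python
import random

def success_rate(solve, examples, n_samples=1000, n_repeats=10, seed=0):
    """Robustness metric: evaluate a solver on a random benchmark subset,
    repeat the run n_repeats times, and average the per-run accuracy."""
    rng = random.Random(seed)
    subset = rng.sample(examples, min(n_samples, len(examples)))
    runs = []
    for _ in range(n_repeats):
        correct = sum(1 for ex in subset if solve(ex))
        runs.append(correct / len(subset))
    return sum(runs) / len(runs)
```

Averaging over repeated runs on the same subset isolates the variance introduced by the LLM's stochastic decoding, which is what the metric is meant to capture.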
## 5 Model Analysis
Distribution Analysis of Thought-Templates
As depicted in the left part of Figure 5, we choose six different benchmarks and sample 100 distinct tasks from each. We update the meta-buffer from scratch and, after completing all sampled tasks, report the number of derived thought-templates. We observe that our BoT generates a greater number of thought-templates on MGSM, whose tasks contain more diverse scenarios. On tasks with relatively fixed requirements, such as Checkmate-in-One and Penguins, BoT instead produces fewer, more specialized thought-templates tailored to those specific problems. This distribution of templates indicates that our BoT can effectively discover appropriate thought-templates for different benchmarks.
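One way to read this distribution is through the buffer-manager's update decision. The sketch below assumes an embedding-similarity criterion with a hypothetical `embed` function and `threshold`; it illustrates the idea that diverse benchmarks accumulate more templates, and is not the paper's exact procedure:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def update_meta_buffer(meta_buffer, new_template, embed, threshold=0.9):
    """Store a newly distilled thought-template only when no stored
    template is already similar (`embed` and `threshold` are assumed
    for illustration). Diverse task distributions therefore yield more
    templates than narrow, fixed ones."""
    new_vec = embed(new_template)
    if any(cosine(embed(t), new_vec) >= threshold for t in meta_buffer):
        return False  # near-duplicate: keep the meta-buffer compact
    meta_buffer.append(new_template)
    return True
```

Under such a rule, 100 near-identical chess puzzles collapse into a handful of templates, while 100 heterogeneous MGSM problems each have a chance of contributing a new one.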
<details>
<summary>x5.png Details</summary>

Pie chart "Template distribution across different tasks" (number of derived thought-templates): MGSM 78, Sonnet writing 37, Python Puzzles 37, Date understanding 14, Table of Penguins 8, Checkmate-in-One 4.
</details>
<details>
<summary>x6.png Details</summary>

Pie chart "Average time distribution for each part of our BoT": reasoner 52.7, buffer-manager 21.3, problem-distiller 15.6, meta-buffer 8.9.
</details>
Figure 5: Distribution Analysis of Thought-Templates and Time. Left: Distribution Analysis of Thought-Templates. Right: Distribution Analysis of Time Cost.
Distribution Analysis of Time Cost
As illustrated in Figure 5, we measure the average time cost of each component of BoT's reasoning framework across different tasks. Distilling task information and retrieving templates take relatively little time, whereas instantiated reasoning takes longer. Overall, considering the complexity of the different components, our BoT achieves a relatively balanced distribution of time cost, demonstrating the efficiency of the framework.
Better Trade-off between Model Size and Performance
As depicted in Figure 6, on Game of 24, word list sorting and Checkmate-in-One, the vanilla Llama3-8B and Llama3-70B models [6] perform poorly. However, equipped with our BoT, both models demonstrate a substantial accuracy improvement; notably, BoT+Llama3-8B has the potential to surpass the standalone Llama3-70B model. Our BoT thus enables smaller models to approximate or even surpass the capabilities of larger models, significantly bridging the gap between their reasoning abilities, while greatly diminishing the inference cost that large language models require when tackling complex problems.
<details>
<summary>x7.png Details</summary>

Horizontal bar chart "Trade-off between model size and performance" (accuracy %). Checkmate-in-One: BoT+Llama3-70B 75.6, BoT+Llama3-8B 56.7, Llama3-70B 15, Llama3-8B 0.8. Word list sorting: BoT+Llama3-70B 92.3, BoT+Llama3-8B 73.4, Llama3-70B 79, Llama3-8B 48.4. Game of 24: BoT+Llama3-70B 78.4, BoT+Llama3-8B 73.4, Llama3-70B 2.4, Llama3-8B 1.2.
</details>
Figure 6: We evaluate the trade-off between model size and performance with Llama3-8B and Llama3-70B models on three challenging benchmarks.
## 6 Ablation Study
Impact of Problem-Distiller
As illustrated in Figure 7, when the problem-distiller is disabled, both Llama3-70B and GPT-4 experience a certain degree of accuracy decline. More complex problems, such as Game of 24 and Checkmate-in-One, show a more significant accuracy reduction, whereas relatively simpler problems like word list sorting and MGSM exhibit smaller decreases. This is because LLMs can more easily extract key information in simpler tasks, making the impact of the problem-distiller less noticeable. In contrast, extracting key information and potential constraints in complex problems is more challenging, making the role of our problem-distiller more prominent, thereby explaining the differences depicted in the figure.
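Conceptually, the problem-distiller is an extra pre-processing query that extracts the key information and constraints before reasoning begins. A minimal sketch of such a prompt builder follows; the wording is our hypothetical illustration, not the paper's actual distiller prompt:

```python
def build_distiller_prompt(user_query):
    """Build a hypothetical problem-distiller query: ask the LLM to
    extract key information, constraints, and the objective up front."""
    return (
        "Distill the following problem before solving it.\n"
        "1. Key information: the essential variables and facts.\n"
        "2. Constraints: explicit and implicit restrictions.\n"
        "3. Objective: what a correct answer must provide.\n\n"
        f"Problem: {user_query}\n"
        "Return only the three numbered fields."
    )
```

For simple tasks the base model extracts these fields on its own, which is consistent with the smaller ablation gaps observed on word list sorting and MGSM.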
<details>
<summary>x8.png Details</summary>

Grouped bar chart "Ablation study of problem-distiller": accuracy (%) with and without (w/o) the problem-distiller for Llama3-70B and GPT-4 base models.

| Task | BoT+Llama-3-70B (w/o problem-distiller) | BoT+Llama-3-70B | BoT+GPT-4 (w/o problem-distiller) | BoT+GPT-4 |
| :--- | :--- | :--- | :--- | :--- |
| Game of 24 | 71.2 | 78.4 | 76.5 | 82.4 |
| Word list sorting | 89.5 | 92.3 | 97.3 | 99.6 |
| Checkmate-in-One | 64.3 | 75.6 | 78.9 | 86.4 |
| MGSM | 85.6 | 86.8 | 87.4 | 89.2 |
</details>
Figure 7: We conduct ablation study on problem-distiller across four benchmarks, employing Llama3-70B and GPT-4 as the base models.
Impact of Meta-Buffer
As illustrated in Figure 8, when the meta-buffer is disabled, both Llama3-70B and GPT-4 models exhibit a noticeable decline in performance, particularly in benchmarks requiring complex reasoning, such as Game of 24 and Checkmate-in-One. This further underscores the superiority of our meta-buffer in addressing complex problems.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Bar Chart: Ablation study of meta-buffer
### Overview
This is a grouped bar chart titled "Ablation study of meta-buffer." It compares the performance (accuracy in %) of four different model configurations across four distinct tasks. The chart is designed to show the impact of including or excluding a "meta-buffer" component when using two base models (Llama-3-70B and GPT-4) within a framework called "BoT".
### Components/Axes
* **Title:** "Ablation study of meta-buffer" (centered at the top).
* **Legend:** Positioned at the top center, below the title. It defines four data series:
* **Blue Square:** BoT + Llama-3-70B (w/o meta-buffer)
* **Orange Square:** BoT+Llama-3-70B
* **Gray Square:** BoT+GPT-4 (w/o meta-buffer)
* **Yellow Square:** BoT+GPT-4
* **Y-Axis:** Labeled "Accuracy (%)". The scale runs from 0 to 100 with major tick marks every 10 units (0, 10, 20, ..., 100).
* **X-Axis:** Represents four different tasks. The category labels are:
1. Game of 24
2. Word list sorting
3. Checkmate-in-One
4. MGSM
### Detailed Analysis
The chart presents accuracy percentages for each model configuration on each task. The data is as follows:
| Task | BoT + Llama-3-70B (w/o meta-buffer) [Blue] | BoT+Llama-3-70B [Orange] | BoT+GPT-4 (w/o meta-buffer) [Gray] | BoT+GPT-4 [Yellow] |
| :--- | :---: | :---: | :---: | :---: |
| **Game of 24** | 65.6 | 78.4 | 75.2 | 82.4 |
| **Word list sorting** | 81.7 | 92.3 | 95.4 | 99.6 |
| **Checkmate-in-One** | 27.4 | 75.6 | 56.7 | 86.4 |
| **MGSM** | 79.6 | 86.8 | 85.4 | 89.2 |
**Trend Verification per Data Series:**
* **Blue Bars (BoT + Llama-3-70B w/o meta-buffer):** Performance varies significantly by task. It is lowest on "Checkmate-in-One" (27.4%) and highest on "Word list sorting" (81.7%).
* **Orange Bars (BoT+Llama-3-70B):** Consistently shows higher accuracy than its blue counterpart (without meta-buffer) across all tasks. The improvement is most dramatic for "Checkmate-in-One".
* **Gray Bars (BoT+GPT-4 w/o meta-buffer):** Generally performs well, but shows a notable dip on "Checkmate-in-One" (56.7%) compared to other tasks.
* **Yellow Bars (BoT+GPT-4):** Consistently achieves the highest accuracy among all four configurations for every single task. The trend is a clear, step-wise improvement over the gray bars (its counterpart without meta-buffer).
### Key Observations
1. **Universal Benefit of Meta-Buffer:** For both base models (Llama-3-70B and GPT-4), the configuration *with* the meta-buffer (orange and yellow) always outperforms the configuration *without* it (blue and gray) on the same task.
2. **Task-Dependent Impact:** The performance gain from adding the meta-buffer is not uniform. It is most pronounced on the "Checkmate-in-One" task, where the Llama-3-70B configuration sees a 48.2 percentage point increase (27.4% to 75.6%), and the GPT-4 configuration sees a 29.7 point increase (56.7% to 86.4%).
3. **Model Comparison:** The GPT-4 based configurations (gray and yellow) generally outperform the Llama-3-70B based configurations (blue and orange) on the same task, with or without the meta-buffer. The margin is narrowest on "Word list sorting," where BoT+GPT-4 (w/o meta-buffer) at 95.4% only slightly exceeds BoT+Llama-3-70B at 92.3%.
4. **Highest and Lowest Scores:** The highest accuracy recorded is 99.6% (BoT+GPT-4 on Word list sorting). The lowest is 27.4% (BoT + Llama-3-70B w/o meta-buffer on Checkmate-in-One).
### Interpretation
This ablation study provides strong evidence for the efficacy of the "meta-buffer" component within the BoT framework. The data suggests that the meta-buffer acts as a critical performance enhancer, particularly for tasks that likely require complex reasoning or multi-step planning, such as "Checkmate-in-One" (a chess puzzle) and "Game of 24" (a mathematical puzzle).
The consistent superiority of the yellow bars (BoT+GPT-4) indicates that the combination of a more powerful base model (GPT-4) with the meta-buffer yields the best results. However, the substantial relative improvements seen in the orange bars (BoT+Llama-3-70B) demonstrate that the meta-buffer can significantly elevate the capabilities of a smaller model, making it a valuable architectural addition regardless of the base model's scale. The near-ceiling performance on "Word list sorting" (99.6%) suggests this task may be less challenging for these models or that the meta-buffer is exceptionally well-suited for it. The chart effectively argues that the meta-buffer is not an optional add-on but a core component for achieving robust performance across diverse reasoning tasks.
</details>
Figure 8: We conduct an ablation study on the meta-buffer across four benchmarks, employing Llama3-70B and GPT-4 as the base models.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Ablation study of buffer-manager -- Accuracy
### Overview
This image is a line chart comparing the accuracy performance of two system configurations over four sequential rounds. The chart is titled "Ablation study of buffer-manager -- Accuracy" and demonstrates the impact of including or excluding a "buffer-manager" component on the overall accuracy of a system referred to as "BoT+GPT4".
### Components/Axes
* **Chart Title:** "Ablation study of buffer-manager -- Accuracy" (Top Center)
* **Y-Axis:**
* **Label:** "Accuracy (%)" (Left side, vertical)
* **Scale:** Linear scale from 0 to 100, with major gridlines and labels at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
* **X-Axis:**
* **Labels:** "Round 1", "Round 2", "Round 3", "Round 4" (Bottom, horizontal). These represent discrete, sequential evaluation points.
* **Legend:** Located at the bottom center of the chart.
* **Blue line with circular markers:** "BoT+GPT4"
* **Orange line with circular markers:** "BoT+GPT4 (w/o buffer-manager)"
### Detailed Analysis
The chart plots two data series, each with four data points corresponding to the four rounds.
**Data Series 1: BoT+GPT4 (Blue Line)**
* **Trend:** Shows a strong, positive, upward trend. The accuracy increases sharply from Round 1 to Round 2, continues to increase at a slower rate to Round 3, and then plateaus with a very slight increase to Round 4.
* **Data Points (Values are labeled in red above each marker):**
* Round 1: 56.8%
* Round 2: 78.5%
* Round 3: 87.4%
* Round 4: 88.5%
**Data Series 2: BoT+GPT4 (w/o buffer-manager) (Orange Line)**
* **Trend:** Shows a relatively flat, stagnant trend with minor fluctuations. Accuracy increases slightly from Round 1 to Round 3, then decreases at Round 4.
* **Data Points (Values are labeled in gray below each marker):**
* Round 1: 52.8%
* Round 2: 53.6%
* Round 3: 57.4%
* Round 4: 54.1%
### Key Observations
1. **Significant Performance Gap:** There is a large and growing accuracy gap between the two configurations. The system with the buffer-manager (blue) consistently outperforms the system without it (orange).
2. **Diverging Trajectories:** The two lines diverge significantly after Round 1. The blue line ascends rapidly, while the orange line remains nearly horizontal.
3. **Peak Performance:** The highest accuracy achieved is 88.5% by the "BoT+GPT4" configuration at Round 4.
4. **Performance Drop:** The configuration without the buffer-manager experiences a performance drop of 3.3 percentage points between Round 3 (57.4%) and Round 4 (54.1%), while the other configuration continues to improve.
### Interpretation
This ablation study provides strong evidence for the critical role of the "buffer-manager" component in the BoT+GPT4 system. The data suggests that the buffer-manager is not merely an incremental improvement but a fundamental component for achieving high accuracy and, crucially, for enabling **continuous learning or improvement over successive rounds**.
* **Without the buffer-manager (orange line):** The system's performance is capped at a low-to-mid 50% range and shows no capacity for meaningful improvement across rounds. The dip in Round 4 may indicate instability or an inability to leverage subsequent data effectively.
* **With the buffer-manager (blue line):** The system demonstrates a clear capacity for learning, with accuracy improving by over 30 percentage points from Round 1 to Round 4. The most substantial gain occurs between the first two rounds, suggesting the buffer-manager is essential for initial knowledge integration or context management.
In essence, the chart illustrates that the buffer-manager is the key differentiator that transforms the system from one with static, mediocre performance into one capable of progressive and significant accuracy gains. The "ablation" (removal) of this component severely cripples the system's functionality.
</details>
Figure 9: We conduct an ablation study on the buffer-manager regarding reasoning accuracy across four tasks, employing Llama3-70B and GPT-4 as the base models.
**Impact of Buffer-Manager**
In this ablation study, we divide the entire process into four rounds. In each round, we randomly sample 50 questions from each benchmark and conduct reasoning; in the subsequent round, we sample another 50 questions from each benchmark. As depicted in Figure 9, as the number of rounds increases, the model with the buffer-manager continually expands the meta-buffer while utilizing the thought-templates obtained from previously solved problems to help address subsequent similar problems. Consequently, the accuracy of BoT steadily improves with each round, whereas the model without the buffer-manager fails to exhibit an upward trend. Additionally, we measure the reasoning time, as depicted in Figure 10. As the number of rounds increases, the model with the buffer-manager shows a continual improvement in reasoning efficiency. This is because, as the meta-buffer continually expands, the likelihood of retrieving a suitable thought-template also increases; models can thus avoid constructing reasoning structures from scratch, enhancing inference efficiency accordingly.
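The accuracy and efficiency trends in Figures 9 and 10 follow from a simple mechanism, caricatured below. The exact-match lookup is a deliberate simplification of the buffer-manager's actual template distillation and retrieval:

```python
# Toy round-by-round buffer-manager: a problem type already covered by a
# stored template is a cheap "hit"; otherwise a new template is distilled
# (here: simply added), so later rounds hit more often.

def solve_rounds(rounds):
    meta_buffer = set()
    hits_per_round = []
    for problems in rounds:
        hits = 0
        for problem_type in problems:
            if problem_type in meta_buffer:
                hits += 1                      # retrieve and instantiate a template
            else:
                meta_buffer.add(problem_type)  # distill and store a new template
        hits_per_round.append(hits)
    return hits_per_round

rounds = [["game24", "chess"], ["game24", "sort"], ["chess", "sort"]]
print(solve_rounds(rounds))  # [0, 1, 2] -- template hits climb as the buffer fills
```

Without the buffer-manager, the meta-buffer never grows, every round looks like the first, and neither accuracy nor latency improves.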
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: Ablation Study of Buffer-Manager -- Time
### Overview
This image is a line chart presenting the results of an ablation study. It compares the average inference time per problem across four sequential rounds for two different system configurations: one with a "buffer-manager" component and one without. The chart clearly demonstrates the impact of the buffer-manager on performance over time.
### Components/Axes
* **Chart Title:** "Ablation study of buffer-manager -- Time"
* **Y-Axis:**
* **Label:** "Average inference time per problem (s)"
* **Scale:** Linear scale from 0 to 350 seconds, with major gridlines at intervals of 50 seconds (0, 50, 100, 150, 200, 250, 300, 350).
* **X-Axis:**
* **Label:** Implicitly represents sequential rounds or iterations.
* **Categories:** "Round 1", "Round 2", "Round 3", "Round 4".
* **Legend:** Located at the bottom center of the chart.
* **Blue Line with Square Marker:** "BoT+GPT4" (This is the configuration *with* the buffer-manager).
* **Orange Line with Diamond Marker:** "BoT+GPT4 (w/o buffer-manager)" (This is the configuration *without* the buffer-manager).
### Detailed Analysis
The chart plots two data series, with exact values labeled at each data point.
**Data Series 1: BoT+GPT4 (Blue Line)**
* **Trend:** Shows a strong, consistent downward slope from Round 1 to Round 4.
* **Data Points:**
* Round 1: 297 seconds
* Round 2: 205 seconds
* Round 3: 128 seconds
* Round 4: 78.5 seconds
**Data Series 2: BoT+GPT4 (w/o buffer-manager) (Orange Line)**
* **Trend:** Remains relatively flat and stable across all rounds, with a slight peak in Round 2.
* **Data Points:**
* Round 1: 308 seconds
* Round 2: 317 seconds
* Round 3: 304 seconds
* Round 4: 306 seconds
### Key Observations
1. **Performance Divergence:** The two configurations start with similar performance in Round 1 (297s vs. 308s). However, their paths diverge dramatically immediately after.
2. **Improvement Trend:** The "BoT+GPT4" system (with buffer-manager) exhibits a significant and continuous improvement in speed, reducing its average inference time by approximately 73.6% from Round 1 to Round 4 (from 297s to 78.5s).
3. **Stagnation Trend:** The system without the buffer-manager shows no meaningful improvement. Its performance fluctuates slightly around a high baseline of approximately 309 seconds (average of the four points).
4. **Crossover Point:** The performance lines cross between Round 1 and Round 2. By Round 2, the system with the buffer-manager is already substantially faster (205s vs. 317s).
### Interpretation
This ablation study provides strong evidence for the efficacy of the "buffer-manager" component within the "BoT+GPT4" system architecture.
* **What the data suggests:** The buffer-manager is not merely an incremental improvement; it is a critical component for achieving performance gains over successive operational rounds. The system without it fails to learn or optimize its process, resulting in stagnant, high inference times.
* **How elements relate:** The x-axis (Rounds) likely represents iterative problem-solving or learning cycles. The y-axis measures efficiency. The stark contrast between the two lines isolates the buffer-manager as the causal factor for the observed efficiency gains. The blue line's steep negative slope indicates effective optimization or caching, while the orange line's flatness indicates a lack thereof.
* **Notable Anomalies:** The slight peak for the orange line at Round 2 (317s) is a minor anomaly but does not change the overall narrative of stagnation. The most striking "anomaly" is the sheer magnitude of the performance gap that opens up by Round 4 (78.5s vs. 306s), highlighting the component's importance.
* **Underlying Implication:** The study implies that the buffer-manager enables the system to retain and leverage information or state from previous rounds, leading to faster solutions in subsequent rounds. Without it, each round is treated as a largely independent, and therefore slower, computation. This has significant implications for the scalability and practical deployment of the system.
</details>
Figure 10: We conduct an ablation study on the buffer-manager regarding reasoning efficiency across four tasks, employing Llama3-70B and GPT-4 as the base models.
## 7 Discussion
**Limitations and Future Directions**
Despite our method's significant accuracy improvements while maintaining reasoning efficiency and robustness, its gains are limited when addressing problems that require human-like creativity, as such problems often do not rely on a specific thought-template. Besides, if our BoT initializes the meta-buffer with a weaker model, the quality of the derived thought-templates may be suboptimal due to the weaker model's limited reasoning ability and instruction-following capability. Overall, our BoT points to a set of future directions: 1. integrating external resources with BoT to build an open-domain system like agent models [54, 55]; 2. making the distillation of thought-templates optimizable, which may significantly enhance template quality for more complex tasks.
**Conclusion**
In this work, we introduce Buffer of Thoughts, a novel buffered reasoning framework that enables LLMs to utilize pre-accumulated experiences and methodologies from prior tasks as thought-templates stored within a meta-buffer. We further design a buffer-manager to continuously refine the problem-solving processes and dynamically distill thought-templates, thereby progressively raising the LLMs' reasoning capacity. Our BoT demonstrates SOTA performance on 10 challenging tasks, and offers promising prospects for future research and application.
## References
- [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [2] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
- [3] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- [4] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, 2022.
- [5] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
- [6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- [7] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- [8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
- [9] B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, “Expertprompting: Instructing large language models to be distinguished experts,” arXiv preprint arXiv:2305.14688, 2023.
- [10] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” in International Conference on Machine Learning, pp. 10764–10799, PMLR, 2023.
- [11] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [12] M. Yasunaga, X. Chen, Y. Li, P. Pasupat, J. Leskovec, P. Liang, E. H. Chi, and D. Zhou, “Large language models as analogical reasoners,” International Conference on Learning Representations, 2024.
- [13] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [14] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [15] M. Suzgun and A. T. Kalai, “Meta-prompting: Enhancing language models with task-agnostic scaffolding,” arXiv preprint arXiv:2401.12954, 2024.
- [16] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le, et al., “Least-to-most prompting enables complex reasoning in large language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [17] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 17682–17690, 2024.
- [18] A. Asai, S. Min, Z. Zhong, and D. Chen, “Retrieval-based language models and applications,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pp. 41–46, 2023.
- [19] G. Mialon, R. Dessi, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Roziere, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al., “Augmented language models: a survey,” Transactions on Machine Learning Research, 2023.
- [20] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih, “Replug: Retrieval-augmented black-box language models,” arXiv preprint arXiv:2301.12652, 2023.
- [21] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
- [22] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang, and B. Cui, “Retrieval-augmented generation for ai-generated content: A survey,” arXiv preprint arXiv:2402.19473, 2024.
- [23] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning, pp. 2206–2240, PMLR, 2022.
- [24] M. Yasunaga, A. Aghajanyan, W. Shi, R. James, J. Leskovec, P. Liang, M. Lewis, L. Zettlemoyer, and W.-T. Yih, “Retrieval-augmented multimodal language modeling,” in International Conference on Machine Learning, pp. 39755–39769, PMLR, 2023.
- [25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,” Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023.
- [26] Z. Wang, W. Nie, Z. Qiao, C. Xiao, R. Baraniuk, and A. Anandkumar, “Retrieval-based controllable molecule generation,” in The Eleventh International Conference on Learning Representations, 2022.
- [27] L. Yang, Z. Huang, X. Zhou, M. Xu, W. Zhang, Y. Wang, X. Zheng, W. Yang, R. O. Dror, S. Hong, et al., “Prompt-based 3d molecular diffusion models for structure-based drug design,” 2023.
- [28] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
- [29] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the compositionality gap in language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, 2023.
- [30] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, and C. Re, “Ask me anything: A simple strategy for prompting language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [31] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal, “Decomposed prompting: A modular approach for solving complex tasks,” in The Eleventh International Conference on Learning Representations, 2022.
- [32] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022.
- [33] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al., “Language models are multilingual chain-of-thought reasoners,” in The Eleventh International Conference on Learning Representations, 2022.
- [34] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, “Complexity-based prompting for multi-step reasoning,” in The Eleventh International Conference on Learning Representations, 2022.
- [35] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al., “Challenging big-bench tasks and whether chain-of-thought can solve them,” in Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051, 2023.
- [36] H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V. Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” arXiv preprint arXiv:2310.06117, 2023.
- [37] P. Zhou, J. Pujara, X. Ren, X. Chen, H.-T. Cheng, Q. V. Le, E. H. Chi, D. Zhou, S. Mishra, and H. S. Zheng, “Self-discover: Large language models self-compose reasoning structures,” arXiv preprint arXiv:2402.03620, 2024.
- [38] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” Transactions on Machine Learning Research, 2023.
- [39] X. Ning, Z. Lin, Z. Zhou, Z. Wang, H. Yang, and Y. Wang, “Skeleton-of-thought: Large language models can do parallel decoding,” in The Twelfth International Conference on Learning Representations, 2023.
- [40] Y. Zhang, “Meta prompting for agi systems,” arXiv preprint arXiv:2311.11482, 2023.
- [41] J. Chen, R. Xu, Z. Fu, W. Shi, Z. Li, X. Zhang, C. Sun, L. Li, Y. Xiao, and H. Zhou, “E-kar: A benchmark for rationalizing natural language analogical reasoning,” in Findings of the Association for Computational Linguistics: ACL 2022, pp. 3941–3955, 2022.
- [42] O. Sultan and D. Shahaf, “Life is a circus and we are the clowns: Automatically finding analogies between situations and processes,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3547–3562, 2022.
- [43] N. Zhang, L. Li, X. Chen, X. Liang, S. Deng, and H. Chen, “Multimodal analogical reasoning over knowledge graphs,” in The Eleventh International Conference on Learning Representations, 2022.
- [44] B. Bhavya, J. Xiong, and C. Zhai, “Analogy generation by prompting large language models: A case study of instructgpt,” in Proceedings of the 15th International Conference on Natural Language Generation, pp. 298–312, 2022.
- [45] B. Bhavya, J. Xiong, and C. Zhai, “Cam: A large language model-based creative analogy mining framework,” in Proceedings of the ACM Web Conference 2023, pp. 3903–3914, 2023.
- [46] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [47] T. Webb, K. J. Holyoak, and H. Lu, “Emergent analogical reasoning in large language models,” Nature Human Behaviour, vol. 7, no. 9, pp. 1526–1541, 2023.
- [48] J. Yu, R. He, and Z. Ying, “Thought propagation: An analogical approach to complex reasoning with large language models,” in International Conference on Learning Representations, 2024.
- [49] T. Feng, P. Han, G. Lin, G. Liu, and J. You, “Thought-retriever: Don’t just retrieve raw data, retrieve thoughts,” in ICLR 2024 Workshop: How Far Are We From AGI.
- [50] BIG-bench authors, “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Transactions on Machine Learning Research, 2023.
- [51] T. Schuster, A. Kalyan, A. Polozov, and A. T. Kalai, “Programming puzzles,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
- [52] P. Haluptzok, M. Bowers, and A. T. Kalai, “Language models can teach themselves to program better,” in Eleventh International Conference on Learning Representations (ICLR), 2023.
- [53] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [54] G. Chen, S. Dong, Y. Shu, G. Zhang, J. Sesay, B. F. Karlsson, J. Fu, and Y. Shi, “Autoagents: A framework for automatic agent generation,” arXiv preprint arXiv:2309.17288, 2023.
- [55] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation framework,” arXiv preprint arXiv:2308.08155, 2023.
## Appendix A Additional Method Details
### A.1 Detailed Thought-Templates
Here we show six example thought-templates in six different categories:
#### A.1.1 Text Comprehension
Task Description: The task involves analyzing a table with various attributes of penguins, such as name, age, height, and weight, and answering questions about these attributes. The table may be updated with new entries, and additional context or comparisons may be provided in natural language.
Solution Description: To accurately answer questions about the penguins’ attributes, one must be able to interpret the data presented in tabular form, understand any additional information provided in natural language, and apply logical reasoning to identify the correct attribute based on the question asked.
Thought Template:
- Step 1: Parse the initial table, extracting the header information and each penguin’s attributes into a structured format (e.g., a list of dictionaries).
- Step 2: Read and integrate any additional natural language information that updates or adds to the table, ensuring the data remains consistent.
- Step 3: Identify the attribute in question (e.g., oldest penguin, heaviest penguin) and the corresponding column in the table.
- Step 4: Apply logical reasoning to compare the relevant attribute across all entries to find the correct answer (e.g., the highest age for the oldest penguin).
- Step 5: Select the answer from the provided options that matches the result of the logical comparison.
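As a rough illustration of these steps (the table values, the column names, and the natural-language update below are invented for this example):

```python
# Step 1: parse the table into a structured format (list of dictionaries).
header = ["name", "age", "height_cm", "weight_kg"]
rows = [["Louis", 7, 50, 11], ["Bernard", 5, 80, 13], ["Vincent", 9, 60, 11]]
penguins = [dict(zip(header, row)) for row in rows]

# Step 2: integrate an update given in natural language,
# e.g. "We now add a penguin named James, aged 12, 90 cm, 12 kg."
penguins.append({"name": "James", "age": 12, "height_cm": 90, "weight_kg": 12})

# Steps 3-4: identify the attribute in question and compare it across entries.
oldest = max(penguins, key=lambda p: p["age"])

# Step 5: the answer option matching this result would then be selected.
print(oldest["name"])  # James
```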
#### A.1.2 Creative Language Generation
Task Description: The task is to generate a sonnet that adheres to the traditional English sonnet rhyme scheme of "ABAB CDCD EFEF GG" and includes three specific words verbatim in the text.
Solution Description: Writing a sonnet involves crafting 14 lines of poetry that follow a specific rhyme pattern. The lines are typically in iambic pentameter, though flexibility in rhythm can be allowed for creative reasons. The given rhyme scheme dictates the end sounds of each line, ensuring a structured poetic form. Incorporating the three provided words verbatim requires strategic placement within the lines to maintain the poem’s coherence and thematic unity.
Thought Template:
- Step 1: Identify the three words that must be included in the sonnet.
- Step 2: Understand the rhyme scheme "ABAB CDCD EFEF GG" and prepare a list of rhyming words that could be used.
- Step 3: Develop a theme or story for the sonnet that can naturally incorporate the three provided words.
- Step 4: Begin drafting the sonnet by writing the first quatrain (four lines) following the "ABAB" rhyme scheme, ensuring one or more of the provided words are included.
- Step 5: Continue with the second quatrain "CDCD," the third quatrain "EFEF," and finally the closing couplet "GG," each time incorporating the provided words as needed.
- Step 6: Review the sonnet for coherence, flow, and adherence to the rhyme scheme, making adjustments as necessary.
#### A.1.3 Common Sense Reasoning
Task Description: Given a specific date and an event, such as a holiday or historical event, determine the following date.
Solution Description: To determine the next date, we need to consider the structure of the calendar, the number of days in each month, and whether it is a leap year. The number of days in a month is fixed, except that February varies with leap years. The next day is usually the date increased by one; at the end of a month, the next day is the first day of the following month, and at the end of the year, it is January 1st of the following year.
Thought Template:
- Step 1: Identify the given date’s month and day number.
- Step 2: Check if it is the end of the month; if so, the next date is the first day of the following month.
- Step 3: If it is not the end of the month, simply add one to the day number.
- Step 4: Pay special attention to the end of the year, ensuring the year increments.
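These calendar rules (month lengths, leap years, year rollover) are exactly what Python's `datetime` arithmetic implements, so the template can be sanity-checked against it:

```python
from datetime import date, timedelta

def next_date(d: date) -> date:
    """Steps 1-4 of the template, delegated to calendar-aware date arithmetic."""
    return d + timedelta(days=1)

print(next_date(date(2024, 2, 28)))   # 2024-02-29 (leap year)
print(next_date(date(2023, 12, 31)))  # 2024-01-01 (year rollover)
```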
#### A.1.4 Mathematical Reasoning
Task Description: Solve a quadratic equation of the form $ax^{2}+bx+c=0$, covering all cases.
Solution Description: To solve any quadratic equation of the form $ax^{2}+bx+c=0$, we can follow a general approach based on the method described. Here is the structured template for solving such equations:
Thought Template:
- Step 1: Calculate the Discriminant. Compute the discriminant $D$ using the formula $D=b^{2}-4ac$.
- Step 2: Determine the Nature of the Roots. If $D>0$, the equation has two distinct real roots. If $D=0$, the equation has exactly one real root (also known as a repeated or double root). If $D<0$, the equation has two complex roots.
- Step 3: Compute the Roots. For $D\geq 0$, calculate the roots using the formula $x=\frac{-b\pm\sqrt{D}}{2a}$. For $D<0$, calculate the real and imaginary parts of the complex roots using the formula $x=\frac{-b}{2a}\pm\frac{\sqrt{-D}}{2a}i$, where $i$ is the imaginary unit.
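A minimal sketch of the three steps in code (the function name and the return convention are ours, not part of the template):

```python
import math

def solve_quadratic(a, b, c):
    # Step 1: compute the discriminant D = b^2 - 4ac.
    d = b * b - 4 * a * c
    # Steps 2-3: real roots for D >= 0, complex roots for D < 0.
    if d >= 0:
        r = math.sqrt(d)
        return ((-b + r) / (2 * a), (-b - r) / (2 * a))
    r = math.sqrt(-d)
    return (complex(-b / (2 * a), r / (2 * a)),
            complex(-b / (2 * a), -r / (2 * a)))

print(solve_quadratic(1, -3, 2))  # (2.0, 1.0): two distinct real roots
print(solve_quadratic(1, 0, 1))   # (1j, -1j): two complex roots
```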
#### A.1.5 Code Programming
Task Description: Given a list of numbers, use the four basic arithmetic operations (+, -, *, /) to reach a target number.
Thought Template:

Listing 1: Python template

```python
from itertools import permutations, product

def perform_operation(a, b, operation):
    # Define the operation logic (e.g., addition, subtraction, etc.).
    pass

def evaluate_sequence(sequence, operations):
    # Apply operations to the sequence and check if the result meets the criteria.
    pass

def generate_combinations(elements, operations):
    # Generate all possible combinations of elements and operations.
    pass

def format_solution(sequence, operations):
    # Format the sequence and operations into a human-readable string.
    pass

def find_solution(input_elements, target_result):
    # Data Input Handling:
    # validate and preprocess input data if necessary.

    # Core Algorithm Logic
    for sequence in permutations(input_elements):
        for operation_combination in generate_combinations(sequence, operations):
            try:
                if evaluate_sequence(sequence, operation_combination) == target_result:
                    # Data Output Formatting
                    return format_solution(sequence, operation_combination)
            except Exception:
                # Error Handling:
                # handle specific exceptions that may occur during evaluation.
                continue
    # If no solution is found after all iterations, return a default message.
    return "No solution found"

# Example usage:
input_elements = [1, 7, 10, 3]
target_result = 24
print(find_solution(input_elements, target_result))
```
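For concreteness, here is one possible way to fill in the stubs of the template above (our sketch, not from the paper): it evaluates each permutation strictly left to right, so it does not enumerate all parenthesizations, but it suffices for the example input.

```python
from itertools import permutations, product

# Map operator symbols to their functions.
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b,
       '/': lambda a, b: a / b}

def find_solution(input_elements, target_result):
    # Try every ordering of the numbers and every choice of operators,
    # applying them left to right (no parenthesization variants).
    for sequence in permutations(input_elements):
        for ops in product(OPS, repeat=len(sequence) - 1):
            try:
                value = sequence[0]
                for op, operand in zip(ops, sequence[1:]):
                    value = OPS[op](value, operand)
            except ZeroDivisionError:
                # Skip operator choices that divide by zero.
                continue
            if abs(value - target_result) < 1e-9:
                # Format the winning sequence as a human-readable expression.
                expr = str(sequence[0])
                for op, operand in zip(ops, sequence[1:]):
                    expr = f"({expr} {op} {operand})"
                return f"{expr} = {target_result}"
    return "No solution found"

print(find_solution([1, 7, 10, 3], 24))
```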
#### A.1.6 Application Scheduling
Task Description: Given some Chess moves in SAN, update the chess board state.
Listing 2: Python template

```python
import chess

def find_checkmate_move(moves_san):
    # Initialize a new chess board
    board = chess.Board()
    # Apply the moves to the board
    for move_san in moves_san:
        # Remove move numbers and periods (e.g., "1." or "2.")
        if len(move_san.split('. ')) > 1:
            move_san = move_san.split('. ')[1]
        # Skip empty strings resulting from the removal
        if move_san:
            # Apply each move in SAN format to the board
            move = board.parse_san(move_san)
            board.push(move)
    # Generate all possible legal moves from the current position
    for move in board.legal_moves:
        # Make the move on a copy of the board to test the result
        board_copy = board.copy()
        board_copy.push(move)
        # Check if the move results in a checkmate
        if board_copy.is_checkmate():
            # Return the move that results in checkmate in SAN format
            return board.san(move)
    # Return a "no solution found" message otherwise
    return

# Example usage:
input = '......'
# Check input format and transform the input into a legal format:
# remove move numbers and periods (e.g., "1." or "2.")
checkmate_move = find_checkmate_move(moves_san)
print(checkmate_move)
```
### A.2 Prompt for Problem Distiller
[Problem Distiller]: As a highly professional and intelligent expert in information distillation, you excel at extracting the essential information needed to solve a problem from a user's input query, and you adeptly transform this extracted information into a format suited to the type of problem. Please categorize and extract the crucial information required to solve the problem from the user's input query. The distilled information should include:
1. Key information: values and information of key variables extracted from the user input, which will be handed over to the respective expert for task resolution, ensuring that all essential information required to solve the problem is provided.
2. Restrictions: the objective of the problem and its corresponding constraints.
3. Distilled task: extend the problem based on 1 and 2, and summarize a meta-problem that can address the user query and handle more input and output variations. Incorporate the real-world scenario of the extended problem, along with the types of key variables and the information constraints from the original problem, to restrict the key variables in the extended problem. Then, using the key information from the user query as input, solve the problem as an example.
### A.3 Prompt for Instantiated Reasoning
[Meta Reasoner] You are a Meta Reasoner who is extremely knowledgeable in all kinds of fields, including Computer Science, Math, Physics, Literature, History, Chemistry, Logical Reasoning, Culture, Language… You are also able to find different high-level thoughts for different tasks. Here are three reasoning structures:
i) Prompt-based structure: performs well on problems such as Common Sense Reasoning and Application Scheduling.
ii) Procedure-based structure: performs well on creative tasks such as Creative Language Generation and Text Comprehension.
iii) Programming-based structure: performs well on Mathematical Reasoning and Code Programming; it can also transform real-world problems into programming problems that can be solved efficiently.
(Reasoning instantiation) Your task is:
1. Deliberately consider the context and the problem within the distilled response from the problem distiller, and use your understanding of the question to find a domain expert who is suitable to solve the problem.
2. Considering the distilled information, choose one reasoning structure for the problem.
3. If a thought-template is provided, directly follow the thought-template to instantiate it for the given problem.