# LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning
> The majority of this work was done while Zifan Xu was an intern at Amazon Web Services during the summer of 2023. A portion of this work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (FAIN-2019844, NRT-2125858), ONR (N00014-18-2243), ARO (W911NF-23-2-0004, W911NF-17-2-0181), Lockheed Martin, and UT Austin’s Good Systems grand challenge. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.
## Abstract
Chain-of-thought (CoT) prompting is a popular in-context learning (ICL) approach for large language models (LLMs), especially when tackling complex reasoning tasks. Traditional ICL approaches construct prompts using examples that contain questions similar to the input question. However, CoT prompting, which includes crucial intermediate reasoning steps (rationales) within its examples, necessitates selecting examples based on these rationales rather than the questions themselves. Existing methods require human experts or pre-trained LLMs to describe the skill, a high-level abstraction of rationales, to guide the selection. These methods, however, are often costly and difficult to scale. Instead, this paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales, with a latent variable called a reasoning skill. Concurrently, LaRS learns a reasoning policy to determine the required reasoning skill for a given question. The ICL examples are then selected by aligning the reasoning skills between past examples and the question. This approach is theoretically grounded and compute-efficient, eliminating the need for auxiliary LLM inference or manual prompt design. Empirical results demonstrate that LaRS consistently outperforms SOTA skill-based selection methods, processing example banks four times faster, reducing LLM inferences during the selection stage by half, and showing greater robustness to sub-optimal example banks. Our code is publicly available here.
Zifan Xu 1, Haozhu Wang 2, Dmitriy Bespalov 2, Xian Wu 2, Peter Stone 1,3, Yanjun Qi 2 — 1 The University of Texas at Austin, 2 Amazon Web Services, 3 Sony AI
## 1 Introduction
Large Language Models (LLMs) exhibit remarkable capabilities in solving various downstream tasks through in-context learning (ICL) Brown et al. (2020), even without being explicitly trained on the distribution of in-context examples Vaswani et al. (2017); Devlin et al. (2019); Rae et al. (2021); Chowdhery et al. (2022); Wei et al. (2022a). Using in-context learning, LLMs generate output for an input query by conditioning on a prompt that contains a few input-output demonstrations.
<details>
<summary>extracted/6556870/content/figures/similarity_based_selection.png Details</summary>

### Visual Description
## Diagram: LLM Reasoning with Example Retrieval and Skill Mismatch
### Overview
This image is a conceptual diagram illustrating a process where a Large Language Model (LLM) attempts to solve a math word problem by retrieving a similar example from a bank, but the retrieved example leads to incorrect reasoning due to a "Skill Mismatching." The diagram is divided into two main sections by a vertical dashed line: the left side depicts the input and example retrieval process, and the right side depicts the LLM's processing and flawed output.
### Components/Axes
The diagram is not a chart with axes but a flow diagram with labeled components and directional arrows.
**Left Section (Input & Retrieval):**
* **Top-Left:** A text box labeled **"Input Query:"** followed by the problem text: *"2 toucans are sitting on a tree limb. 1 more toucan joins them. How many toucans in all?"* The text "2 toucans are sitting on a tree limb." and "How many toucans in all?" is in red font.
* **Below Input Query:** A black circle icon containing a white question mark.
* **Flow Arrow:** A black arrow points downward from the question mark icon to a text box.
* **Text Box:** Labeled **"Select similar question"**.
* **Flow Arrow:** A black arrow points from this text box to the right, towards an icon.
* **Icon:** A black silhouette of a classical building (like a bank or library).
* **Label:** Next to the building icon is the text **"Example Bank"**.
* **Flow Arrow:** A black arrow points downward from the "Select similar question" text box to another text box.
* **Text Box:** Labeled **"Examples:"**.
* **Icon:** To the left of the "Examples:" label is a black network/node icon (a central dot connected to five outer dots).
* **Example Text:** To the right of the "Examples:" label is the retrieved example: *"Question: 2 toucans are sitting on a tree limb. 1 toucan left them. How many toucans left? Rationale: We subtract 2 from 1 and get 1."* The text "2 toucans are sitting on a tree limb." and "How many toucans left?" is in red font.
**Right Section (LLM Processing & Output):**
* **Vertical Divider:** A black dashed line runs vertically down the center of the image.
* **Vertical Label:** To the right of the dashed line, the text **"Skill Mismatching"** is written vertically in red font.
* **Top-Right:** A text box labeled **"Rationale:"** followed by the LLM's output: *"We subtract 2 from 1 and get 1."* A large red "X" mark is superimposed over the word "subtract".
* **Icon:** Below the "Rationale:" label is a black speech bubble icon with three dots inside.
* **Flow Arrow:** A black arrow points upward from a central icon to the speech bubble.
* **Central Icon:** A black gear-shaped icon with the text **"LLM"** inside it.
* **Flow Arrow:** A black arrow points upward from a lower box to the "LLM" gear icon.
* **Lower Box:** A dashed-line box containing two icons connected by a plus sign: the network/node icon (from the Examples) and the question mark icon (from the Input Query).
* **Label:** Above this dashed box is the text **"CoT Prompt:"**.
* **Flow Arrow:** A black arrow points from the "Examples:" section on the left, across the dashed line, into the dashed "CoT Prompt" box on the right.
### Detailed Analysis
The diagram outlines a specific failure mode in an AI reasoning pipeline:
1. **Input:** The system receives an addition problem ("1 more joins them").
2. **Retrieval:** It searches an "Example Bank" for a similar question. It retrieves an example that is structurally similar (same setup with toucans on a limb) but involves a different operation (subtraction: "1 toucan left them").
3. **Prompt Construction:** The retrieved example (network icon) and the original query (question mark icon) are combined to form a Chain-of-Thought (CoT) Prompt.
4. **LLM Processing:** This combined prompt is fed into the LLM.
5. **Flawed Output:** The LLM, influenced by the retrieved example's rationale, generates an incorrect "Rationale" for the original problem. It incorrectly applies subtraction ("We subtract 2 from 1") instead of the required addition, leading to the wrong answer (1 instead of 3). The red "X" explicitly marks this as an error.
6. **Diagnosis:** The vertical red label "Skill Mismatching" identifies the root cause: the retrieved example, while superficially similar, required a different mathematical skill (subtraction vs. addition), which misled the LLM.
### Key Observations
* **Visual Emphasis:** Red font is used strategically to highlight key problem elements: the critical parts of the word problems ("2 toucans...", "How many...") and the core issue ("Skill Mismatching").
* **Error Highlighting:** The large red "X" over "subtract" is a clear visual marker of the logical error in the generated rationale.
* **Iconography:** Simple, universal icons (question mark, building, network, speech bubble, gear) are used to represent abstract concepts (query, bank, examples, output, model).
* **Flow Clarity:** Arrows clearly trace the path from input, through retrieval and prompt construction, to the LLM and its erroneous output.
* **Spatial Separation:** The dashed line cleanly separates the retrieval subsystem (left) from the model execution subsystem (right), with "Skill Mismatching" labeling the interface problem between them.
### Interpretation
This diagram serves as a critical case study in the limitations of example-based or retrieval-augmented generation for reasoning tasks. It demonstrates that **similarity in surface form (topic, sentence structure) does not guarantee similarity in underlying reasoning skill**.
The data suggests that a naive retrieval system can introduce harmful bias. By providing an example that uses subtraction, it "primes" the LLM to perform subtraction, even when the new problem context ("joins them") clearly calls for addition. This is a form of **negative transfer** or **distractor interference**.
The diagram argues that for robust AI reasoning, systems need more than just similar examples; they require:
1. **Skill-aware retrieval:** Finding examples that match the *operation* or *reasoning pattern* needed, not just the topic.
2. **Meta-cognitive checks:** The ability for the model to recognize when a retrieved example's rationale conflicts with the problem's requirements.
3. **Disentanglement:** Separating the retrieval of relevant knowledge from the application of reasoning skills.
In essence, the image warns that without careful design, the very mechanisms meant to help AI reason (like providing examples) can become the source of its failure, highlighting the importance of precision in the "skill" being matched during the retrieval process.
</details>
(a) Question-similarity-based selection.
<details>
<summary>extracted/6556870/content/figures/skill_based_selection.png Details</summary>

### Visual Description
## Diagram: Skill-Based Reasoning Process for an LLM
### Overview
The image is a flowchart diagram illustrating a process for solving a simple arithmetic word problem using a skill-based reasoning approach with a Large Language Model (LLM). The diagram is divided into two main sections by a vertical dashed line. The left side details the process of abstracting a skill from an input query and retrieving relevant examples. The right side shows how these elements are used to prompt an LLM to generate a rationale and answer.
### Components/Axes
The diagram is composed of text labels, icons, and directional arrows on a light gray background.
**Left Section (Pre-processing / Skill Retrieval):**
1. **Top-Left:** An icon of a question mark inside a circle, labeled **"Input Query:"**. The query text is: *"2 toucans are sitting on a tree limb. 1 more toucan joins them. How many toucans in all?"*
2. **Below Input Query:** A downward arrow points to a box labeled **"Inference Skill"**.
3. **Below Inference Skill:** A lightbulb icon next to a box labeled **"Skill abstraction: addition"**. The word "addition" is in blue italic font.
4. **Below Skill Abstraction:** A box labeled **"Select similar skill"**.
5. **Right of Select Similar Skill:** A horizontal arrow points to an icon of a classical building (representing a bank or repository) labeled **"Example Bank"**.
6. **Below Example Bank:** A downward arrow points to a section labeled **"Examples:"** with a network/node icon. The example text is:
* *"Question: Seven red apples and two green apples are in the basket. How many apples are in the basket?"*
* *"Rationale: We add 7 to 2 and get 9"*. The word "add" is in blue italic font.
**Right Section (LLM Processing / Output):**
1. **Vertical Divider:** A dashed line separates the left and right sections. The text **"Skill Matching"** is written vertically in green along this line.
2. **Bottom-Right:** A box labeled **"CoT Prompt:"** (Chain-of-Thought Prompt). Inside a dashed rectangle, it shows the network/node icon (from "Examples") plus the question mark icon (from "Input Query"), connected by a plus sign.
3. **Above CoT Prompt:** An upward arrow points to a gear-shaped icon labeled **"LLM"**.
4. **Above LLM:** An upward arrow points to a speech bubble icon with three dots, labeled **"Rationale:"**. The rationale text is: *"We add 2 to 1 and get 3."* The word "add" is in blue italic font. A green checkmark is placed to the right of this text.
### Detailed Analysis
The diagram outlines a multi-step workflow:
1. **Query Ingestion:** The process begins with a natural language input query about toucans.
2. **Skill Abstraction:** The system performs "Inference Skill" to abstract the core mathematical skill required, which is identified as "addition".
3. **Example Retrieval:** Using the abstracted skill ("Select similar skill"), the system queries an "Example Bank" to find relevant demonstration examples. The retrieved example is a similar addition problem about apples.
4. **Prompt Construction:** A Chain-of-Thought (CoT) prompt is constructed by combining the retrieved example(s) with the original input query. This is represented by the network icon (examples) + question mark icon (query).
5. **LLM Inference:** This combined prompt is fed into an "LLM".
6. **Output Generation:** The LLM processes the prompt and generates a "Rationale" that mirrors the structure of the example, providing a step-by-step reasoning ("We add 2 to 1") and the final answer ("and get 3"), which is marked as correct with a green checkmark.
### Key Observations
* **Color Coding:** Blue italics are consistently used to highlight the key operational verb ("add"/"addition") across the abstraction, example, and final rationale. Green is used for the "Skill Matching" label and the success checkmark.
* **Iconography:** Simple, universal icons (question mark, lightbulb, building, network, gear, speech bubble) are used to represent abstract concepts like query, idea, repository, data, processing, and output.
* **Flow Direction:** The flow is primarily top-to-bottom on the left (processing the query) and bottom-to-top on the right (generating the output), connected by the central "Skill Matching" process.
* **Structural Mirroring:** The final rationale ("We add 2 to 1 and get 3.") directly mirrors the format of the retrieved example's rationale ("We add 7 to 2 and get 9."), demonstrating the in-context learning mechanism.
### Interpretation
This diagram demonstrates a **modular, skill-augmented reasoning framework** for LLMs. Instead of relying solely on the LLM's parametric knowledge to solve a problem from scratch, the system:
1. **Decomposes** the problem into an abstract skill ("addition").
2. **Retrieves** a concrete, relevant example from an external bank that demonstrates that skill.
3. **Augments** the prompt with this example, providing the LLM with a clear template for the desired reasoning process (Chain-of-Thought).
The core insight is that by matching the *skill* required by a new query to *examples* of that skill, the system can guide the LLM to produce more reliable and structured reasoning. The green checkmark signifies that this process leads to a correct answer. This approach aims to improve performance on reasoning tasks by reducing the cognitive load on the LLM and providing explicit, task-specific guidance through retrieved examples. It represents a move towards more controlled and interpretable AI reasoning systems.
</details>
(b) Skill-based selection.
Figure 1: CoT prompting with examples selected by (a) similar questions, whose rationales may showcase mismatched skills, and (b) similar skills, whose rationales match the skill the query requires.
<details>
<summary>extracted/6556870/content/figures/lars.png Details</summary>

### Visual Description
## Diagram: Reasoning Skill Encoding and Example Selection System
### Overview
The image is a technical system diagram illustrating a two-stage process for handling reasoning tasks. The left stage, "Pre-Processing," details how reasoning skills are encoded from a bank of examples. The right stage, "Selection," shows how a new input query is processed to retrieve relevant examples from a learned skill space. The overall system appears designed for few-shot learning or in-context learning, where a model selects pertinent examples to solve new problems.
### Components/Axes
The diagram is divided into two primary sections by a vertical dashed line.
**Left Section: Pre-Processing**
* **Header:** "Pre-Processing" (top-left, black text in a white box with a black border).
* **Input Example:**
* A blue-bordered box contains a question (Q): "Seven red apples and two green apples are in the basket. How many apples are in the basket?"
* A yellow-bordered box contains the corresponding reasoning/response (R): "We add 7 to 2 and get 9."
* **Example Bank:** An icon of a classical building (labeled "Example Bank") points to a circle containing multiple grey dots, representing a collection of stored examples.
* **Core Processing Pipeline:**
1. **Off-the-Shelf Embedding Model:** A rounded rectangle. It receives the Question (Q) and Response (R) as inputs.
2. **Outputs of Embedding Model:**
* A blue rectangle labeled **Q** (embedding of the question).
* A yellow rectangle labeled **R** (embedding of the response).
3. **Reasoning Policy:** A rounded rectangle with a lightbulb icon. It takes the question embedding **Q** as input.
4. **Latent Variable (z):** The Reasoning Policy outputs a pink rectangle labeled **z**. This represents a latent skill code.
5. **Skill Space Visualization:** A large pink circle containing clusters of red dots. An arrow from **z** points to a specific cluster, indicating that **z** selects or represents a region in this skill space. The text "Reasoning Skills" is placed next to this circle.
6. **Conditional Variational Auto-Encoder (CVAE):** A dotted-line box enclosing:
* **Reasoning Skill Encoder:** A trapezoid taking the response embedding **R** as input and outputting a pink rectangle labeled **z** (the encoded skill).
* **Decoder:** A trapezoid taking the latent skill **z** as input and outputting a yellow dashed rectangle labeled **R̂** (reconstructed response embedding).
* A wavy yellow line connects **R̂** to the original **R**, indicating a reconstruction loss or similarity measure.
**Right Section: Selection**
* **Header:** "Selection" (top-center, black text in a white box with a black border).
* **Input Query:**
* A black circle with a white question mark icon is labeled "Input Query:".
* Below it, a blue-bordered box contains a new question (Q): "2 toucans are sitting on a tree limb. 1 more toucan joins them. How many toucans in all?"
* **Processing Pipeline for New Query:**
1. **Off-the-Shelf Embedding Model:** A rounded rectangle receives the new query **Q**.
2. **Question Embedding (Q):** A blue rectangle, output of the embedding model.
3. **Reasoning Policy:** A rounded rectangle with a lightbulb icon takes **Q** as input.
4. **Latent Skill Code (z):** The Reasoning Policy outputs a pink rectangle labeled **z**.
* **Example Selection:**
* A large pink circle (same "Skill Space" as on the left) contains grey dots and one highlighted red cluster.
* The latent code **z** points to this red cluster.
* A red arrow originates from this cluster and points to a red-bordered box labeled "Selected examples," which contains three red dots. This indicates the retrieval of examples whose skill codes are near **z** in the latent space.
### Detailed Analysis
The diagram describes a method to learn and utilize "reasoning skills" for problem-solving.
1. **Pre-Processing (Skill Encoding):**
* The system starts with a bank of solved examples (Q, R pairs).
* For each example, an off-the-shelf model creates embeddings for the question (Q) and the reasoning response (R).
* A **Reasoning Policy** network analyzes the question embedding (Q) to produce a latent variable **z**. This **z** is intended to capture the *type of reasoning skill* required (e.g., addition, comparison).
* A **Conditional Variational Auto-Encoder (CVAE)** is trained to ensure the latent space is meaningful. The encoder maps the response embedding (R) to a skill code **z**, and the decoder tries to reconstruct the response (R̂) from that code. The link between the policy's **z** and the encoder's **z** (shown by a wavy red line) suggests they are trained to be consistent or are the same network.
* The result is a structured "Skill Space" (pink circle) where examples are clustered by the underlying reasoning skill they demonstrate.
2. **Selection (Skill-Based Retrieval):**
* When a new, unseen query arrives, it is embedded and passed through the same **Reasoning Policy**.
* The policy predicts the latent skill code **z** that this new query likely requires.
* This **z** is used to query the pre-processed skill space. The system selects examples from the cluster nearest to **z**.
* These "Selected examples" are then presumably provided as few-shot context to a final model to solve the new query.
### Key Observations
* **Two-Stage Architecture:** The system clearly separates the offline encoding of existing knowledge (Pre-Processing) from the online application to new problems (Selection).
* **Latent Skill Space:** The core innovation is the creation of an interpretable, structured latent space (**z**) where proximity corresponds to similarity in reasoning type, not just surface-level text similarity.
* **Role of the CVAE:** The CVAE acts as a regularizer, forcing the latent variable **z** to contain sufficient information to reconstruct the reasoning response (R), thereby ensuring it captures meaningful skill information.
* **Use of Off-the-Shelf Models:** The diagram specifies the use of pre-existing embedding models, indicating this framework is model-agnostic and can be built on top of existing language models.
* **Visual Consistency:** Colors are used consistently: blue for questions/queries (Q), yellow for responses/reasoning (R), pink for the latent skill space and codes (z), and red for selected items.
### Interpretation
This diagram outlines a sophisticated approach to improving AI reasoning through structured example retrieval. The key insight is that not all examples are equally useful for a given problem; usefulness depends on the underlying reasoning skill.
* **What it demonstrates:** The system learns to abstract away from the specific content of a problem (apples vs. toucans) to identify the core reasoning operation (addition). By clustering examples by this abstract skill, it can retrieve the most pedagogically relevant examples for a new problem, which should lead to more efficient and accurate few-shot learning.
* **How elements relate:** The **Reasoning Policy** is the central component, acting as a bridge between the input question and the skill space. The **CVAE** provides the training framework to make the skill space meaningful. The **Example Bank** is the raw material, and the **Selection** module is the application.
* **Notable implications:** This method could reduce the prompt sensitivity of large language models by providing more consistently relevant examples. It also introduces a level of interpretability, as the latent space **z** could potentially be analyzed to understand what reasoning skills the model has learned. The separation of skill encoding from problem-solving allows the skill space to be built once and reused for many different queries.
</details>
Figure 2: An overview of LaRS including a pre-processing stage (left) and a selection stage (right).
Reasoning tasks have proven to be particularly difficult for language models and NLP in general Rae et al. (2021); Bommasani et al. (2021); Nye et al. (2021). In the recent literature, chain-of-thought (CoT) prompting, an ICL method, has been proposed to improve LLMs on a wide spectrum of reasoning tasks by guiding LLMs to produce a sequence of intermediate steps (a rationale) before generating a (better) final answer Cobbe et al. (2021a); Wei et al. (2022b); Suzgun et al. (2022). The prompts for CoT are composed of demonstrations that contain not only inputs and outputs, but also rationales explaining why the output holds.
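Concretely, a CoT prompt concatenates such demonstrations before the input query. The sketch below shows one minimal way to assemble it; the formatting and the helper name `build_cot_prompt` are illustrative, not prescribed by the paper:

```python
# Minimal sketch of assembling a chain-of-thought (CoT) prompt from
# demonstrations that include a rationale, not just an input/output pair.
# The demonstration format below is illustrative, not from the paper.

def build_cot_prompt(demonstrations, query):
    """Concatenate (question, rationale, answer) demonstrations before the query."""
    parts = []
    for question, rationale, answer in demonstrations:
        parts.append(f"Question: {question}\nRationale: {rationale}\nAnswer: {answer}\n")
    # The prompt ends mid-demonstration, inviting the LLM to continue
    # with a rationale for the new query.
    parts.append(f"Question: {query}\nRationale:")
    return "\n".join(parts)

demos = [
    ("Seven red apples and two green apples are in the basket. "
     "How many apples are in the basket?",
     "We add 7 to 2 and get 9.", "9"),
]
prompt = build_cot_prompt(
    demos,
    "2 toucans are sitting on a tree limb. 1 more toucan joins them. "
    "How many toucans in all?",
)
print(prompt)
```

The resulting string is sent to the LLM as-is; the model is expected to imitate the demonstrated question-rationale-answer structure.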
The core challenge for ICL lies in designing effective demonstrations to prompt LLMs. Much evidence has indicated the significant impact of demonstrations on the performance of ICL Lu et al. (2021); Liu et al. (2021). To form a prompt, one important setting considers selecting demonstrations from an existing example bank, termed demonstration selection Dong et al. (2022). While a variety of methods exist in the ICL literature for automating this process, CoT prompts are distinct in that they include not only questions and answers but also specially-designed rationales. This distinction highlights the importance of rationales in selecting demonstrations for CoT prompting. Specifically, CoT prompting should select demonstrations that illustrate relevant skills within their rationales to effectively address a given question. For instance, in solving math word problems (as depicted in Fig. 1), a useful rationale involves applying addition to get the correct answer. Selecting few-shot examples based on question similarity (Fig. 1(a)) might retrieve examples showcasing subtraction and induce incorrect rationales. In contrast, skill-based selection (Fig. 1(b)) aligns the skills of the examples with the given question, leading to correct answers guided by relevant rationales.
To achieve such a skill-based demonstration selection, An et al. (2023b) introduce Skill-KNN, which employs pre-trained LLMs to generate skill descriptions. The few-shot examples are then selected based on embeddings of the skill descriptions computed by another pre-trained embedding model. Although this approach is straightforward, the LLM-generated skill descriptions can be somewhat arbitrary, relying heavily on manually crafted prompts. This reliance constrains its wider applicability across diverse reasoning tasks. Moreover, the approach requires generating a unique skill description for each example, which limits its scalability to larger example banks.
Rather than relying on LLMs, we introduce Latent Reasoning Skills (LaRS), a new skill-based demonstration selection method. This approach learns skills as latent space representations of rationales through unsupervised learning. The essence of LaRS lies in a unique formulation for the generation of rationales, which we term the latent skill model. This model, inspired by the principles of topic models Xie et al. (2021a), conditions the generation of a rationale on both a given question and a latent variable, called a reasoning skill. This latent variable embodies a high-level abstraction of the rationales, such as their format, equations, or required knowledge.
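The selection stage built on this latent skill model can be sketched in miniature. In the sketch below the reasoning policy and the pre-processed skill codes are stubbed out with hand-assigned 2-D values; in LaRS itself the codes come from an encoder trained over rationale embeddings and the policy is a learned network over question embeddings:

```python
# Minimal sketch of the LaRS selection stage, under toy assumptions:
# the latent skill codes z and the reasoning policy are hand-crafted
# stand-ins for the learned components described in the paper.
import math

# Pretend pre-processing already encoded each bank example's rationale
# into a latent reasoning skill z.
bank = [
    {"question": "1 toucan left them. How many toucans left?",
     "z": (-1.0, 0.2)},   # subtraction-like skill
    {"question": "How many apples are in the basket?",
     "z": (0.9, -0.1)},   # addition-like skill
    {"question": "Each box holds 4 pens. How many pens in 3 boxes?",
     "z": (0.1, 1.1)},    # multiplication-like skill
]

def reasoning_policy(question):
    """Stub for the learned policy that predicts the required skill z."""
    # A real policy is a trained network over question embeddings;
    # this toy version just flags additive phrasing.
    if "joins" in question or "in all" in question:
        return (1.0, 0.0)   # addition region of the skill space
    return (-1.0, 0.0)

def select_by_skill(bank, question, k=1):
    """Select examples whose latent skill codes are nearest the predicted z."""
    z = reasoning_policy(question)
    return sorted(bank, key=lambda ex: math.dist(z, ex["z"]))[:k]

query = ("2 toucans are sitting on a tree limb. 1 more toucan joins them. "
         "How many toucans in all?")
selected = select_by_skill(bank, query)[0]
print(selected["question"])
```

Because matching happens in the skill space rather than the question space, the addition example about apples is retrieved for the toucan query despite the different surface topic.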
<details>
<summary>extracted/6556870/content/figures/TSNE.png Details</summary>

### Visual Description
## Scatter Plot Comparison: Question Embedding vs. LaRS Skill Embedding
### Overview
The image displays two side-by-side scatter plots visualizing data points in a two-dimensional space, accompanied by a comprehensive legend. The plots compare the embedding representations of "Questions" (left) and "LaRS Skills" (right). Each data point is a colored symbol representing a specific "Reasoning skill." The visualization aims to show how questions and skills cluster based on their semantic or functional similarity in the embedding space.
### Components/Axes
* **Titles:**
* Left Plot: "Question Embedding" (blue header box).
* Right Plot: "LaRS Skill Embedding" (red header box).
* **Legend (Right Panel):** Titled "Reasoning skills". It lists 13 distinct categories, each with a unique symbol and color combination.
* **Axes:** The plots are unlabeled 2D scatter plots. No axis titles, scales, or numerical markers are present. The spatial arrangement represents relative similarity in the embedding space.
* **Data Series (from Legend):**
1. **Compute statistics:** Black circle (●)
2. **Compute rate of change:** Purple downward triangle (▼)
3. **Compute money cost:** Blue 'x' (×)
4. **Filter tree leaves:** Blue circle (●)
5. **Addtion/subtraction:** Light blue downward triangle (▼) *[Note: "Addtion" is a typo in the source image]*
6. **Search minimum/maximum:** Teal 'x' (×)
7. **Multiplication:** Green circle (●)
8. **Filter table entries:** Green downward triangle (▼)
9. **Compute probability:** Light green 'x' (×)
10. **Shortage or surplus?:** Yellow circle (●)
11. **Reason time schedule:** Orange downward triangle (▼)
12. **Compare numbers:** Red 'x' (×)
13. **Others:** Red circle (●)
### Detailed Analysis
**1. Question Embedding Plot (Left):**
* **Spatial Distribution:** Shows two primary, distinct clusters.
* **Left Cluster:** A dense, vertically oriented cluster on the left side of the plot. It contains a high mixture of symbols, including blue circles ("Filter tree leaves"), light blue triangles ("Addtion/subtraction"), teal 'x's ("Search minimum/maximum"), green triangles ("Filter table entries"), and purple triangles ("Compute rate of change"). Black circles ("Compute statistics") are also present within this mix.
* **Right Cluster:** A separate, more horizontally oriented cluster on the right side. This cluster is heavily dominated by blue 'x's ("Compute money cost"). It also contains a sub-cluster of black circles ("Compute statistics") and some purple triangles ("Compute rate of change").
* **Outliers/Isolated Points:** A small, tight cluster of orange triangles ("Reason time schedule") is located at the bottom-left, separate from the main left cluster. A few red 'x's ("Compare numbers") and red circles ("Others") are scattered sparsely.
**2. LaRS Skill Embedding Plot (Right):**
* **Spatial Distribution:** Shows more dispersed and separated clusters compared to the Question plot.
* **Top-Center:** A distinct cluster of black circles ("Compute statistics").
* **Top-Left:** A cluster of teal 'x's ("Search minimum/maximum").
* **Center-Left:** A cluster of blue circles ("Filter tree leaves") mixed with some light blue triangles ("Addtion/subtraction").
* **Center:** A loose grouping containing light green 'x's ("Compute probability"), a yellow circle ("Shortage or surplus?"), and a red circle ("Others").
* **Right Side:** A large, dispersed cloud of blue 'x's ("Compute money cost").
* **Bottom-Right:** A small, tight cluster of green circles ("Multiplication").
* **Bottom-Center:** A distinct cluster of orange triangles ("Reason time schedule").
* **Left Side:** Scattered purple triangles ("Compute rate of change") and green triangles ("Filter table entries").
* **Outliers/Isolated Points:** A single red 'x' ("Compare numbers") is isolated near the center-right. A few light blue triangles ("Addtion/subtraction") are scattered near the center.
### Key Observations
1. **Clustering by Skill Type:** In both plots, points of the same color/symbol (same skill) tend to cluster together, indicating the embeddings capture skill-specific features.
2. **Increased Separation in Skill Embedding:** The "LaRS Skill Embedding" plot shows clearer separation between clusters of different skills (e.g., "Compute statistics" vs. "Compute money cost") compared to the more mixed "Question Embedding" plot.
3. **Dominant Skill in Questions:** The "Compute money cost" skill (blue 'x') forms a very prominent and separate cluster in the Question Embedding, suggesting many questions in the dataset require this specific skill.
4. **Skill Proximity:** In the Question Embedding, skills like "Filter tree leaves," "Addition/subtraction," and "Search minimum/maximum" are intermingled, suggesting questions requiring these skills share similar embedding characteristics. In the Skill Embedding, they are more distinct.
5. **Consistent Outliers:** The "Reason time schedule" (orange ▼) and "Compare numbers" (red ×) skills form small, isolated clusters in both visualizations, indicating they are distinct from the more common skill groupings.
### Interpretation
This visualization demonstrates the effectiveness of the "LaRS" model in learning distinct, separable representations for different reasoning skills. The "Question Embedding" plot likely shows how raw questions map into this skill space; the mixing indicates that a single question may involve or be similar to multiple skills, or that the question representation is less refined. The clearer clustering in the "LaRS Skill Embedding" plot suggests the model has successfully disentangled these skills into dedicated regions of the embedding space.
The prominence of the "Compute money cost" cluster in the Question plot implies a significant portion of the evaluation dataset involves financial reasoning. The proximity of computational and filtering skills in the Question plot may reflect their frequent co-occurrence in multi-step reasoning problems. The isolation of skills like "Reason time schedule" suggests they involve unique semantic or structural features not strongly shared with other skill types. Overall, the plots provide visual evidence that the LaRS model's skill embeddings are semantically meaningful and can be used to analyze the composition of reasoning tasks.
</details>
Figure 3: t-SNE projections of the question embeddings and the LaRS reasoning skill embeddings of examples from the TabMWP Lu et al. (2022) dataset. The 12 different colors correspond to 12 skill labels annotated by humans.
Under the skill model formulation, LaRS utilizes a Conditional Variational Auto-encoder (CVAE) to approximate the generation of rationales on a small dataset from the example bank. As a result, two probabilistic models can be learned concurrently: (1) a reasoning skill encoder that maps an example to the actual reasoning skills demonstrated in the rationale; and (2) a reasoning policy that predicts the reasoning skills required for a particular question. This method of learning through a CVAE, especially when applied to a small dataset from the example bank, is both cost-efficient and fast compared to Skill-KNN. Fig. 2 presents an overview of LaRS. In addition, Figure 3 shows the learned reasoning skill embedding (right) that effectively separates examples with different skill labels, while the off-the-shelf question embedding does not.
The efficacy of LaRS is evaluated on four benchmarks with five backbone LLMs of varying scales. The method is also compared with baseline approaches, including an oracle method that assumes access to ground-truth rationales. LaRS consistently outperforms Skill-KNN and matches the oracle performance in almost half of the experiments. In addition, LaRS halves the number of LLM inferences, eliminates the need for human prompt design, and maintains better robustness to sub-optimal example banks. This paper's contributions are summarized as follows:
- We propose LaRS, a novel unsupervised demonstration selection approach for CoT prompting, and empirically verify its effectiveness through large-scale experiments.
- We introduce the latent skill model, a plausible formulation for CoT reasoning that offers a deeper understanding of CoT prompting.
- We present theoretical analyses of the optimality of the latent-skill-based selection method.
<details>
<summary>extracted/6556870/content/figures/causal_graph.png Details</summary>

### Visual Description
## Diagram: Comparison of Prompting Techniques (Zero-shot vs. Chain-of-Thought)
### Overview
The image is a technical diagram illustrating three different prompting or reasoning frameworks, likely in the context of artificial intelligence or cognitive science. It visually compares the flow of information in "Zero-shot/human," "Zero-shot CoT" (Chain-of-Thought), and "Few-shot CoT" paradigms. The diagram uses colored nodes and directional arrows to represent the relationships between a query, an intermediate reasoning step, and a final response.
### Components/Axes
The diagram is divided into three vertical sections, separated by thin, vertical dotted lines.
**1. Left Section: "Zero-shot/human"**
* **Title:** "Zero-shot/human" (black text, top-left).
* **Components:**
* A blue circle labeled **"Q"** (Query/Question).
* A pink circle labeled **"z"** (likely representing an intermediate variable, latent state, or reasoning step).
* A yellow circle labeled **"R"** (Response/Answer).
* **Flow/Arrows:**
* A **dashed red arrow** points from **"Q"** to **"z"**.
* A **solid black arrow** points from **"Q"** to **"R"**.
* A **solid black arrow** points from **"z"** to **"R"**.
**2. Middle Section: "Zero-shot CoT"**
* **Title:** "Zero-shot CoT" (black text, top-center).
* **Components:**
* A blue rectangle containing the text **"(prefix, Q)"**. This represents the input prompt, which is the query `Q` preceded by a fixed prefix (e.g., "Let's think step by step.").
* A pink circle labeled **"z"**.
* A yellow circle labeled **"R"**.
* **Flow/Arrows:**
* A **solid red arrow** points from the **"(prefix, Q)"** rectangle to **"z"**.
* A **solid black arrow** points from the **"(prefix, Q)"** rectangle to **"R"**.
* A **solid black arrow** points from **"z"** to **"R"**.
**3. Right Section: "Few-shot CoT"**
* **Title:** "Few-shot CoT" (black text, top-right).
* **Components:**
* A long blue rectangle containing the text **"(Q₁, R₁, ..., Qₖ, Rₖ, Q)"**. This represents a prompt containing `k` examples of question-answer pairs, followed by the final query `Q`.
* A pink circle labeled **"z"**.
* A yellow circle labeled **"R"**.
* **Flow/Arrows:**
* A **solid red arrow** points from the **"(Q₁, R₁, ..., Qₖ, Rₖ, Q)"** rectangle to **"z"**.
* A **solid black arrow** points from the **"(Q₁, R₁, ..., Qₖ, Rₖ, Q)"** rectangle to **"R"**.
* A **solid black arrow** points from **"z"** to **"R"**.
### Detailed Analysis
* **Node Consistency:** Across all three diagrams, the core nodes are consistent: a blue input/query element (`Q` or its augmented form), a pink intermediate element (`z`), and a yellow response element (`R`).
* **Arrow Semantics:**
* The **red arrow** (dashed in the first diagram, solid in the others) consistently connects the input to the intermediate step `z`. The change from dashed to solid may imply a stronger or more explicit connection in the CoT methods.
* The **solid black arrow** from input to response (`Q -> R`) exists in all three, representing a direct path to the answer.
* The **solid black arrow** from intermediate to response (`z -> R`) also exists in all three, representing the path where the intermediate step informs the final answer.
* **Input Evolution:** The primary difference is the complexity of the blue input element:
1. **Zero-shot/human:** Simple query `Q`.
2. **Zero-shot CoT:** Query `Q` augmented with a generic prefix.
3. **Few-shot CoT:** Query `Q` augmented with `k` specific examples of question-answer pairs.
### Key Observations
1. **Structural Similarity:** The fundamental tripartite structure (Input -> Intermediate -> Response, with a direct Input -> Response link) is preserved across all three paradigms.
2. **Prompt Engineering Progression:** The diagram visually demonstrates the progression from a simple question (Zero-shot/human) to a question with a generic reasoning prompt (Zero-shot CoT), to a question embedded within a rich context of examples (Few-shot CoT).
3. **Role of 'z':** The intermediate node `z` is present in all models, suggesting that even in "zero-shot/human" reasoning, there is an implicit intermediate step. The CoT methods make this step more explicit and structured through the prompt design.
### Interpretation
This diagram is a conceptual model for how different prompting strategies in large language models (LLMs) or cognitive architectures might structure the reasoning process to arrive at an answer `R`.
* **Zero-shot/human** represents the baseline: a direct question, possibly with an implicit, unguided reasoning step (`z`), leading to an answer. The dashed red arrow may indicate this step is weak, automatic, or not explicitly prompted.
* **Zero-shot CoT** introduces a **prompt prefix** (e.g., "Think step-by-step") designed to trigger an explicit reasoning process (`z`). The solid red arrow suggests this prompt successfully activates a more deliberate intermediate step, which then contributes to the final answer.
* **Few-shot CoT** provides the richest context. By including `k` examples (`Q₁, R₁, ..., Qₖ, Rₖ`), it demonstrates the desired reasoning pattern before presenting the final query `Q`. This is hypothesized to most effectively guide the model to produce a high-quality intermediate reasoning step (`z`) and, consequently, a more accurate final response (`R`).
The overarching message is that **structuring the input (the blue element) to explicitly encourage or demonstrate an intermediate reasoning step (`z`) is a key technique for improving the final output (`R`) in complex question-answering tasks. The diagram argues for the value of Chain-of-Thought prompting, showing it as a more structured and potentially more reliable pathway from question to answer compared to a simple zero-shot approach.**
</details>
Figure 4: Causal graphs for prompting with zero-shot/human (left), zero-shot CoT (middle), and few-shot CoT (right) for generating rationales via skills. The dashed arrow from $Q$ to $z$ indicates possible sub-optimal inference of the reasoning skills from both human and zero-shot LLM generations.
## 2 Related Work
### 2.1 CoT Reasoning
CoT prompting is a prompt design technique that encourages LLMs to generate intermediate rationales that guide them toward accurate final answers. These rationales can exhibit remarkable flexibility in their styles. For instance, the original work by Wei et al. (2022b) specially designs the rationales in the in-context demonstrations to suit different reasoning tasks. Moreover, novel prompt designs that highlight diverse rationale formats have emerged to enhance CoT prompting. For example, Chen et al. (2022) proposed Program of Thoughts (PoT), which disentangles textual reasoning from computation, with the latter handled through program generation.
In contrast to manual design, our method LaRS can be thought of as automatic discovery of diverse rationale styles from an example bank. The method can also dynamically select reasoning skills based on the specific question. Notably, Chen et al. (2023) introduce Skills-in-Context (SKiC), which confines rationale generation to predefined “skills” within the prompt. Although SKiC shares a similar motivation with LaRS, we emphasize two crucial distinctions: (1) while SKiC relies on manual “skills” design, LaRS discovers them automatically; (2) SKiC presents the full list of “skills” in the prompt, letting LLMs select from them, whereas LaRS learns the skill selection from the example bank, explicitly instructing LLMs which skill to employ through in-context examples.
### 2.2 Demonstration Selection
Demonstration selection refers to a special setting, where the prompts are constructed by selecting examples from an example bank. In this context, LaRS aligns with the paradigm of unsupervised demonstration selection, which involves designing heuristics for this selection process. A variety of heuristics have been explored, including similarity Gao et al. (2021); Hu et al. (2022), diversity Zhang et al. (2022), coverage Gupta et al. (2023), and uncertainty Diao et al. (2023). Among these, Skill-KNN (An et al. (2023b)) bears the closest resemblance to our approach. However, Skill-KNN relies on pre-trained LLMs to provide “skill” annotations, which can be arbitrary and resource-intensive, requiring extensive LLM inference and human prompt design. In contrast, LaRS automatically discovers reasoning skills by learning a lightweight CVAE, represented by two-layer MLPs and trained with a standard loss function. In addition, selections based on these discovered reasoning skills are theoretically grounded in the latent skill model and the analyses presented in this paper.
## 3 Formulation
In this section, we formally describe the skill model, a new formulation for explaining the generation of rationales in CoT reasoning. In Section 3.1, the skill model is first introduced to describe the human-generated rationales. Then, Section 3.2 illustrates how the skill model can be adapted to LLM-generated rationales. Finally, leveraging the concept of reasoning skill as outlined in the skill model, a new latent-skill-based demonstration selection method is formally described in Section 3.3.
### 3.1 Skill Model
Let $\mathcal{X}$ be the set of all sequences of tokens, $\mathcal{Z}$ be the continuous vector space of latent reasoning skills, and $P_{H}$ denote the probability distribution of real-world natural language. CoT reasoning is the task of generating a rationale $R\in\mathcal{X}$ given a question $Q\in\mathcal{X}$, whose correctness can be verified by an indicator function $\mathbb{1}(R,Q):=\mathbb{1}(R\text{ is the correct rationale for }Q)$. (For math word problems, whose answers are discrete labels, a correct rationale must contain the correct answer label as its final step; for code generation, a correct rationale is the correct code.)
The skill model assumes that the real-world conditional distribution of $R$ given $Q$ can be described as follows:
$$
P_{H}(R\mid Q)=\int_{\mathcal{Z}}P_{H}(R\mid z,Q)\,P_{H}(z\mid Q)\,dz
$$
where $P_{H}(z\mid Q)$ is the posterior of selecting latent reasoning skills in human reasoning, called a reasoning policy, and $P_{H}(R\mid z,Q)$ is the posterior distribution of generating $R$ given a question $Q$ and a reasoning skill $z$. A causal graph illustrating this generation process involving a latent reasoning skill $z$ is presented in Fig. 4 (left).
Unlike Wang et al. (2023), this formulation considers a dependency of $z$ on $Q$ reflecting a preference for selecting particular reasoning skills to solve a given question. We justify this formulation as follows:
1. Rationales can exhibit remarkable flexibility, manifesting diverse formats, topics, and knowledge, which can naturally be abstracted into the high-level concepts of reasoning skills.
2. The selection of these skills is not bound by strict determinism. For instance, diverse reasoning paths and formats could all contribute toward finding the correct final answer. Therefore, real-world data is a mixture of diverse skills captured by a stochastic reasoning policy $P_{H}(z\mid Q)$.
### 3.2 CoT prompting
LLMs are pre-trained conditional generators. Given an input query $X\in\mathcal{X}$, the conditional distribution of an output $Y\in\mathcal{X}$ generated by LLMs can be written as $P_{M}(Y\mid X)$. LLMs are usually trained on a generic real-world data distribution such that $P_{M}(Y\mid X)\approx P_{H}(Y\mid X)$.
Prior studies have presented an implicit topic model formulation to explain the in-context learning mechanisms of LLMs Wang et al. (2023); Xie et al. (2021a). Similarly, we posit that LLMs can be viewed as implicit skill models for generating rationales. To elaborate, when generating rationales, LLMs’ conditional distribution $P_{M}(R\mid Q)$ can be extended as follows (illustrated in Fig. 4, left):
$$
P_{M}(R\mid Q)=\int_{\mathcal{Z}}P_{M}(R\mid z,Q)\,P_{M}(z\mid Q)\,dz
$$
This implicit skill model assumes that LLMs also infer reasoning skills $z$, resembling the real-world generation of rationales.
The above formulation only encompasses the zero-shot generation of rationales. In practice, prompts are commonly provided to guide LLMs’ generation. In general, two CoT prompting strategies exist: zero-shot CoT, whose prompt comprises a short prefix and the test question, and few-shot CoT, whose prompt contains pairs of questions and rationales. Denoting a prompt by $pt\in\mathcal{X}$, with
$$
\text{0-shot CoT: } pt=(\text{prefix},Q)\text{ or }(Q,\text{prefix}),\qquad k\text{-shot CoT: } pt=(Q_{1},R_{1},\cdots,Q_{k},R_{k},Q),
$$
a unified formulation for both prompting strategies can be derived as follows:
$$
P_{M}(R\mid pt)=\int_{\mathcal{Z}}P_{M}(R\mid z,Q)\,P_{M}(z\mid pt)\,dz
$$
Here, the formulation is simplified such that the use of prompts only influences the probability distribution of $z$ . For instance, a prefix specifying the generation’s format can be interpreted as specifying the reasoning skill $z$ by shaping the distribution from $P_{M}(z\mid Q)$ to $P_{M}(z\mid pt)$ . This simplification aligns with empirical evidence suggesting that in-context examples serve as mere pointers to retrieve already-learned knowledge within LLMs Shin et al. (2020); Min et al. (2022); Wang et al. (2022).
Drawing upon this formulation, we can gain insight into the failure of zero-shot generation. In general, real-world data is inherently noisy, indicating that the reasoning policy $P_{H}(z\mid Q)$ may be sub-optimal, and the reasoning skills are not chosen to maximize the accuracy of answering a test question. Trained on this generic real-world data distribution, $P_{M}(z\mid Q)$ could also be sub-optimal, leading to the failure of zero-shot generation. On the other hand, CoT prompting improves the reasoning performance by shaping the distribution of reasoning skills using carefully-designed prompts that contain either prefix or few-shot examples.
### 3.3 Skill-Based Demonstration Selection
The analysis above suggests that the key to the success of CoT prompting is to design an effective prompt that improves upon the posterior distribution of humans’ preference for reasoning skills, $P_{H}(z\mid Q)$. To design such a prompt, the demonstration selection problem assumes access to an example bank of question-rationale pairs, denoted as $\mathcal{D}_{E}=\{(R,Q)\}$. This example bank is usually specially crafted and has a distribution different from the real-world distribution. Denoting the distribution of the example bank by $P_{E}$, $R$ is distributed according to $P_{E}(R\mid Q)$ for all $(R,Q)\in\mathcal{D}_{E}$.
Given $\mathcal{D}_{E}$, demonstration selection selects a few question-rationale pairs from $\mathcal{D}_{E}$. Assuming that each selected demonstration is i.i.d., a demonstration selection method can be uniquely defined as a probabilistic model $g:\mathcal{X}\mapsto\Delta(\mathcal{X}\times\mathcal{X})$, written $g(Q,R\mid Q_{\text{test}})$, that maps a test question $Q_{\text{test}}$ to a probability distribution over demonstrations. We can then formally define the skill-based demonstration selection method as follows:
**Definition 1**
*Skill-based demonstration selection is given by*
$$
g_{skill}(Q,R\mid Q_{\text{test}}):=\int_{\mathcal{Z}}P_{E}(Q,R\mid z)\,P_{E}(z\mid Q_{\text{test}})\,dz
$$
Intuitively, this selection method maximizes the probability that a selected demonstration showcases the reasoning skill likely to be chosen according to $P_{E}(z\mid Q)$. Since the example bank is usually specially crafted and contains rationales showcasing “better” reasoning skills, in-context examples that align with $P_{E}(z\mid Q)$ are intuitively more effective. In Section 4.3, we provide a theoretical analysis of the optimality of this skill-based selection under certain ideal assumptions about the example bank and LLMs.
## 4 Method
To enable the skill-based demonstration selection (Definition 1), we introduce our approach LaRS , which involves learning a conditional variational autoencoder (CVAE) to approximate $P_{E}$ using the data from the example bank $\mathcal{D}_{E}$ . We then outline a practical demonstration selection process aligning with the skill-based selection. The schematic overview of LaRS (right) and the corresponding demonstration selection process (left) are illustrated in Figure 2.
### 4.1 Latent Reasoning Skill Discovery
The conditional variational autoencoder (CVAE) has emerged as a popular approach for modeling probabilistic conditional generation. As one specific case, the skill model, introduced in this paper, can effectively be represented as a CVAE. Therefore, we introduce LaRS that employs a CVAE to approximate the generation of rationales using the data from the example bank $\mathcal{D}_{E}=\{(Q,R)\}$ .
In particular, this CVAE includes three coupled models: an encoder model, a decoder model, and a reasoning policy model, independently parameterized by $\omega$ , $\psi$ , and $\phi$ respectively. Drawing from the notations introduced in the skill model, the reasoning policy model is a conditional Bayesian network $\pi_{\phi}(z\mid Q)$ , determining the posterior distribution of latent reasoning skill $z$ given a question $Q$ . The decoder model is also a conditional Bayesian network $p_{\psi}(R\mid z,Q)$ that generates a rationale $R$ , conditioned on both $Q$ and $z$ , where $z$ is sampled from $\pi_{\phi}(z\mid Q)$ . Finally, the encoder model $q_{\omega}(z\mid Q,R)$ is another conditional Bayesian network, mapping a question-rationale pair to $z$ . In this paper, we train this CVAE using classical variational expectation maximization and the reparameterization trick.
Specifically, the classical variational expectation maximization minimizes the following loss:
$$
\mathcal{L}_{\text{CVAE}}(\phi,\omega,\psi)=\mathcal{L}_{\text{recon}}+\mathcal{L}_{\text{KL}} \tag{4}
$$
$$
\mathcal{L}_{\text{recon}}=-\mathbb{E}_{(Q,R)\sim\mathcal{D}_{E},\,z\sim q_{\omega}(z\mid Q,R)}[\log p_{\psi}(R\mid z,Q)]
$$
$$
\mathcal{L}_{\text{KL}}=\mathbb{E}_{(Q,R)\sim\mathcal{D}_{E}}[\text{D}_{\text{KL}}(q_{\omega}(z\mid Q,R)\parallel\pi_{\phi}(z\mid Q))]
$$
By minimizing this loss, $q_{\omega}$ and $\pi_{\phi}$ learn to approximate the conditional distributions $P_{E}(z\mid Q,R)$ and $P_{E}(z\mid Q)$, respectively. It is worth noting that the decoder acts as an auxiliary model that only roughly reconstructs rationales for the purpose of training the encoder and the reasoning policy, and is not deployed to generate rationales in downstream tasks.
Ideally, all three models would be represented by language models, processing token sequences as input and generating token sequences as output. However, training full language models for demonstration selection can be computationally expensive. Instead, we adopt a pre-trained embedding model $f:\mathcal{X}\mapsto\Theta$, which maps the token space $\mathcal{X}$ to an embedding space $\Theta$. Consequently, the decoder, encoder, and reasoning policy models become $p_{\psi}(f(R)\mid z,f(Q))$, $q_{\omega}(z\mid f(Q,R))$, and $\pi_{\phi}(z\mid f(Q))$, respectively; they now condition on and generate embeddings instead of the original tokens. In the actual implementation, we use the same feed-forward neural network to represent both $\pi_{\phi}$ and $q_{\omega}$, predicting the mean and variance of Gaussian distributions over latent reasoning skills, while $p_{\psi}$ is a feed-forward neural network that deterministically predicts a value in the embedding space.
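To make the objective concrete, the following is a minimal, dependency-free sketch of one forward pass of the CVAE loss for a single $(Q,R)$ pair, assuming diagonal-Gaussian encoder and policy outputs and an MSE surrogate for the reconstruction term; all function and argument names are illustrative, not taken from the authors' code.

```python
import math
import random

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )
    for diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def cvae_loss(enc_mu, enc_logvar, pi_mu, pi_logvar, r_embedding, decoder, rng):
    """L_recon + L_KL for one (Q, R) pair.

    enc_*  : encoder q_w(z | f(Q, R)) Gaussian parameters
    pi_*   : reasoning-policy pi_phi(z | f(Q)) Gaussian parameters
    decoder: deterministic p_psi mapping z to a predicted rationale embedding
    """
    z = reparameterize(enc_mu, enc_logvar, rng)
    r_hat = decoder(z)
    l_recon = sum((a - b) ** 2 for a, b in zip(r_hat, r_embedding))  # MSE surrogate
    l_kl = gaussian_kl(enc_mu, enc_logvar, pi_mu, pi_logvar)
    return l_recon + l_kl
```

In practice, the gradients of this loss would be taken with respect to the parameters of the two-layer MLPs that produce `enc_*`, `pi_*`, and `decoder`.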
### 4.2 Demonstration Selection
Since the distribution $P_{E}(Q,R\mid z)$ in Definition 1 is practically intractable, we propose a selection process that effectively aligns with the skill-based selection using the learned $\pi_{\phi}$ and $q_{\omega}$. For a given test question $Q_{\text{test}}$, the desired reasoning skill $z_{\text{test}}=\operatorname*{arg\,max}_{z}[\pi_{\phi}(z\mid f(Q_{\text{test}}))]$ can be computed using the reasoning policy. Subsequently, each example in the example bank is scored by the cosine similarity between $z_{\text{test}}$ and $z_{\text{post}}$, where $z_{\text{post}}=\operatorname*{arg\,max}_{z}[q_{\omega}(z\mid f(Q,R))]$ is the maximum-likelihood skill of that example. Finally, a CoT prompt is constructed by selecting the top-$k$ examples according to the computed scores. The step-by-step procedure is outlined in Algorithm 1.
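The scoring and top-$k$ step can be sketched as follows. This is a hypothetical snippet assuming the maximum-likelihood skills are simply the Gaussian means predicted by the policy and encoder networks; names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two skill vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_top_k(z_test, bank_skills, k):
    """Rank example-bank entries by cosine similarity between z_test
    (from the reasoning policy) and each example's z_post (from the
    encoder), and return the indices of the top-k examples."""
    order = sorted(range(len(bank_skills)),
                   key=lambda i: cosine_similarity(z_test, bank_skills[i]),
                   reverse=True)
    return order[:k]
```

Since `bank_skills` can be pre-computed once for the whole example bank, each test question costs only one forward pass of the reasoning policy plus a similarity scan.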
### 4.3 Theoretical Analysis
In this section, we provide a theoretical analysis of the optimality of the skill-based selection by Definition 1.
Let $P_{M}(R\mid Q,g)$ denote the LLM’s conditional distribution of a rationale $R$ given a test question $Q$ under a demonstration selection method $g$. $P_{M}(R\mid Q,g)$ can be expanded as follows:
$$
P_{M}(R\mid Q,g)=\int_{\mathcal{X}^{k}}P_{M}(R\mid pt)\prod_{i=1}^{k}[g(Q_{i},R_{i}\mid Q)\,d(Q_{i},R_{i})]
$$
Here, each demonstration $(Q_{i},R_{i})$ is independently sampled from $g(Q_{i},R_{i}\mid Q)$ for $i=1,\cdots,k$, and these $k$ demonstrations form the prompt $pt=(Q_{1},R_{1},\cdots,Q_{k},R_{k},Q)$.
We want to show that $P_{M}(R\mid Q,g)$ is the optimal conditional distribution maximizing the accuracy of rationales when the selection follows the skill-based selection method, i.e., $g=g_{skill}$. We begin by defining the optimal conditional distribution as follows:
**Definition 2**
*The optimal conditional distribution of rationales given questions, $P^{*}(R\mid Q)$, is given by*
$$
P^{*}(R\mid Q)=\operatorname*{arg\,max}_{P(\cdot\mid Q)\in\Delta(\mathcal{X})}\int_{\mathcal{X}}\mathbb{1}(R,Q)\,P(R\mid Q)\,dR
$$
*Here $\mathbb{1}(R,Q)$ is the indicator function of the correctness of $R$ given a question $Q$ (see Section 3.1).*
Then, we state two major assumptions as follows:
**Assumption 1**
*The example bank is sampled from the optimal conditional distribution, i.e., $P_{E}(R\mid Q)=P^{*}(R\mid Q)$.*
**Assumption 2**
*Humans and LLMs are expert rationale generators given reasoning skills and questions, meaning that $P_{H}(R\mid z,Q)=P_{E}(R\mid z,Q)=P_{M}(R\mid z,Q)$ .*
Assumption 1 is rooted in the fact that example banks are human-crafted and contain the most useful rationales for answering the questions. In Assumption 2, $P_{M}$ capturing $P_{H}$ is a common assumption in the literature studying LLMs Xie et al. (2021b); Saunshi et al. (2020); Wei et al. (2021). $P_{E}(R\mid z,Q)=P_{H}(R\mid z,Q)$ is based on the assumption that reasoning skills are shared across humans, and that the generation of rationales is identical given the same reasoning skills and questions.
Based on the above definition and the two assumptions, we prove the following theorem.
**Theorem 1**
*An LLM gives the optimal conditional distribution of rationales given questions:
$$
P_{M}(R\mid Q,g_{skill})=P^{*}(R\mid Q)
$$
if (1) it is prompted with $k\rightarrow\infty$ in-context examples selected by the skill-based selection $g_{skill}$ defined in Definition 1, and (2) Assumptions 1 and 2 hold.*
Appendix E presents the proof for Theorem 1.
## 5 Experiments
This section describes the experimental settings, baselines, metrics, and main results.
### 5.1 Dataset
For benchmarking, the selection methods are evaluated on four challenging datasets: two math word problem (MWP) datasets, TabMWP and GSM8K; one text-to-SQL dataset, Spider; and one semantic parsing dataset, COGS.
Each dataset is split into a training set, used to learn the LaRS models, and a test set, used to evaluate the selection methods. While the training sets may be large, we use 1K randomly sampled training examples as the example bank, from which examples are selected for CoT prompting. Detailed descriptions of the datasets and splits are presented in Appendix B.
To measure performance, we use answer accuracy for TabMWP and GSM8K, with answers extracted by searching the text immediately following the prefix "The answer is". For Spider, we use the official execution-with-values accuracy (using the official Spider evaluation scripts at https://github.com/taoyds/test-suite-sql-eval). For COGS, we report the exact-match accuracy for semantic parsing.
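The answer-extraction step for TabMWP and GSM8K can be sketched as below; this is a simplified illustration of prefix-based extraction, not the authors' exact parsing code.

```python
def extract_answer(generation):
    """Return the text following the last 'The answer is' marker in an LLM
    generation, truncated at the first newline and stripped of a trailing
    period; return None if the marker is absent."""
    marker = "The answer is"
    idx = generation.rfind(marker)
    if idx == -1:
        return None
    tail = generation[idx + len(marker):].split("\n")[0]
    return tail.strip().rstrip(".").strip()
```

Using the last occurrence of the marker guards against the model restating the prefix inside its intermediate reasoning before the final step.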
### 5.2 Selection Methods
Our method LaRS is compared with the following four baselines. All the hyper-parameters related to these methods are listed in Appendix B.
#### Skill-KNN
This baseline represents a state-of-the-art (SOTA) skill-based selection method. It employs pre-trained LLMs to generate skill descriptions for both the questions in the example bank and the test question. The method then selects the examples whose skill descriptions most closely match that of the test question, using cosine similarity computed with a pre-trained embedding model. To examine the dependency on the LLMs’ ability to generate skill descriptions, we introduce two variations: Skill-KNN-large, which uses the larger LLM gpt-3.5-turbo, and Skill-KNN-small, which uses the smaller LLM Falcon-40B-instruct. Additionally, to evaluate the effect of the human-annotated skill descriptions used to prompt the LLMs, we introduce Skill-KNN-zero, which uses gpt-3.5-turbo to generate skill descriptions in a zero-shot fashion. Skill-KNN-zero most closely resembles the setting of LaRS, as it does not rely on human prompt design; therefore, LaRS is primarily compared with Skill-KNN-zero.
#### Random
This baseline randomly selects $k$ in-context examples from the example bank. For each test question, the accuracy is reported as an average over three independent random selections.
#### Retrieval-Q
This baseline employs a pre-trained embedding model to encode a test question, and selects in-context examples based on the cosine similarity between embeddings from examples’ questions and the test question.
#### Retrieval-R (oracle)
This baseline employs a pre-trained embedding model to encode the ground-truth rationale of a test question, and selects in-context examples based on the cosine similarity between examples’ rationales and the ground-truth rationale.
### 5.3 Backbones and Hyper-parameters
In terms of backbone models, ICL is conducted with two OpenAI language models, gpt-4o and gpt-3.5-turbo; two Anthropic models, claude-3-sonnet and claude-3-haiku; and one smaller-scale model, Falcon-40B-Instruct Xu et al. (2023). All embeddings are computed by a pre-trained embedding model, Deberta-v2-xlarge He et al. (2021). We also investigate different choices of embedding model in Appendix C.
During inference, the temperature is set to 0 (i.e., greedy decoding) to reduce the variance. The CoT prompts contain $k=2,4,4,8$ in-context examples for TabMWP, GSM8K, Spider, and COGS, respectively.
### 5.4 Performance comparison results
Table 1 summarizes the experimental results. Detailed descriptions are as follows:
| Method | TabMWP | GSM8K | Spider | COGS |
| --- | --- | --- | --- | --- |
| Backbone: gpt-3.5-turbo | | | | |
| Random | 62.4 +0.0 | 75.7 +0.0 | 46.8 +0.0 | 67.5 +0.0 |
| Retrieval-Q | 72.3 +9.9 | 75.6 –0.1 | 49.9 +3.1 | 88.5 +21.0 |
| Skill-KNN-zero | 77.7 +15.3 | 75.0 –0.7 | 49.0 +2.2 | 77.9 +10.8 |
| LaRS (ours) | 78.1 +15.7 | 76.8 +1.1 | 53.0 +6.2 | 94.8 +27.2 |
| Retrieval-R (oracle) | 77.4 +15.0 | 75.5 –0.2 | 64.4 +17.6 | 95.7 +28.2 |
| Backbone: gpt-4o | | | | |
| Random | 87.6 +0.0 | 78.1 +0.0 | 74.1 +0.0 | 73.0 +0.0 |
| Retrieval-Q | 85.9 –1.7 | 78.1 +0.0 | 75.9 +1.8 | 86.9 +16.9 |
| Skill-KNN-zero | 87.7 +0.1 | 78.6 –0.5 | 76.6 +2.5 | 78.1 +5.1 |
| LaRS (ours) | 87.9 +0.3 | 78.3 +0.2 | 77.2 +3.1 | 90.2 +17.2 |
| Retrieval-R (oracle) | 88.8 +1.2 | 77.1 –1.0 | 78.1 +4.0 | 92.8 +19.8 |
| Backbone: claude-3-sonnet | | | | |
| Random | 92.6 +0.0 | 93.3 +0.0 | 61.7 +0.0 | 79.2 +0.0 |
| Retrieval-Q | 93.1 +0.5 | 92.4 –0.9 | 61.8 +0.1 | 94.6 +15.4 |
| Skill-KNN-zero | 93.1 +0.5 | 92.1 –1.2 | 61.9 +0.2 | 86.6 +7.4 |
| LaRS (ours) | 93.7 +1.1 | 93.6 +0.3 | 62.2 +0.5 | 96.9 +17.7 |
| Retrieval-R (oracle) | 94.1 +1.5 | 92.8 –0.5 | 62.4 +0.7 | 97.6 +18.4 |
| Backbone: claude-3-haiku | | | | |
| Random | 88.6 +0.0 | 88.6 +0.0 | 60.2 +0.0 | 66.2 +0.0 |
| Retrieval-Q | 92.2 +3.6 | 88.6 +0.0 | 60.0 –0.2 | 88.5 +22.3 |
| Skill-KNN-zero | 93.3 +4.7 | 88.8 +0.2 | 61.0 +0.8 | 79.7 +13.5 |
| LaRS (ours) | 93.3 +4.7 | 87.6 –1.0 | 61.3 +1.1 | 89.9 +23.7 |
| Retrieval-R (oracle) | 92.4 +3.8 | 88.9 +0.3 | 61.2 +1.0 | 96.5 +30.3 |
| Backbone: Falcon-40B-Instruct | | | | |
| Random | 45.7 +0.0 | 38.8 +0.0 | 20.6 +0.0 | 45.1 +0.0 |
| Retrieval-Q | 51.9 +6.2 | 37.3 –1.5 | 22.1 +1.5 | 73.9 +28.8 |
| Skill-KNN-small | 51.4 +5.7 | 36.5 –2.3 | 20.3 –0.3 | 59.4 +14.3 |
| Skill-KNN-zero | 55.2 +9.5 | 38.7 –0.1 | 23.3 +2.7 | 82.1 +37.0 |
| LaRS (ours) | 57.7 +12.0 | 39.1 +0.3 | 24.8 +4.2 | 89.5 +44.4 |
| Retrieval-R (oracle) | 61.2 +15.5 | 40.4 +1.6 | 39.9 +19.3 | 90.3 +45.2 |
Table 1: Main results (%) across all backbone models and datasets. Numbers in bold represent the best results for each backbone model across all selection methods. The subscripted gray values indicate the relative improvement over Random selection.
#### LaRS matches SOTA skill-based selection methods with superior computational efficiency.
As shown in Table 1, across all four benchmarks and five backbone models tested, LaRS outperforms Skill-KNN-zero in 18 out of 20 experiments. This result highlights the effectiveness of the latent reasoning skills learned through unsupervised training of small CVAE models, which achieve performance comparable to skill descriptions crafted by extensively pre-trained LLMs. Notably, Skill-KNN-zero uses the powerful gpt-3.5-turbo for skill generation. In scenarios where only less capable LLMs are available, such as offline settings that require local inference, Skill-KNN-small, which uses the less capable Falcon-40B-Instruct, suffers significant performance drops across all four benchmarks. In contrast, LaRS does not require a powerful LLM and, compared to Skill-KNN-zero, achieves similar performance boosts even for smaller backbone models such as Falcon-40B-Instruct.
Furthermore, Table 3 compares the computational overhead of Retrieval-Q, LaRS, Skill-KNN-zero, and a supervised selection method, PromptPG Lu et al. (2022), including the computing time and estimated cost for pre-processing the example bank, as well as the cost per input query during selection. Our method achieves accuracy comparable to Skill-KNN-zero while requiring no LLM inference (saving approximately $30 per 1k examples) and 1.5 fewer hours of computing time per 1k examples during pre-processing, at less than half the cost per input query. Detailed experimental settings for estimating these costs can be found in Appendix B.
#### LaRS is more robust to sub-optimal example banks.
Skill-KNN selects examples based solely on the questions: it selects examples whose questions require the same skills as the given question. However, sub-optimal example banks may contain examples with incorrect or sub-optimal rationales, which should be avoided. In contrast, LaRS considers both questions and rationales when computing the reasoning skill embedding, which enhances its robustness to such sub-optimality. Table 2 reports the answer accuracy of Skill-KNN-zero and LaRS on the TabMWP and COGS benchmarks with sub-optimal example banks, in which 10%, 20%, and 30% of the rationales are replaced by random rationales from the same example bank. At the 30% replacement rate, Skill-KNN-zero suffers performance drops of 3% and 11.7%, while LaRS drops only 0.1% and 1.9% under the same conditions.
| Replace Rate (%) | 0 | 10 | 20 | 30 |
| --- | --- | --- | --- | --- |
| TabMWP: Skill-KNN-zero | 77.7 | 77.0 –0.9% | 76.2 –1.9% | 75.4 –3.0% |
| TabMWP: LaRS | 78.1 | 78.1 –0.0% | 78.0 –0.1% | 77.9 –0.1% |
| COGS: Skill-KNN-zero | 77.9 | 75.8 –2.7% | 73.8 –5.3% | 68.8 –11.7% |
| COGS: LaRS | 94.8 | 94.7 –0.1% | 93.3 –1.6% | 93.0 –1.9% |
Table 2: Answer accuracy (%) of Skill-KNN-zero and LaRS on TabMWP and COGS benchmark with 0%, 10%, 20%, and 30% of the rationales in the example bank being replaced with random rationales. The subscripted gray values indicate the percentage drop relative to optimal example banks.
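For reference, sub-optimal example banks of the kind used above can be produced with a simple corruption routine. This is a sketch under the assumption that replaced rationales are drawn uniformly from the same bank; the paper's exact sampling procedure may differ:

```python
import random

def corrupt_example_bank(bank, replace_rate, seed=0):
    """Replace a fraction of rationales with random rationales
    drawn from the same example bank, keeping questions intact."""
    rng = random.Random(seed)
    bank = list(bank)  # list of (question, rationale) pairs
    n_replace = int(len(bank) * replace_rate)
    rationales = [r for _, r in bank]
    for i in rng.sample(range(len(bank)), n_replace):
        question, _ = bank[i]
        bank[i] = (question, rng.choice(rationales))
    return bank

bank = [(f"q{i}", f"r{i}") for i in range(10)]
corrupted = corrupt_example_bank(bank, replace_rate=0.3)
```

At a 30% replacement rate, roughly a third of the examples end up with a rationale that no longer matches its question.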
| Method | Accuracy (%) $\uparrow$ | Time (h/1k) $\downarrow$ | Pre-processing Cost ($/1k) $\downarrow$ | Selection Cost per query ($) $\downarrow$ |
| --- | --- | --- | --- | --- |
| LaRS (ours) | 78.1 | 0.5 +0% | $0 | $0.02 +0% |
| Skill-KNN-zero | 77.7 | 2 +300% | $30 | $0.05 +150% |
| PromptPG | 74.2 | 6 +1100% | $50 | $0.02 +0% |
| Retrieval-Q | 72.3 | 0 –100% | $0 | $0.02 +0% |
Table 3: Comparison of accuracy and computational overhead, including computing time, estimated cost for pre-processing an example bank of 1k examples, and average cost per input query during selection, among four selection methods on the TabMWP dataset. The gray percentages represent the cost increase of each selection method relative to LaRS.
## 6 Conclusions
This paper introduces LaRS, a novel demonstration selection method designed for CoT prompting. LaRS bases its selection on reasoning skills, latent representations discovered by unsupervised learning from rationales via a CVAE. In experiments across five LLMs and four different reasoning tasks, LaRS selects effective few-shot examples for CoT reasoning with comparable or better performance while requiring no extra LLM inference and saving hours in pre-processing the example bank.
## 7 Limitations
Despite the success of LaRS, a few limitations and potential future directions are worth noting. First, the impact of the order of examples in the prompt is not considered; introducing additional heuristics to sort the examples could lead to better performance. Second, in the CVAE, the decoder is represented by an MLP; it would be preferable to represent the decoder as a prompt-tuning module, which aligns better with the implicit skill-model assumption. Finally, a single reasoning skill might not suffice to represent an entire rationale that contains multiple reasoning steps; learning and selecting a reasoning skill for each individual reasoning step is an interesting direction to explore.
## References
- An et al. (2023a) Shengnan An, Zeqi Lin, Qiang Fu, B. Chen, Nanning Zheng, Jian-Guang Lou, and D. Zhang. 2023a. How do in-context examples affect compositional generalization? ArXiv, abs/2305.04835.
- An et al. (2023b) Shengnan An, Bo Zhou, Zeqi Lin, Qiang Fu, Bei Chen, Nanning Zheng, Weizhu Chen, and Jian-Guang Lou. 2023b. Skill-based few-shot selection for in-context learning. arXiv preprint arXiv:2305.14210.
- Bommasani et al. (2021) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel J. Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher R’e, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the opportunities and risks of foundation models. ArXiv, abs/2108.07258.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.
- Chen et al. (2023) Jiaao Chen, Xiaoman Pan, Dian Yu, Kaiqiang Song, Xiaoyang Wang, Dong Yu, and Jianshu Chen. 2023. Skills-in-context prompting: Unlocking compositionality in large language models. arXiv preprint arXiv:2308.00304.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. ArXiv, abs/2204.02311.
- Cobbe et al. (2021a) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021a. Training verifiers to solve math word problems. ArXiv, abs/2110.14168.
- Cobbe et al. (2021b) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021b. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.
- Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. ArXiv, abs/2302.12246.
- Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
- Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. ArXiv, abs/2012.15723.
- Gupta et al. (2023) Shivanshu Gupta, Sameer Singh, and Matt Gardner. 2023. Coverage-based example selection for in-context learning. ArXiv, abs/2305.14907.
- He et al. (2021) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
- Hu et al. (2022) Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu, Noah A. Smith, and Mari Ostendorf. 2022. In-context learning for few-shot dialogue state tracking. ArXiv, abs/2203.08568.
- Kim and Linzen (2020) Najoung Kim and Tal Linzen. 2020. Cogs: A compositional generalization challenge based on semantic interpretation. ArXiv, abs/2010.05465.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
- Liu et al. (2021) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for gpt-3? In Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out.
- Lu et al. (2022) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and A. Kalyan. 2022. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. ArXiv, abs/2209.14610.
- Lu et al. (2021) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. ArXiv, abs/2104.08786.
- Min et al. (2022) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.
- Neelakantan et al. (2022) Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
- Nye et al. (2021) Maxwell Nye, Anders Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models. ArXiv, abs/2112.00114.
- Rae et al. (2021) Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John F. J. Mellor, Irina Higgins, Antonia Creswell, Nathan McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, L. Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, N. K. Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Tobias Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew G. Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem W. Ayoub, Jeff Stanway, L. L. Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & insights from training gopher. ArXiv, abs/2112.11446.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
- Saunshi et al. (2020) Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. 2020. A mathematical exploration of why language models help solve downstream tasks. ArXiv, abs/2010.03648.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Eliciting knowledge from language models using automatically generated prompts. In Conference on Empirical Methods in Natural Language Processing.
- Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed Huai hsin Chi, Denny Zhou, and Jason Wei. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. In Annual Meeting of the Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Wang et al. (2022) Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2022. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001.
- Wang et al. (2023) Xinyi Wang, Wanrong Zhu, and William Yang Wang. 2023. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. arXiv preprint arXiv:2301.11916.
- Wei et al. (2021) Colin Wei, Sang Michael Xie, and Tengyu Ma. 2021. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. ArXiv, abs/2106.09226.
- Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed Huai hsin Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, F. Xia, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903.
- Xie et al. (2021a) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021a. An explanation of in-context learning as implicit bayesian inference. ArXiv, abs/2111.02080.
- Xie et al. (2021b) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021b. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080.
- Xu et al. (2023) Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196.
- Yu et al. (2018) Tao Yu, Rui Zhang, Kai-Chou Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Z Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. ArXiv, abs/1809.08887.
- Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alexander J. Smola. 2022. Automatic chain of thought prompting in large language models. ArXiv, abs/2210.03493.
## Appendix A LaRS Demonstration Selection
A practical demonstration selection process for LaRS, which tackles the difficulty of sampling from the unknown distribution $P_{E}(Q,R\mid z)$, is described as follows. To begin with, LaRS learns a reasoning policy $\pi_{\phi}$ and a reasoning skill encoder $q_{\omega}$. For a given test question $Q_{\text{test}}$, the desired reasoning skill $z_{\text{test}}=\operatorname*{arg\,max}_{z}[\pi_{\phi}(z\mid f(Q_{\text{test}}))]$ is computed by the reasoning policy. Subsequently, each example in the example bank is scored by the cosine similarity between $z_{\text{test}}$ and $z_{\text{post}}$, where $z_{\text{post}}=\operatorname*{arg\,max}_{z}[q_{\omega}(z\mid f(Q,R))]$ is the maximum-likelihood skill of that example. Finally, a CoT prompt is constructed by selecting the top-$k$ examples according to the computed scores. The step-by-step procedure is outlined in Algorithm 1.
Algorithm 1 Demonstration selection
Input: Test question $Q_{\text{test}}$ , a pre-trained embedding model $f$ , a reasoning policy $\pi_{\phi}(z|f(Q))$ , a reasoning skill encoder $q_{\omega}(z|f(Q,R))$ , and an example bank $\mathcal{D}_{E}=\{(Q^{j},R^{j})\}_{j}$ . Parameter: shot number $k$ Output: $(Q_{1},R_{1},Q_{2},R_{2},\cdots,Q_{k},R_{k})$
1: Compute $z_{\text{test}}\leftarrow$ mean of $\pi_{\phi}(z|f(Q_{\text{test}}))$
2: for each $(Q^{j},R^{j})$ in $\mathcal{D}_{E}$ do
3: Compute $z^{j}_{\text{post}}\leftarrow$ mean of $q_{\omega}(z|f(Q^{j},R^{j}))$
4: Compute $r^{j}=\frac{z_{\text{test}}\cdot{z^{j}_{\text{post}}}^{\intercal}}{|z_{\text{ test}}|\cdot|z^{j}_{\text{post}}|}$
5: end for
6: Select top- $k$ demonstrations with the largest $r^{j}$ and sort them in ascending order, denoted as $(Q_{1},R_{1},Q_{2},R_{2},\cdots,Q_{k},R_{k})$ .
7: return $(Q_{1},R_{1},Q_{2},R_{2},\cdots,Q_{k},R_{k})$
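Algorithm 1 can be sketched in a few lines, assuming the mean skill vectors have already been produced by the trained reasoning policy and skill encoder (the function and variable names below are illustrative, not from the paper's code):

```python
import numpy as np

def select_demonstrations(z_test, z_post, k):
    """Score each example-bank entry by cosine similarity between the
    test question's skill z_test and the example's posterior skill,
    then return the indices of the top-k examples (Algorithm 1).

    z_test: (d,) mean skill of the test question from the policy.
    z_post: (n, d) mean skills of the n bank examples from the encoder.
    """
    sims = (z_post @ z_test) / (
        np.linalg.norm(z_post, axis=1) * np.linalg.norm(z_test) + 1e-12
    )
    top_k = np.argsort(-sims)[:k]  # indices with the largest similarity
    return top_k[::-1]             # ascending score order, as in step 6

rng = np.random.default_rng(0)
z_test = rng.normal(size=128)        # 128-dim latent skill space
z_post = rng.normal(size=(50, 128))  # 50 bank examples
idx = select_demonstrations(z_test, z_post, k=4)
```

The returned indices identify the $(Q_j, R_j)$ pairs to concatenate into the CoT prompt, with the highest-scoring example placed last.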
## Appendix B Experimental Details
### B.1 Dataset
We provide detailed descriptions of the datasets and their train/test splits as follows:
#### TabMWP
Lu et al. (2022) This dataset consists of semi-structured mathematical reasoning problems, comprising 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. We use the train set, containing 23,059 examples, to train our LaRS models, and test1k set containing 1K examples to evaluate the selection methods.
#### Spider
Yu et al. (2018) Spider is a large-scale text-to-SQL dataset. It includes a train set with 7,000 examples and a dev set with 1,034 examples. We use the train set to train our LaRS models, and the dev set as the test set to evaluate the selection methods.
#### COGS
Kim and Linzen (2020) is a synthetic benchmark for testing compositional generalization in semantic parsing. We transform the output format in the same way as An et al. (2023a), and consider a mixture of two sub-tasks: primitive substitution (P.S.) and primitive structural alternation (P.A.). This results in a train set of 6916 examples to train our LaRS models and a test set of 1000 examples to evaluate the selection method.
#### GSM8K
Cobbe et al. (2021b) GSM8K is a dataset containing 8.5K high-quality, linguistically diverse grade school math word problems. It includes a train set of 7.5K problems and a test set of 1319 problems. We use the train set to train our LaRS models, and the test set to evaluate the selection methods.
### B.2 LaRS Implementation Details
LaRS contains an encoder, a decoder, and a reasoning policy model. The reasoning skill is represented as a 128-dimensional continuous vector. Both the encoder and the reasoning policy model are feed-forward multilayer perceptrons (MLPs) with two 256-unit hidden layers, predicting the mean and variance of a multivariate Gaussian distribution over the latent space of reasoning skills. The decoder is an MLP with two 256-unit hidden layers that deterministically predicts a value in the embedding space. The dimension of the embedding space depends on the choice of pre-trained embedding model. The models are trained using the loss function in Equation 4 with a batch size of 256 and a learning rate of 0.0001 for 1000 epochs on a machine with 48 CPU cores and an Nvidia A40 GPU. These hyperparameters apply to all four datasets.
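The architecture above can be sketched as a single forward pass plus an ELBO-style loss. This is a minimal numpy sketch under stated assumptions: the embedding dimension (1536 here), the conditioning of the decoder on the question embedding, and the unweighted sum of reconstruction and KL terms are assumptions, not the exact form of Equation 4:

```python
import numpy as np

LATENT_DIM, HIDDEN = 128, 256   # 128-dim skill space, two 256-unit hidden layers
EMBED_DIM = 1536                # assumption: depends on the embedding model

def mlp_params(rng, in_dim, out_dim):
    """Random parameters for an MLP with two 256-unit hidden layers."""
    dims = [in_dim, HIDDEN, HIDDEN, out_dim]
    return [(rng.normal(scale=0.02, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x

rng = np.random.default_rng(0)
# Encoder q_w(z | f(Q, R)): outputs mean and log-variance of the skill Gaussian.
encoder = mlp_params(rng, EMBED_DIM, 2 * LATENT_DIM)
# Decoder: maps (z, f(Q)) to a point in the embedding space.
# Conditioning on f(Q) is an assumption of this sketch.
decoder = mlp_params(rng, LATENT_DIM + EMBED_DIM, EMBED_DIM)

def cvae_loss(qr_embed, q_embed, r_embed):
    stats = mlp_forward(encoder, qr_embed)
    mu, logvar = stats[:LATENT_DIM], stats[LATENT_DIM:]
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)  # reparameterization
    recon = mlp_forward(decoder, np.concatenate([z, q_embed]))
    recon_loss = np.mean((recon - r_embed) ** 2)                  # reconstruct f(R)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)      # KL to N(0, I)
    return recon_loss + kl

loss = cvae_loss(rng.normal(size=EMBED_DIM),
                 rng.normal(size=EMBED_DIM),
                 rng.normal(size=EMBED_DIM))
```

The reasoning policy has the same MLP shape as the encoder but takes only $f(Q)$ as input.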
### B.3 Skill-KNN Implementation Details
We used the same skill annotations as the original Skill-KNN implementation for the COGS and Spider datasets. For TabMWP and GSM8K, we manually created skill annotations for 8 questions per dataset. The new skill annotations are shown in Tables 4 and 5.
| # | Table | Question | Skill description |
| --- | --- | --- | --- |
| 1 | Name / Score: Jackson 32; Madelyn 31; Gary 36; Suzie 33; Edgar 31; Ben 32; Felipe 29 | Some friends played miniature golf and wrote down their scores. What is the range of the numbers? | To solve this problem, we need to find the greatest number and the least number. Then, subtract the least number from the greatest number. |
| 2 | x / y: 17, 13; 18, 6; 19, 2 | The table shows a function. Is the function linear or nonlinear? | To solve this problem, we need to compare the rate of change between any two rows of the table. |
| 3 | box of tissues $0.90; hand lotion $0.94; tube of toothpaste $0.84; package of dental floss $0.85; box of bandages $0.87; bottle of nail polish $0.99 | Sophie has $1.50. Does she have enough to buy a box of tissues and a package of dental floss? | To solve this problem, we need to compute the total cost and compare it with the budget. |
| 4 | Day / Number of fan letters: Monday 3,985; Tuesday 1,207; Wednesday 6,479; Thursday 2,715; Friday 8,078 | An actor was informed how many fan letters he received each day. How many more fan letters were received on Friday than on Tuesday? | To solve the problem, we need to locate the two values in the table and do subtraction. |
| 5 | Stem / Leaf: 3: 1, 5, 7, 8; 4: 0, 3, 5, 5, 8; 5: 2, 4, 5, 7, 9; 6: 4, 5, 6; 7: 1, 1, 7, 8; 8: 9; 9: 0 | Daniel counted the number of silver beads on each bracelet at Lowell Jewelry, the store where he works. What is the largest number of silver beads? | To solve this problem, we need to locate the largest number from a stem-and-leaf plot. |
| 6 | Number of tanks / Number of tadpoles: 1, 10; 2, 20; 3, 30; 4, 40; 5, ? | Each tank has 10 tadpoles. How many tadpoles are in 5 tanks? | To solve this problem, we need to complete the table according to the tendency of the columns. |
| 7 | Blue sticker / Green sticker: Front door of the house 2, 4; Back door of the house 3, 3 | Lester keeps all his spare keys in a box under his bed. Recently, Lester decided the box was becoming unmanageable, as none of the keys were labeled. He set about labeling them with colored stickers that indicated what each key opened. What is the probability that a randomly selected key opens the front door of the house and is labeled with a green sticker? Simplify any fractions. | To solve this problem, we need to find the number of outcomes in the event and the total number of outcomes. Then compute the probability. |
| 8 | Train schedule: Sparrowtown 8:00 A.M., 2:00 P.M., 4:45 P.M.; Danville 9:15 A.M., 3:15 P.M., 6:00 P.M.; Princeton 10:30 A.M., 4:30 P.M., 7:15 P.M.; Westminster 11:45 A.M., 5:45 P.M., 8:30 P.M.; Oakdale 1:30 P.M., 7:30 P.M., 10:15 P.M. | Look at the following schedule. Lee just missed the 4:30 P.M. train at Princeton. What time is the next train? | To solve this problem, we need to locate the entry from the table and read the next entry. |
Table 4: Skill description annotation for TabMWP dataset.
| # | Question | Skill description |
| --- | --- | --- |
| 1 | Angela slept 6.5 hours every night in December. She decided she should get more sleep and began sleeping 8.5 hours a night in January. How much more sleep did Angela get in January? | To solve this question, we need to do subtraction, infer the total number of days in a month, and do multiplication. |
| 2 | Edith is a receptionist at a local office and is organizing files into cabinets. She had 60 files and finished organizing half of them this morning. She has another 15 files to organize in the afternoon and the rest of the files are missing. How many files are missing? | To solve this question, we need to do division, addition, and subtraction. |
| 3 | Rosalina receives gifts from three people on her wedding day. How many gifts did she get if Emilio gave 11 gifts, Jorge gave 6 gifts, and Pedro gave 4 gifts? | To solve this question, we need to do addition. |
| 4 | A store puts out a product sample every Saturday. The last Saturday, the sample product came in boxes of 20. If they had to open 12 boxes, and they had five samples left over at the end of the day, how many customers tried a sample if the samples were limited to one per person? | To solve this question, we need to do multiplication and subtraction. |
| 5 | Billy is counting the rings in two trees. Weather fluctuations in this area mean that each tree’s rings are in groups of two fat rings and four thin rings. If Billy counts 70 ring groups in the first tree and 40 ring groups in the second tree, how much older is the first tree? (Trees grow 1 ring per year.) | To solve this question, we need to do addition, subtraction, and multiplication. |
| 6 | A group of six friends planned to buy a car. The cost of the car is $1700 and they plan to share the cost equally. They had a car wash to help raise funds, which would be taken out of the total cost. The remaining cost would be split between the six friends. At the car wash, they earn $500. However, Brad decided not to join in the purchase of the car. How much more does each friend have to pay now that Brad isn’t participating? | To solve this question, we need to do subtraction, division, and multiplication. |
| 7 | In Fifi’s closet, she hangs all of her clothes on colored plastic hangers. She has clothes hanging on 7 pink hangers, 4 green hangers, one less blue hanger than there are green hangers, and one less yellow hanger than there are blue hangers. What is the total number of colored hangers in Fifi’s closet? | To solve this question, we need to do subtraction and addition. |
| 8 | At the family reunion, everyone ate too much food and gained weight. Orlando gained 5 pounds. Jose gained two pounds more than twice what Orlando gained. Fernando gained 3 pounds less than half of what Jose gained. How much weight, in pounds, did the three family members gain at their reunion? | To solve this question, we need to do multiplication, addition, and subtraction. |
Table 5: Skill description annotation for GSM8K dataset.
For Skill-KNN-zero, which generates skill descriptions zero-shot, the prompts used for the four datasets are shown in Table 6.
| TabMWP | Describe the required skills to solve the following problems based on the data from the tables in one sentence |
| --- | --- |
| GSM8K | Describe the required skills to solve the following questions in one sentence |
| Spider | Describe the needed skills to solve the task on the database schema in one sentence. |
| COGS | Describe the required skills to parse the following sentences in one sentence. |
Table 6: Prompts for zero-shot skill generation.
## Appendix C Analysis and Ablation
This section provides an in-depth analysis and explains the reasons behind the success of LaRS.
#### Why are reasoning skills better guidance for demonstration selection?
<details>
<summary>x1.png Details</summary>

A 2x2 grid of 2-D scatter plots with a shared legend. Panel titles: "Reasoning skill of (Q, R)" (top-left), "Reasoning skill of Q" (top-right), "Raw question embedding" (bottom-left), "Raw rationale embedding" (bottom-right). The legend ("Reasoning skills") maps marker colors and shapes to the human-labeled skills, e.g., "Compute statistics", "Compute rate of change", "Compute money cost", "Search minimum/maximum", "Multiplication", "Reason time schedule", "Compute probability", "Compare numbers", and "Others". The two skill panels show distinct, well-separated per-skill clusters, most clearly for (Q, R); the raw question and rationale embedding panels mix skills within a few large clusters.
</details>
Figure 5: t-SNE projections of reasoning skills predicted from $(Q,R)$ (top-left), reasoning skills predicted from $Q$ (top-right), raw question embeddings (bottom-left), and raw rationale embeddings (bottom-right). The 12 different colors correspond to 12 skill labels provided by humans.
In the TabMWP dataset, 200 examples are labeled with one of 12 manually crafted skill labels based on the skills they showcase, including "compute statistics", "compute rate of change", "reason time schedule", and "compute probability". We investigate how well the reasoning skills discovered by LaRS in an unsupervised manner align with humans' understanding of skills. More specifically, Fig. 5 visualizes how the human-labeled skills distribute over t-SNE projections of four different types of embeddings. Both the reasoning skill encoder (reasoning skill of $(Q,R)$) and the reasoning policy (reasoning skill of $Q$) trained by LaRS clearly separate the 12 labeled skills. Meanwhile, the human-labeled skills are not well separated by the raw question embeddings, or even the raw rationale embeddings. This indicates that the discovered reasoning skills align well with human-labeled skills even though no explicit labels are provided during training, and it sheds light on why selecting demonstrations with similar reasoning skills can improve CoT prompting.
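A visualization of this kind can be produced in outline with off-the-shelf t-SNE. The sketch below is illustrative only: the random stand-in embeddings and the `project_skills` helper are hypothetical, not the paper's code.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_skills(embeddings: np.ndarray, perplexity: float = 5.0) -> np.ndarray:
    """Project high-dimensional (skill) embeddings to 2-D for scatter plotting."""
    return TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0).fit_transform(embeddings)

# Toy stand-in for one of the four embedding types compared in Fig. 5.
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 32)).astype(np.float32)
proj = project_skills(emb)
print(proj.shape)  # (60, 2)
```

Each of the four panels would apply the same projection to a different embedding (skill of $(Q,R)$, skill of $Q$, raw question, raw rationale) and color points by the human skill labels.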
<details>
<summary>extracted/6556870/content/figures/ablation_embedding_model.png Details</summary>

Grouped bar chart. X-axis: "Embedding Models" (Sentence-BERT, Deberta-v2-xlarge, text-embedding-ada-02); y-axis: "Accuracy (%)", 55 to 85. Per group, Random stays near 60%, Retrieval-Q reaches roughly 75%, 72%, and 81%, and LaRS roughly 77%, 78%, and 84%. LaRS is the best method in every group, peaking with text-embedding-ada-02.
</details>
(a) The accuracy of Random, Retrieval-Q, and LaRS based on three different pre-trained embedding models.
<details>
<summary>extracted/6556870/content/figures/ablation_num_example.png Details</summary>

Grouped bar chart. X-axis: "Number of in-context examples" (2, 4, 8); y-axis: "Accuracy (%)". Random rises from roughly 60% to 75%, Retrieval-Q from roughly 75% to 85%, and LaRS from roughly 77% to 86% as the number of examples grows. LaRS leads at every setting, with gains largely plateauing between 4 and 8 examples.
</details>
(b) The accuracy of Random, Retrieval-Q, and LaRS using different numbers of in-context examples.
Figure 6: Performances of three different selection methods under (a) different pre-trained embedding models, and (b) different numbers of in-context examples.
#### Robustness to different pre-trained embedding models.
Fig. 6(a) compares the performances of Random, Retrieval-Q, and LaRS based on three pre-trained embedding models: Sentence-BERT Reimers and Gurevych (2019), Deberta-v2-xlarge, and text-embedding-ada-02 Neelakantan et al. (2022) from OpenAI. We observe that the performances of the retrieval-based selection methods improve monotonically with more capable pre-trained embedding models. Moreover, LaRS shows consistent improvements over Retrieval-Q given the same embedding model.
#### Robustness to $k$ : the number of in-context examples.
This study compares the three selection methods Random, Retrieval-Q, and LaRS under three different numbers of in-context examples: 2, 4, and 8. The results are summarized in Fig. 6(b). While accuracy improves monotonically with the number of in-context examples, LaRS consistently outperforms Retrieval-Q.
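Skill-aligned selection itself reduces to a nearest-neighbor lookup in the latent skill space. The sketch below is a minimal stand-in, not the paper's implementation: `select_examples`, `q_skill`, and `bank_skills` are hypothetical names, and the random vectors stand in for the reasoning policy's predicted skill and the bank's encoded skills.

```python
import numpy as np

def select_examples(q_skill: np.ndarray, bank_skills: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k bank examples whose reasoning-skill
    embeddings have the highest cosine similarity to the question's skill."""
    q = q_skill / np.linalg.norm(q_skill)
    b = bank_skills / np.linalg.norm(bank_skills, axis=1, keepdims=True)
    return np.argsort(-(b @ q))[:k]

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 8))  # stand-in skill embeddings of the example bank
query = rng.normal(size=8)        # stand-in skill predicted by the reasoning policy
idx = select_examples(query, bank, k=4)
print(len(idx))  # 4
```

The selected indices identify the $(Q_i, R_i)$ pairs placed in the prompt; varying `k` reproduces the 2/4/8-example settings of Fig. 6(b).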
#### How does Skill-KNN perform under stricter conditions?
| Method | TabMWP | GSM8K | Spider | COGS |
| --- | --- | --- | --- | --- |
| Backbone: gpt-3.5-turbo | | | | |
| Skill-KNN-large | 78.3 +15.9 | 75.0 –0.7 | 58.4 +11.6 | 94.6 +27.2 |
| Skill-KNN-small | 75.5 +13.2 | 74.9 –0.8 | 37.3 –9.5 | 79.9 +12.7 |
| Skill-KNN-zero | 77.7 +15.3 | 75.0 –0.7 | 49.0 +2.2 | 77.9 +10.8 |
| LaRS (ours) | 78.1 +15.7 | 76.8 +1.1 | 53.0 +6.2 | 94.8 +27.2 |
| Backbone: gpt-4o | | | | |
| Skill-KNN-large | 80.6 +11.3 | 62.0 –0.2 | 56.3 +9.8 | 96.8 +23.4 |
| Skill-KNN-small | 77.4 +8.1 | 62.3 +0.1 | 47.4 +0.3 | 79.4 +6.0 |
| Skill-KNN-zero | 87.7 +0.1 | 78.6 –0.5 | 76.6 +2.5 | 78.1 +5.1 |
| LaRS (ours) | 87.9 +0.3 | 78.3 +0.2 | 77.2 +3.1 | 90.2 +17.2 |
| Backbone: claude-3-sonnet | | | | |
| Skill-KNN-large | | 93.2 –0.1 | 25.9 +7.6 | 96.2 +17.0 |
| Skill-KNN-small | | 92.3 –1.0 | 18.2 –0.1 | 86.6 +7.4 |
| Skill-KNN-zero | 93.1 +0.5 | 92.1 –1.2 | 61.9 +0.2 | 86.6 +7.4 |
| LaRS (ours) | 93.7 +1.1 | 93.6 +0.3 | 62.2 +0.5 | 96.9 +17.7 |
| Backbone: claude-3-haiku | | | | |
| Skill-KNN-zero | 93.3 +4.7 | 88.8 +0.2 | 61.0 +0.8 | 79.7 +13.5 |
| LaRS (ours) | 93.3 +4.7 | 87.6 –1.0 | 61.3 +1.1 | 89.9 +23.7 |
| Backbone: Falcon-40B-Instruct | | | | |
| Skill-KNN-large | 55.9 +10.2 | 40.3 +1.5 | 23.7 +2.9 | 81.0 +35.9 |
| Skill-KNN-small | 51.4 +5.7 | 36.5 –2.3 | 20.3 –0.3 | 59.4 +14.3 |
| Skill-KNN-zero | 55.2 +9.5 | 38.7 –0.1 | 23.3 +2.7 | 82.1 +37.0 |
| LaRS (ours) | 57.7 +12.0 | 39.1 +0.3 | 24.8 +4.2 | 89.5 +44.4 |
Table 7: Skill-KNN-large, Skill-KNN-small, and Skill-KNN-zero compared with LaRS.
## Appendix D Case Study
To explore the examples categorized as distinct skills within the learned latent reasoning skill representation, we employed K-means clustering on the latent reasoning skills of 1,000 examples from the TabMWP dataset. The examples closest to the centroids of these clusters are detailed in Table 8. The analysis presented in this table reveals that our method effectively discerns examples showcasing specific skills, such as "Search minimum/maximum" and "Compute rate of change".
| 0 | [TITLE]: School play committees Committee | Boys | Girls Casting | 17 | 5 Set design | 14 | 17 Lighting | 20 | 20 Costume | 7 | 4 Music | 2 | 13 | Some students at Dayton Middle School signed up to help out with the school play. Which committee has the most boys? Options: (A) set design (B) lighting (C) casting (D) costume | Search minimum/maximum |
| --- | --- | --- | --- |
| 1 | [TITLE]: Pairs of shoes per store Stem | Leaf 1 | 9 2 | 3, 3 3 | 0, 2 4 | 2, 4 5 | 5, 7 6 | 2, 5 7 | 7 8 | 0, 2, 4, 4 9 | 0, 0 | Ivan counted the number of pairs of shoes for sale at each of the shoe stores in the mall. How many stores have exactly 23 pairs of shoes? | Search tree leaves |
| 2 | [TITLE]: None piece of licorice | $0.07 gum drop | $0.05 gumball | $0.08 cinnamon candy | $0.01 peppermint candy | $0.08 lemon drop | $0.07 | Derek has $0.06. Does he have enough to buy a piece of licorice and a cinnamon candy? Options: (A) yes (B) no | Compute money cost |
| 3 | [TITLE]: None Number of offices | Number of chairs 1 | 2 2 | 4 3 | 6 4 | 8 5 | ? | Each office has 2 chairs. How many chairs are in 5 offices? | Multiplication |
| 4 | [TITLE]: None popcorn balls | $1/kilogram coffee cake | $3/kilogram blueberry bars | $2/kilogram cream cheese bars | $2/kilogram lemon bars | $3/kilogram | Sarah went to the store and bought 2 kilograms of blueberry bars. How much did she spend? (Unit: $) | Compute money cost |
| 5 | [TITLE]: None x | y 12 | 19 13 | 9 14 | 2 | The table shows a function. Is the function linear or nonlinear? Options: (A) linear (B) nonlinear | Compute rate of change |
| 6 | [TITLE]: Tractors Farmer | Number of tractors Farmer Judy | 4 Farmer Joe | 7 Farmer Megan | 7 Farmer Rick | 4 Farmer Jane | 4 | Some farmers compared how many tractors they own. What is the mode of the numbers? | Compute statistics |
| 7 | [TITLE]: None pink sweater | $6.69 pair of brown pants | $9.66 plaid scarf | $2.45 pair of sandals | $7.69 white polo shirt | $4.86 | How much money does Heather need to buy a pair of brown pants and a plaid scarf? (Unit: $) | Compute money cost |
| 8 | [TITLE]: Tour bus schedule Location | Arrive | Depart the riverfront | 9:55 A.M. | 10:20 A.M. the zoo | 10:35 A.M. | 11:30 A.M. art museum | 12:05 P.M. | 12:30 P.M. science museum | 1:00 P.M. | 1:45 P.M. skyscraper | 1:50 P.M. | 2:20 P.M. governor’s mansion | 2:50 P.M. | 3:45 P.M. old building | 4:00 P.M. | 4:45 P.M. famous bridge | 5:15 P.M. | 5:40 P.M. the aquarium | 6:20 P.M. | 7:00 P.M. landmark sculpture | 7:45 P.M. | 8:20 P.M. | Look at the following schedule. Which stop does the bus depart from at 11.30 A.M.? Options: (A) zoo (B) riverfront (C) old building (D) science museum | Reason time schedule |
| 9 | [TITLE]: None poppyseed muffin | $2.31 bowl of yogurt | $1.35 blueberry pancakes | $7.28 hash browns | $4.56 bowl of granola | $2.97 bagel with cream cheese | $2.56 | Max has $13.33. How much money will Max have left if he buys a bagel with cream cheese and blueberry pancakes? (Unit: $) | Compute money cost |
| 10 | [TITLE]: Balloons sold Day | Number of balloons Wednesday | 568 Thursday | 586 Friday | 558 Saturday | 565 | The manager of a party supply store researched how many balloons it sold in the past 4 days. On which day did the store sell the most balloons? Options: (A) Wednesday (B) Thursday (C) Friday (D) Saturday | Search minimum/maximum |
| 11 | [TITLE]: None forklift | $9,987.00 dump truck | $9,543.00 race car | $8,370.00 crane | $6,996.00 bulldozer | $7,547.00 hydrofoil | $8,047.00 | How much more does a forklift cost than a dump truck? (Unit: $) | Compute money cost |
Table 8: The closest examples to the 12 cluster centers computed by K-Means clustering method on reasoning skill latent variables.
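The clustering step behind Table 8 can be sketched as follows; `closest_examples` and the random latents are illustrative stand-ins for the learned reasoning-skill latent variables, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def closest_examples(latents: np.ndarray, n_clusters: int = 12) -> np.ndarray:
    """K-means over latent reasoning skills; return, for each cluster, the
    index of the example nearest its centroid (one row of Table 8 each)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(latents)
    # Distance from every example to every centroid, then argmin per cluster.
    dists = np.linalg.norm(latents[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    return dists.argmin(axis=0)

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 16))  # stand-in latent reasoning skills
idx = closest_examples(z)
print(len(idx))  # 12
```

Inspecting the questions and rationales at the returned indices is what surfaces the per-cluster skill labels shown in the table's last column.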
## Appendix E Theoretical Analysis
To prove Theorem 1, we start with the equation of rationale generation via CoT prompting, employing the skill-based demonstration selection method denoted as $g_{skill}$ . The process can be formalized as follows:
$$
P_{M}(R\mid Q,g_{skill})=\int_{\mathcal{X}^{k}}P_{M}(R\mid pt)\prod_{i=1}^{k}\left[g_{skill}(Q_{i},R_{i}\mid Q)\,d(Q_{i},R_{i})\right] \tag{5}
$$
Integrating Equation 5 by substituting $pt=(Q_{1},R_{1},\cdots,Q_{k},R_{k},Q)$ as outlined in Equation 3 leads to:
$$
P_{M}(R\mid Q,g_{skill})=\int_{\mathcal{Z}}P_{M}(R\mid z,Q)P_{M}(z\mid Q)\prod_{i=1}^{k}\left[P_{skill}(z\mid Q)\right]dz \tag{6}
$$
In this context, $P_{skill}(z\mid Q)$ is defined as:
$$
P_{skill}(z\mid Q)=\int_{(Q^{\prime},R^{\prime})\in\mathcal{X}}P_{M}(z\mid Q^{\prime},R^{\prime})g_{skill}(Q^{\prime},R^{\prime}\mid Q)\,d(Q^{\prime},R^{\prime}) \tag{7}
$$
Substituting Definition 1 into Equation 7 leads to:
$$
P_{skill}(z\mid Q)=\int_{(Q^{\prime},R^{\prime})\in\mathcal{X}}\int_{z^{\prime}\in\mathcal{Z}}P_{M}(z\mid Q^{\prime},R^{\prime})P_{E}(Q^{\prime},R^{\prime}\mid z^{\prime})P_{E}(z^{\prime}\mid Q)\,dz^{\prime}\,d(Q^{\prime},R^{\prime}) \tag{8}
$$
Applying Assumption 2 to the above equation and replacing $P_{M}(z\mid Q^{\prime},R^{\prime})$ with $P_{E}(z\mid Q^{\prime},R^{\prime})$:
$$
\begin{aligned}
P_{skill}(z\mid Q)&=\int_{(Q^{\prime},R^{\prime})\in\mathcal{X}}\int_{z^{\prime}\in\mathcal{Z}}P_{E}(z\mid Q^{\prime},R^{\prime})P_{E}(Q^{\prime},R^{\prime}\mid z^{\prime})P_{E}(z^{\prime}\mid Q)\,dz^{\prime}\,d(Q^{\prime},R^{\prime})\\
&=\int_{z^{\prime}\in\mathcal{Z}}\delta(z=z^{\prime})P_{E}(z^{\prime}\mid Q)\,dz^{\prime}\\
&=P_{E}(z\mid Q)
\end{aligned} \tag{9}
$$
Substituting the derived expression for $P_{skill}(z\mid Q)$ back into Equation 6, we arrive at:
$$
P_{M}(R\mid Q,g_{skill})=\int_{\mathcal{Z}}P_{M}(R\mid z,Q)P_{M}(z\mid Q)\prod_{i=1}^{k}\left[P_{E}(z\mid Q)\right]dz \tag{10}
$$
Taking the limit $k\rightarrow\infty$, the above equation simplifies to:
$$
P_{M}(R\mid Q,g_{skill})=\int_{\mathcal{Z}}P_{M}(R\mid z,Q)P_{E}(z\mid Q)\,dz \tag{11}
$$
Applying Assumption 2 to the above equation and replacing $P_{M}(R\mid z,Q)$ with $P_{E}(R\mid z,Q)$:
$$
P_{M}(R\mid Q,g_{skill})=\int_{\mathcal{Z}}P_{E}(R\mid z,Q)P_{E}(z\mid Q)\,dz=P_{E}(R\mid Q) \tag{12}
$$
By Assumption 1, the example bank approximates expert rationale generation, i.e., $P_{E}(R\mid Q)=P^{*}(R\mid Q)$. We then conclude:
$$
P_{M}(R\mid Q,g_{skill})=P^{*}(R\mid Q) \tag{13}
$$
Equation 13 means that CoT prompting under the skill-based demonstration selection method gives the optimal conditional distribution of rationales given questions, per Definition 2. This proves Theorem 1 under Assumptions 1 and 2.
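The key collapse in Equation 9 uses the identity $\int P_{E}(z\mid Q^{\prime},R^{\prime})P_{E}(Q^{\prime},R^{\prime}\mid z^{\prime})\,d(Q^{\prime},R^{\prime})=\delta(z=z^{\prime})$, which holds when the encoder assigns each example a single skill and the decoder is supported only on examples of that skill. A discrete sanity check with toy distributions (not the paper's models) illustrates it:

```python
import numpy as np

n_z, n_x = 4, 20  # toy numbers of skills and examples

# Deterministic encoder: example x carries exactly one skill z (toy assignment).
skill_of_x = np.arange(n_x) % n_z
P_z_given_x = np.eye(n_z)[skill_of_x].T                # (n_z, n_x), one-hot columns

# Decoder supported only on examples of skill z', uniform over them.
P_x_given_z = P_z_given_x.T / P_z_given_x.sum(axis=1)  # (n_x, n_z), columns sum to 1

# Discrete analogue of the integral: sum_x P_E(z|x) P_E(x|z') = delta(z = z').
delta = P_z_given_x @ P_x_given_z                      # (n_z, n_z)
print(np.allclose(delta, np.eye(n_z)))  # True
```

With a softer encoder or a decoder that leaks mass across skills, `delta` deviates from the identity, which is exactly where Assumption 2 carries the weight in the proof.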