## Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization
Miyoung Ko 1 * Sue Hyun Park 1 * Joonsuk Park 2,3,4 † Minjoon Seo 1 †
1 KAIST, 2 NAVER AI Lab, 3 NAVER Cloud, 4 University of Richmond
{miyoungko, suehyunpark, minjoon}@kaist.ac.kr park@joonsuk.org
## Abstract
Despite the advances in large language models (LLMs), how they use their knowledge for reasoning is not yet well understood. In this study, we propose a method that deconstructs complex real-world questions into a graph, representing each question as a node with predecessors of background knowledge needed to solve the question. We develop the DEPTHQA dataset, deconstructing questions into three depths: (i) recalling conceptual knowledge, (ii) applying procedural knowledge, and (iii) analyzing strategic knowledge. Based on a hierarchical graph, we quantify forward discrepancy , a discrepancy in LLM performance on simpler sub-problems versus complex questions. We also measure backward discrepancy where LLMs answer complex questions but struggle with simpler ones. Our analysis shows that smaller models exhibit more discrepancies than larger models. Distinct patterns of discrepancies are observed across model capacity and possibility of training data memorization. Additionally, guiding models from simpler to complex questions through multiturn interactions improves performance across model sizes, highlighting the importance of structured intermediate steps in knowledge reasoning. This work enhances our understanding of LLM reasoning and suggests ways to improve their problem-solving abilities.
## 1 Introduction
With the rapid advancement of Large Language Models (LLMs), research interest has increasingly centered on their reasoning capabilities, particularly in solving complex questions. While many studies have assessed the general reasoning capabilities of LLMs (Wei et al., 2022a; Qin et al., 2023; Srivastava et al., 2023), the specific aspect of how these models recall and then utilize factual knowledge during reasoning has not been thoroughly explored. Some research (Dziri et al., 2023; Press et al., 2023; Wang et al., 2024) concentrates on straightforward reasoning tasks, such as combining and comparing simple biographical facts, to investigate the implicit reasoning skills of LLMs. However, real-world questions often demand more intricate reasoning processes that cannot be easily broken down into simple factual units. For instance, as presented in Figure 1, to answer 'Why does ReLU training take less time than sigmoid or tanh training?', one must not only recall what an activation function is but also compare the characteristics of activation functions and understand the causal relationship between gradients and training speed. This type of reasoning requires drawing conclusions beyond simply aggregating facts.

* Equal contribution.
† Equal advising.

Figure 1: Example of reasoning across depths, showing a sequence of questions from D 1 (conceptual knowledge) to D 3 (strategic knowledge). Questions that ask deeper levels of knowledge require reasoning from multiple areas of shallower knowledge, which are represented as sub-questions.
<details>
<summary>Image 1 Details</summary>

Knowledge hierarchy diagram categorizing example questions as D1 conceptual knowledge ("What is it?"), D2 procedural knowledge ("How can it be used?"), and D3 strategic knowledge ("Why can it be used?"). The D1 examples cover gradients, training speed, activation functions, backpropagation, and the vanishing gradient problem; a D2 example asks how the gradients of activation functions affect the speed of neural network training; the D3 target question asks why ReLU training takes less time than sigmoid or tanh training. Arrows from the D1 and D2 questions to the D3 question mark them as prerequisites.

</details>
To analyze the reasoning ability of LLMs in solving real-world questions, we propose deconstructing complex questions into a graph structure. In this structure, each node is represented by a question that signifies a specific level of knowledge. We adopt Webb's Depth of Knowledge (Webb, 1997, 1999, 2002), which assesses both the content and the depth of understanding required. Webb's Depth of Knowledge categorizes questions into three levels: mere recall of information ( D 1 ), application of knowledge ( D 2 ), and strategic thinking ( D 3 ). The transition from shallower to deeper nodes involves applying the knowledge gained from shallower nodes and performing reasoning to tackle harder problems. This approach emphasizes the gradual accumulation and integration of knowledge to address real-world problems effectively.

Figure 2: Hierarchical structure of a deconstructed D 3 question, illustrating forward and backward discrepancies. Transitioning to deeper nodes requires acquiring and reasoning with knowledge from the connected shallower nodes.
<details>
<summary>Image 2 Details</summary>

Diagram deconstructing a D3 question into D1 and D2 nodes ordered from the simplest to the most complex question, with arrows marking dependencies between them. A forward-discrepancy inset shows direct predecessors answered correctly while the deeper target node fails (reasoning fails to form complex knowledge); a backward-discrepancy inset shows direct successors answered correctly while the shallower target node fails (a simpler question fails despite possession of the complex knowledge).

</details>
We introduce the resulting DEPTHQA, a collection of deconstructed questions and answers derived from human-written, scientific D 3 questions in the TutorEval dataset (Chevalier et al., 2024). The target complex questions are in D 3 , and we examine the utilization of multiple layers of knowledge and reasoning in the sequence of D 1 , D 2 , and D 3 . Figure 2 illustrates how the deconstruction process results in a hierarchical graph connecting D 1 to D 3 questions. Based on the hierarchical structure, we first measure forward reasoning gaps, denoted as forward discrepancy , which are differences in LLM performance on simpler subproblems compared to more complex questions requiring advanced reasoning. Additionally, we introduce backward discrepancy , which quantifies inconsistencies where LLMs can successfully answer complex inquiries but struggle with simpler ones. This dual assessment provides a comprehensive evaluation of the models' reasoning capabilities across different levels of complexity.
Using DEPTHQA, we investigate the knowledge reasoning ability of various instruction-tuned LLMs in the LLaMA 2 (Touvron et al., 2023), LLaMA 3 (AI@Meta, 2024), Mistral (Jiang et al.,
2023), and Mixtral (Jiang et al., 2024) family, varying in size from 7B to 70B. We compare the relationship between model capacities and depthwise discrepancies, showing that smaller models exhibit larger discrepancies in both directions. We further analyze how reliance on memorization of training data affects discrepancy, revealing that forward and backward discrepancies in large models originate from distinct types of failures. Finally, to examine the importance of structured intermediate steps in reasoning, we gradually guide models from simpler to more advanced questions through multi-turn interactions, consistently improving performance across various model sizes.
The contributions of our work are threefold:
- We propose to connect complex questions with simpler sub-questions by deconstructing questions based on depth of knowledge.
- We design the DEPTHQA dataset to evaluate LLMs' capability to form complex knowledge through reasoning. We measure forward and backward reasoning discrepancies across different levels of question complexity. 1
- We investigate the reasoning abilities of LLMs with various capacities, analyzing the impact of model size and training data memorization on discrepancies. We demonstrate the benefits of structured, multi-turn interactions to perform complex reasoning.
## 2 Related Work
Recent advancements have highlighted the impressive reasoning abilities of transformer language models across a wide range of tasks (Wei et al., 2022a; Zhao et al., 2023). Despite the success, numerous studies have found that these models often struggle with various types of reasoning, such as commonsense and logical reasoning (Qin et al., 2023; Srivastava et al., 2023). Even advanced models like GPT-4 (Achiam et al., 2023) have been noted to struggle with implicit reasoning over their internal knowledge, especially when it comes to effectively combining multiple steps to solve compositionally complex problems (Talmor et al., 2020; Rogers et al., 2020; Allen-Zhu and Li, 2023; Yang et al., 2024; Wang et al., 2024).
To address these challenges, several studies have focused on better Chain-of-Thought-style prompting or fine-tuning LLMs to verbalize the intermediate steps of knowledge and reasoning during inference (Nye et al., 2021; Wei et al., 2022b; Kojima et al., 2022; Wang et al., 2022; Sun et al., 2023; Wang et al., 2023b; Liu et al., 2023). This approach has significantly improved performance, especially in larger models with strong generalization capabilities. Theoretical and empirical studies investigate the advantages of verbalizations, highlighting their role in enhancing the reasoning capabilities of language models (Feng et al., 2023; Wang et al., 2023a; Li et al., 2024). The analysis of step-by-step reasoning abilities has matured further based on ontological (Saparov and He, 2023) and mechanistic perspectives (Hou et al., 2023a; Dutta et al., 2024).

1 We release our dataset and code at github.com/kaistAI/knowledge-reasoning .
In our proposed dataset, the most complex questions often necessitate implicit intermediate steps to reach a conclusion, which can benefit from explicit verbalized reasoning. However, unlike previous works, our setup does not elicit a detailed step-by-step explanation within the answer to a complex question. Instead, we represent the intermediate steps for a complex question in the form of sub-questions and gather answers to every sub-question, testing a model's understanding of intermediate knowledge individually. Our approach is similar to strategic question answering with intermediate answers (Geva et al., 2021; Press et al., 2023), but we further ensure a hierarchy of decompositions based on knowledge complexity. This allows us to examine discrepancies between questions of varying complexities, providing a distinct assessment of multi-step reasoning abilities.
Another line of work focuses on understanding transformers' knowledge and reasoning through controlled experiments (Chan et al., 2022; Akyürek et al., 2023; Dai et al., 2023; von Oswald et al., 2022; Prystawski et al., 2023; Feng and Steinhardt, 2024). Numerous studies on implicit reasoning aim to identify latent reasoning pathways, but most have focused on simple synthetic tasks or toy models (Nanda et al., 2023; Conmy et al., 2023; Hou et al., 2023b), or have evaluated models through the binary accuracy of short-form predictions without considering intermediate steps (Yang et al., 2024; Wang et al., 2024). Our DEPTHQA, in contrast, challenges a model to answer complex real-world questions that require diverse reasoning types in long-form text. DEPTHQA further requires diverse types of reasoning across different depths, such as inductive and procedural reasoning, in addition to the comparative and compositional reasoning explored in prior studies (Press et al., 2023; Allen-Zhu and Li, 2023; Wang et al., 2024). This approach provides a more practical and nuanced assessment of a model's reasoning capabilities.
## 3 Graph-based Reasoning Framework
We develop a novel graph-based representation that delineates the dependencies between different levels of knowledge. We represent nodes as questions (Section 3.1) and edges as reasoning processes (Section 3.2). Based on the graph definition, we construct a dataset that encompasses diverse concepts and reasoning types (Section 3.3).
## 3.1 Knowledge Depth in Nodes
We represent each node as a question tied to a specific layer of knowledge. As our approach to addressing real-world problems emphasizes the gradual accumulation of knowledge, similar to educational goals, we adopt Webb's Depth of Knowledge (DOK) (Webb, 1997, 1999, 2002), widely used in educational settings, to categorize the level of questions. The depth-of-knowledge levels D k ( k ∈ {1, 2, 3}) 2 of questions are defined as follows:
- D 1 . Factual and conceptual knowledge : The question involves the acquisition and recall of information, or following a simple formula, focusing on what the knowledge entails.
- D 2 . Procedural knowledge : The question necessitates the application of concepts through the selection of appropriate procedures and step-by-step engagement, concentrating on how the knowledge can be utilized.
- D 3 . Strategic knowledge : The question demands analysis, decision-making, or justification to address non-routine problems, emphasizing why the knowledge is applicable.
The levels can be viewed as ceilings that establish the extent or depth of an assessee's understanding (Hess, 2006), a concept recognized as a valuable assessment tool in educational contexts (Hess et al., 2009). Accordingly, we associate simpler questions with shallower depths and more complex questions with deeper depths.
2 We exclude the highest level in the original Webb's DOK, D 4 , as this level often includes interactive or creative activities and is rare or even absent in most standardized assessments (Webb, 2002; Hess, 2006).
## 3.2 Criteria for Reasoning in Edges
To conceptualize how simpler knowledge contributes to the development of complex knowledge, we define edges in our framework as transitions from a node at D k to at least one direct successor node at D k+1 . 3 We perceive that advancing to deeper knowledge often requires synthesizing multiple aspects of simpler knowledge. Thus, a D k node should connect to multiple direct predecessor D k-1 nodes. This configuration establishes hierarchical dependencies among D 1 , D 2 , and D 3 questions, effectively modeling the progression needed to deepen understanding and engage with higher-order knowledge (see graph in Figure 2). Additionally, we establish three criteria to ensure that edges accurately represent the reasoning processes from shallower questions.
- C1 . Comprehensiveness: Questions at lower levels should aim to cover all foundational concepts necessary to answer a question at higher levels. This ensures that no critical knowledge gaps exist as the complexity increases.
- C2 . Implicitness: Questions at lower levels should avoid directly revealing answers or heavily hinting at solutions for higher-level questions. This encourages independent reasoning relying on the synthesis of implicit connections between nodes rather than straightforward clues.
- C3 . Non-binary questioning: Questions should elicit detailed, exploratory responses instead of simple yes/no answers. Given that LLMs may have an inherent positivity bias which leads them to prefer affirmative responses (Augustine et al., 2011; Dodds et al., 2015; Papadatos and Freedman, 2023), this helps in evaluating deep reasoning abilities beyond superficial or biased reasoning.
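For illustration only (this is not the authors' released code, and all names are ours), the depth-annotated question graph implied by this framework can be sketched as a small Python structure in which edges are only permitted from a node at depth k to a successor at depth k + 1:

```python
from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    """A question tied to a Webb's DOK depth (1 = conceptual, 2 = procedural, 3 = strategic)."""
    qid: str
    depth: int
    text: str
    predecessors: list = field(default_factory=list)  # direct D_{k-1} parents

def add_edge(shallower: QuestionNode, deeper: QuestionNode) -> None:
    """Connect a D_k node to a direct successor at D_{k+1}; reject any other transition."""
    if deeper.depth != shallower.depth + 1:
        raise ValueError("edges must go from depth k to depth k+1")
    deeper.predecessors.append(shallower)

# Toy example mirroring Figure 1
d1 = QuestionNode("d1-q1", 1, "What does the gradient of a function represent?")
d2 = QuestionNode("d2-q1", 2, "How do gradients of activation functions affect training speed?")
d3 = QuestionNode("d3-q", 3, "Why does ReLU training take less time than sigmoid or tanh?")
add_edge(d1, d2)
add_edge(d2, d3)
```

In a full graph, each deeper node would accumulate several predecessors, matching the requirement that a D k node synthesize multiple D k-1 questions.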
## 3.3 Dataset: DEPTHQA
We create DEPTHQA, a new question answering dataset for testing graph-based reasoning. The dataset is constructed in a top-down approach, deconstructing D 3 nodes into D 2 nodes, then into D 1 , creating multiple edges at each step (Table 1). We design the construction process to meticulously backtrack the knowledge necessary for complex questions while meeting our three criteria for representing reasoning transitions.

3 A foundational concept may apply to multiple advanced questions.

Table 1: Statistics of DEPTHQA.
| Domain | # Questions | # Questions | # Questions | # Edges between questions | # Edges between questions |
|-----------------------|---------------|---------------|---------------|-----------------------------|-----------------------------|
| | D 1 | D 2 | D 3 | D 1 → D 2 | D 2 → D 3 |
| Math | 573 | 193 | 49 | 774 | 196 |
| Computer Science | 163 | 54 | 14 | 212 | 55 |
| Environmental Science | 147 | 44 | 11 | 175 | 44 |
| Physics | 140 | 40 | 10 | 154 | 40 |
| Life Sciences | 98 | 28 | 7 | 111 | 28 |
| Math → {CS, Physics} | - | - | - | 11 | 0 |
| Total | 1,121 | 359 | 91 | 1,437 | 363 |
D 3 question curation We select real-world questions from the TutorEval (Chevalier et al., 2024) dataset, which contains human-crafted queries based on college-level mathematical and scientific content from textbooks 4 available on libretexts.org . Note that while these textbooks may be part of models' pre-training data due to their online availability, TutorEval's human-written questions challenge models to generalize familiar concepts beyond direct training examples. We procure only complex D 3 questions from TutorEval, identifying them using GPT-4 Turbo 5 (Achiam et al., 2023) with guidance on depth of knowledge levels. From an initial set of 834 questions, we manually refine our selection to 91 clear, self-contained D 3 questions. We use GPT-4 Turbo to generate reference answers for each TutorEval question 6 , based on the original context and the model's self-annotated depth of knowledge. These reference answers are guided by the ground-truth key points provided by the author of each question.
Question deconstruction For each D k question, we use GPT-4 Turbo to generate up to four D k-1 questions. The prompt includes definitions for all three knowledge depths and decomposition examples to guide the deconstruction process. We provide each D k question with its reference answer to ensure the extracted knowledge remains relevant to the more challenging question, adhering to C1 (Comprehensiveness) . We set the number of decompositions to four based on qualitative analysis, balancing comprehensiveness and implicitness: outlining every implicit reasoning step enhances comprehensiveness but may reduce implicitness,
4 Textbooks are designed with a scaffolding approach to knowledge development.
5 We use the gpt-4-0125-preview version for GPT-4 Turbo throughout this work, including data construction, verification, and experiments.
6 Chevalier et al. (2024) reports that GPT-4 excels in solving TutorEval problems with 92% correctness.
Table 2: Representative examples of required reasoning skills in D 3 and D 2 . The % of instances within each depth that include the reasoning type is reported. Note that multiple reasoning types can be included in a single question.
| Depth | Reasoning type | Example question | % |
|---------|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| | Comparative | In the context of computer programming, what is the difference between for and while, are they always exchangeable? Show me some cases where one is strongly preferred to the other. | 21.1 |
| 3 | Causal | How does deflection of hair cells along the basilar membrane result in different perceived sound frequencies? | 10.5 |
| | Inductive | How could a process satisfying the first law of thermodynamics still be impossible? | 8.8 |
| | Criteria Development | Explain if a matrix always has a basis of eigenvectors. | 8.8 |
| | Relational | What factors influence the time complexity of searching for an element in a data structure? | 22.6 |
| 2 | Procedural | Describe the process involved in solving cubic equations using the cubic formula. | 13.4 |
| | Application | How can sustainable agricultural practices contribute to food security and economic development in developing countries? | 7.3 |
and vice versa. Our prompt instructions carefully address this tradeoff to satisfy C2 (Implicitness) .
Deduplication and question augmentation We identified redundancies in knowledge and reasoning processes, where similar content appeared across different D 1 nodes linked to the same D 2 node, or between unconnected D 1 and D 2 nodes (example in Table 4). To address this, we utilize a Sentence Transformers embedding model 7 (Reimers and Gurevych, 2019) to detect and remove near-duplicate questions based on the cosine similarity of their embeddings. We then employ GPT-4 Turbo to generate new, targeted questions and answers, filling any gaps in knowledge coverage. This approach reduced the misclassification of D 1 questions as D 2 by 88%, markedly enhancing C2 (Implicitness) . It also decreased the total number of near-duplicates by 88%, further improving C1 (Comprehensiveness) . We subsequently update our graph data structure with these modifications.
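As a rough sketch of the near-duplicate detection step (using toy 2-dimensional vectors in place of actual all-mpnet-base-v2 sentence embeddings, and an assumed similarity threshold):

```python
import numpy as np

def near_duplicates(embeddings: np.ndarray, threshold: float = 0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds the threshold."""
    # Normalize rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Two nearly identical "question embeddings" and one distinct one
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(near_duplicates(embs))  # the first two questions are flagged as near-duplicates
```

In practice, the flagged pair would be pruned and a replacement question generated to restore knowledge coverage.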
Question debiasing Lastly, we undertake the task of manually rewriting 53 questions that originally invoke binary 'yes' or 'no' answers, ensuring C3 (Non-binary Questioning) . For example, a question that begins with 'If I understand correctly...' is transformed into 'Clarify my understanding that...', prompting the model to directly engage in analytical thinking rather than relying on simple affirmations or negations of the correctness.
Verification of hierarchy We conduct human annotation to verify the three criteria that shape the reasoning hierarchy, reporting positive results in Appendix B. On 27.5% of DEPTHQA, an average of 83.5% of relations are fully comprehensive and 89.5% of sub-questions are fully implicit, with 98.7% of all questions being non-binary. Further details and examples of the construction process are in Appendix A. Prompts are in Appendix J.1.

7 sentence-transformers/all-mpnet-base-v2
## 3.4 Diversity of Reasoning Processes
Using a sample of 20 D 3 questions along with their interconnected 80 D 2 and 320 D 1 questions, we examine the types of reasoning needed to progress from basic to complex knowledge levels. We discover that nearly all questions necessitate the identification and extraction of several pieces of relevant information to synthesize comprehensive answers. Table 2 displays examples of questions requiring advanced reasoning skills, such as interpreting relationships between concepts, applying specific conditions, and handling assumptions, demonstrating that basic knowledge manipulation is insufficient. This diversity in reasoning types within our dataset robustly challenges LLMs to demonstrate sophisticated cognitive abilities. Detailed statistics and additional examples of reasoning types are provided in Appendix D.
## 4 Experiments
In this section, we present experiments on the depthwise reasoning ability of LLMs using DEPTHQA. We first explain the evaluation metrics and models (Section 4.1). We then present overall depthwise and discrepancy evaluation results (Section 4.2), the impact of memorization on knowledge reasoning (Section 4.3), and the effect of enforcing knowledge-enhanced reasoning via multi-turn or prompt inputs (Section 4.5).
## 4.1 Experiment Setup
Depthwise evaluation For each question q k with depth k ( D k ), we score the factual correctness of the predicted answer on a scale from 1 to 5. We employ the LLM-as-a-Judge approach, which correlates highly with human judgments when scoring long-form responses (Zheng et al., 2024; Kim et al., 2024a; Lee et al., 2024; Kim et al., 2024b). Specifically, we utilize GPT-4 Turbo (Achiam et al., 2023) for absolute scoring. Following Kim et al. (2024a) and Lee et al. (2024), the model generates a score and detailed feedback for each question, reference answer, and prediction based on a defined scoring rubric. Further details on the evaluation process are provided in Appendix E. The exact input prompt for the LLM judge, including the accuracy score rubric, is in Appendix J.3. The reliability of the LLM evaluation results in our setting is evidenced by high agreement with human annotations, as explained in Appendix F. We report average accuracy at D k , the average factual correctness of questions at depth k .

Table 3: Depthwise reasoning performance of large language models. Bold indicates the best-performing model, and underline represents the second-best performance. A darker color indicates a higher discrepancy.
| Model | Average Accuracy ↑ | Average Accuracy ↑ | Average Accuracy ↑ | Average Accuracy ↑ | Forward Discrepancy ↓ | Forward Discrepancy ↓ | Forward Discrepancy ↓ | Backward Discrepancy ↓ | Backward Discrepancy ↓ | Backward Discrepancy ↓ |
|----------------------------|----------------------|----------------------|----------------------|----------------------|-------------------------|-------------------------|-------------------------|--------------------------|--------------------------|--------------------------|
| Model | D 1 | D 2 | D 3 | Overall | D 2 → D 3 | D 1 → D 2 | Overall | D 2 → D 3 | D 1 → D 2 | Overall |
| LLaMA 2 7B Chat | 3.828 | 3.320 | 3.165 | 3.673 | 0.130 | 0.181 | 0.176 | 0.219 | 0.110 | 0.134 |
| LLaMA 2 13B Chat | 4.289 | 3.872 | 3.615 | 4.155 | 0.152 | 0.158 | 0.157 | 0.126 | 0.078 | 0.088 |
| LLaMA 2 70B Chat | 4.495 | 4.153 | 4.022 | 4.390 | 0.126 | 0.136 | 0.134 | 0.136 | 0.063 | 0.079 |
| Mistral 7B Instruct v0.2 | 4.280 | 3.897 | 4.000 | 4.176 | 0.092 | 0.157 | 0.147 | 0.144 | 0.070 | 0.088 |
| Mixtral 8x7B Instruct v0.1 | 4.599 | 4.532 | 4.429 | 4.574 | 0.087 | 0.079 | 0.081 | 0.063 | 0.063 | 0.063 |
| LLaMA 3 8B Instruct | 4.482 | 4.351 | 4.286 | 4.440 | 0.083 | 0.096 | 0.093 | 0.088 | 0.072 | 0.075 |
| LLaMA 3 70B Instruct | 4.764 | 4.749 | 4.648 | 4.754 | 0.065 | 0.050 | 0.053 | 0.043 | 0.044 | 0.044 |
| GPT-3.5 Turbo | 4.269 | 4.251 | 4.011 | 4.250 | 0.100 | 0.072 | 0.078 | 0.046 | 0.067 | 0.063 |
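For illustration, aggregating judge scores into per-depth average accuracy can be sketched as follows (the record layout is our assumption, not the authors' code):

```python
from collections import defaultdict

def average_accuracy_by_depth(records):
    """records: iterable of (question_id, depth, score) tuples with judge scores in [1, 5]."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _, depth, score in records:
        totals[depth] += score
        counts[depth] += 1
    # Mean factual correctness per depth
    return {d: totals[d] / counts[d] for d in sorted(totals)}

scores = [("q1", 1, 5), ("q2", 1, 4), ("q3", 2, 3), ("q4", 3, 4)]
print(average_accuracy_by_depth(scores))  # {1: 4.5, 2: 3.0, 3: 4.0}
```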
Discrepancy evaluation As we deconstruct complex questions into a hierarchical graph, we can measure forward discrepancy and backward discrepancy between neighboring questions. Forward discrepancy measures the drop in performance on deeper questions requiring advanced reasoning relative to their simpler sub-problems. Given a question q k at D k ∈ { 2 , 3 } , let DP ( q k ) represent the set of direct predecessor questions at D k-1 . The forward discrepancy for q k is then defined as follows:
$$\mathrm{ForwardDisc}(q_k) = \max\left(0,\; \frac{1}{|DP(q_k)|} \sum_{q' \in DP(q_k)} f(q') - f(q_k)\right)$$
where f is a function that outputs the factual correctness of a question's answer, as measured by an LLM evaluator. Backward discrepancy , conversely, quantifies inconsistencies where LLMs successfully answer deeper questions but struggle with shallower ones. Given a question q k with D k ∈ { 1 , 2 } , let DS ( q k ) represent the set of direct successor questions at D k+1 . The backward discrepancy is then defined as follows:
$$\mathrm{BackwardDisc}(q_k) = \max\left(0,\; \frac{1}{|DS(q_k)|} \sum_{q' \in DS(q_k)} f(q') - f(q_k)\right)$$
Both forward discrepancy and backward discrepancy are normalized to the range [0, 1] by dividing by the maximum possible score gap, which is 4 given our scoring range of 1 to 5. To highlight gaps across depths, we set a strict accuracy threshold of 4 and report average discrepancies only for examples where the mean score over DP ( q k ) (for forward discrepancy) or DS ( q k ) (for backward discrepancy) exceeds this threshold. This excludes cases where models perform poorly at both depths.
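Under these definitions, the normalized discrepancies can be computed as in the following sketch (assuming, as the [0, 1] range implies, that negative differences are clipped to zero; `max_gap=4` reflects the 1-to-5 scoring range):

```python
def forward_discrepancy(predecessor_scores, score, max_gap=4.0):
    """Discrepancy of a D2/D3 question vs. its direct predecessors:
    positive when the model solves the shallower sub-questions better
    than the deeper question itself."""
    mean_pred = sum(predecessor_scores) / len(predecessor_scores)
    return max(0.0, mean_pred - score) / max_gap

def backward_discrepancy(successor_scores, score, max_gap=4.0):
    """Discrepancy of a D1/D2 question vs. its direct successors:
    positive when the model answers the deeper questions better than
    the shallower question itself."""
    mean_succ = sum(successor_scores) / len(successor_scores)
    return max(0.0, mean_succ - score) / max_gap
```

With the accuracy threshold of 4, a question would contribute to the reported averages only when the mean predecessor (or successor) score exceeds 4.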
Models We mainly probe into the depthwise knowledge reasoning ability of open-source LLMs. We test representative open-source models based on the LLaMA (Touvron et al., 2023) architecture, including LLaMA 2 {7B, 13B, 70B} Chat (Touvron et al., 2023), Mistral 7B Instruct v0.2 (Jiang et al., 2023), Mixtral 8x7B Instruct v0.1 (Jiang et al., 2024), and LLaMA 3 {8B, 70B} Instruct (AI@Meta, 2024). Additionally, we include the latest GPT-3.5 Turbo 8 (OpenAI, 2022) to compare the performance of these open-source models against a proprietary model.
## 4.2 Depthwise Knowledge Reasoning Results
## Larger models exhibit smaller discrepancies.
Table 3 presents the overall depthwise reasoning performance of LLMs. As anticipated, solving questions at D 3 is the most challenging, showing the lowest average accuracy for all models. LLaMA 3 70B Instruct demonstrates the best performance across all depths, with Mixtral 8x7B Instruct achieving the second-best results. LLaMA 3 70B Instruct also exhibits the lowest forward and backward discrepancies, effectively answering questions at all depths with minimal inconsistency. Conversely, the least capable model, LLaMA 2 7B Chat, shows the lowest average accuracy along with the highest forward and backward discrepancies. Note that the relatively low forward discrepancy from D 1 → D 2 for LLaMA 2 7B Chat is due to its low performance at D 2 . This observation highlights the varying capabilities of different LLMs in handling questions at different depths and the inconsistencies in reasoning across depths.

8 gpt-3.5-turbo-0125

Figure 3: Memorization analysis with Min-K% probability. (a)-(d) show the distribution of average Min-K% probabilities at each depth. (e)-(g) present the distribution of score differences between neighboring questions whose Min-K% probability is in the bottom 25% or top 75% quantile. A positive gap indicates backward discrepancy, while a negative gap represents forward discrepancy.
Contrasting patterns of discrepancies We observe distinct patterns when analyzing forward and backward discrepancies separately. These discrepancies can be understood as a product of intensity (the magnitude of the discrepancy) and frequency (the proportion of questions showing a positive discrepancy). Frequency indicates how often forward or backward discrepancy occurs, while intensity reflects the strength of the discrepancy when it happens. Our analysis shows that forward discrepancy tends to occur more frequently but with lower intensity. For example, LLaMA 3 8B Instruct exhibits a forward intensity of 0.225 with a frequency of 41.44%. In contrast, backward discrepancy is less common but has a higher intensity when it appears: LLaMA 3 8B Instruct shows an intensity of 0.323 with a frequency of 23.32% for backward discrepancies. The intensity and frequency for all models are provided in Appendix G.
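The intensity/frequency decomposition can be made concrete with a small helper (a sketch over per-question discrepancy values, not the authors' exact implementation):

```python
def intensity_and_frequency(discrepancies):
    """Split per-question discrepancies into how often they occur
    (frequency: share of questions with a positive discrepancy) and
    how strong they are when they occur (intensity: mean positive value)."""
    positive = [d for d in discrepancies if d > 0]
    frequency = len(positive) / len(discrepancies)
    intensity = sum(positive) / len(positive) if positive else 0.0
    return intensity, frequency
```

Note that the average discrepancy then factors as intensity × frequency; for instance, 0.225 × 0.4144 ≈ 0.093, consistent with LLaMA 3 8B Instruct's overall forward discrepancy in Table 3.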
## 4.3 Memorization in Depthwise Knowledge Reasoning
## 4.3.1 Depthwise Memorization
To determine whether solving complex questions requires reasoning rather than memorization of training data, we use a pre-training data detection method to approximate potential aspects of memorization. Following Shi et al. (2023), we compare the Min-K% probability within models. Higher values suggest a smaller possibility of predictions directly existing in the training data. To elaborate, Min-K% probability is calculated by averaging the negative log-likelihood of the K% least probable tokens in the model's predictions. In the case where a given prediction was seen during training, outlier words with low probabilities would appear less frequently, resulting in high probabilities for the K% tokens. Since Min-K% probability is the average negative log-likelihood of such tokens, the resulting value would be lower in this case. 9
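The Min-K% probability computation described above can be sketched as follows (a minimal illustration assuming per-token log-probabilities from the model are already available):

```python
def min_k_percent_prob(token_logprobs, k=20):
    """Average negative log-likelihood of the k% least-probable tokens.

    Text seen during training tends to contain few low-probability
    outlier tokens, so its Min-K% value is lower; higher values suggest
    the text is less likely to appear verbatim in the training data."""
    n = max(1, int(len(token_logprobs) * k / 100))
    least_probable = sorted(token_logprobs)[:n]  # lowest log-probabilities
    return -sum(least_probable) / n
```

For example, a single outlier token with log-probability -5.0 in a short sequence dominates the bottom-k% set and yields a high Min-K% value.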
Models rely less on memorization for complex questions. Figure 3 (a)-(d) presents the depthwise average of the Min-K% probability for four models. We observe that as the depth increases, the Min-K% probability also increases for all models. This indicates that answering questions based on simple conceptual knowledge corresponding to D 1 is more likely to be solved by recalling training data. While shallow questions ( D 1 ) can be addressed through memorization, solving deeper questions ( D 3 ) requires more than just recalling a single piece of memorized knowledge, indicating a need for genuine reasoning capabilities.

9 For our calculations, we set K to 20 and use a sequence length of 128.
## 4.3.2 Memorization Gap between Depths
Further analysis of questions in the bottom 25% and top 75% quantiles of the Min-K% probability distribution provides additional insights. Since higher Min-K% probabilities indicate a smaller chance of appearing in the training data, questions in the bottom 25% are more likely to have been memorized, while those in the top 75% are less likely. Figure 3 (e)-(g) shows the score difference between neighboring questions ( D 2 → D 3 ) whose Min-K% probability is in the bottom 25% or top 75%. We calculate the memorization gap as the difference between the factual correctness of D 3 and D 2 , normalized by the maximum gap of 4. A positive value indicates higher factual accuracy for the deeper question, signifying backward discrepancy, while a negative value indicates higher accuracy for the shallower question, representing forward discrepancy.
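The quantile split and normalized gap can be sketched as follows (illustrative only; `pairs` holds, for each D 2 → D 3 neighbor pair, the Min-K% probability together with the D 2 and D 3 scores):

```python
def memorization_gaps(pairs, max_gap=4.0):
    """pairs: list of (min_k_prob, d2_score, d3_score) per neighbor pair.
    Returns normalized D3 - D2 score gaps for the bottom 25% and the
    top quantile of the Min-K% probability distribution."""
    ranked = sorted(pairs, key=lambda p: p[0])
    cut_low = len(ranked) // 4           # bottom-25% boundary
    cut_high = (3 * len(ranked)) // 4    # top-quantile boundary
    bottom = [(d3 - d2) / max_gap for _, d2, d3 in ranked[:cut_low]]
    top = [(d3 - d2) / max_gap for _, d2, d3 in ranked[cut_high:]]
    return bottom, top
```

Positive gaps in either list correspond to backward discrepancy, negative gaps to forward discrepancy.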
Variance of gaps We observe that the model with the smallest capacity, LLaMA 2 7B Chat, exhibits large variances in both positive and negative directions, showing significant forward and backward discrepancies. In contrast, models with larger capacities, such as LLaMA 2 70B Chat and LLaMA 3 70B Instruct, demonstrate smaller variances.
Potential causes of discrepancies Additionally, models with larger capacities tend to show relatively higher forward discrepancies (distributions concentrated on the negative side) for the top 75% examples, which rely less on memorization. On the other hand, the bottom 25% distribution is concentrated on positive values, indicating relatively more backward discrepancies. This suggests that as model capacity increases, failures in knowledge reasoning result in forward discrepancies, while failures due to reliance on memorization may lead to backward discrepancies. The depthwise Min-K% probability and score differences for other models are provided in Appendix H.
## 4.4 Qualitative Analysis of Backward Discrepancy
To better understand how the more counterintuitive inconsistency, backward discrepancy, can emerge, we qualitatively analyze backward discrepancy cases from the weakest model in our experiments, LLaMA 2 7B Chat, and the strongest, LLaMA 3 70B Instruct. The examples we refer to in the following paragraphs are listed in Appendix I.
We observe that backward discrepancies often stem from the models' ability to articulate high-level concepts while struggling to translate this understanding into precise, step-by-step procedures, particularly when mathematical operations are involved. This is illustrated in Example 1, where both models explain the importance of continued fraction representation for tangle numbers well ( D 3 ) but fail to accurately describe the process of constructing a tangle for a given number ( D 2 ).
In backward discrepancy cases, answers to deeper questions are more likely to be text-based and conceptual, making them easier for models to memorize. In contrast, shallower questions require the execution of mathematical or logical operations, where the variability of the elements makes answers harder to memorize verbatim. This elucidates the memorization effects on backward discrepancy analyzed in Section 4.3.2.
Interestingly, we also observe how the degree of memorization contributing to backward discrepancy can vary with model capacity. Example 2 shows LLaMA 2 7B Chat accurately reasoning about time complexity ( D 3 ) but introducing nonstandard terminology for specific operations ( D 2 ), suggesting the model's struggle with precise recall of basic concepts. Conversely, Example 3 demonstrates LLaMA 3 70B Instruct correctly recalling a complex formula ( D 3 ) but failing to apply it practically ( D 2 ). This indicates that the model can extensively memorize information but still struggle with its flexible application. This observation exemplifies why variance of memorization gaps can differ by model capacity, as described in Section 4.3.2.
## 4.5 Effect of Explicit Reasoning Process
In this study, as presented in Figure 1 (a), D 3 questions can be solved through sequential reasoning, utilizing answers from D 1 and D 2 questions. Previous studies on explicit reasoning (Wei et al., 2022b; Press et al., 2023; Zhou et al., 2023) have shown that prompting LLMs to reason through intermediate steps explicitly can improve their reasoning ability. We investigate whether explicitly providing these reasoning processes to the model can aid in solving complex questions.

Figure 4: Performance change after providing shallower questions. Note that D 1 is not reported for prompt inputs, as D 1 does not have shallower questions.
We encourage the model to reason by providing shallower questions in three ways: (i) Multi-turn , where shallower questions are provided as user queries in a multi-turn conversation; (ii) Prompt (Gold) , where shallower questions and their gold answers are provided in the prompt; (iii) Prompt (Pred.) , where shallower questions with the model's own predictions are given in the prompt. Note that the prompt-based approaches require shallower QA pairs as input, so they cannot be applied to D 1 questions. The prompt template for each approach is provided in Appendix J.2.
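The three input conditions can be sketched as follows (an illustrative construction using a chat-style message format; the exact templates are those of Appendix J.2, not shown here):

```python
def multiturn_input(shallower_qas, target_question):
    """Multi-turn: each shallower question is a prior user turn, with the
    model's own earlier answer as the assistant turn."""
    messages = []
    for question, model_answer in shallower_qas:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": model_answer})
    messages.append({"role": "user", "content": target_question})
    return messages

def prompt_input(shallower_qas, target_question):
    """Prompt (Gold) / Prompt (Pred.): shallower QA pairs (gold answers or
    the model's own predictions) serialized into a single prompt."""
    context = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shallower_qas)
    return f"{context}\n\nUsing the above, answer: {target_question}"
```

The multi-turn variant leaves the shallower answers in the interaction history, whereas the prompt variants compress them into a single input turn.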
## Explicitly providing shallower solutions is beneficial for small models and complex questions.
Figure 4 illustrates the depthwise performance changes after incorporating deconstructed question information. Providing shallower questions benefits models with smaller capacities, such as LLaMA 2 7B Chat and Mistral 7B Instruct v0.2. For relatively simpler questions ( D 2 ), the benefit is less pronounced and may even decrease the performance of more capable models (>7B). However, intermediate questions ( D 2 ) are beneficial for complex questions ( D 3 ), except for models with large capacities ( ≥ 56B). These findings align with recent research on decomposing a complex question into simpler sub-tasks and solving the sub-tasks before producing the final answer (Juneja et al., 2023; Khot et al., 2023), which has shown substantial performance improvements on complex problems across model sizes.
Implicitly guiding reasoning via multi-turn interactions best improves performance. When comparing the two prompt-based inputs, smaller models tend to perform better with gold answers (Gold), while more capable models favor their own predictions (Pred.). This preference likely arises because more capable models align better with their own generated outputs, which reflect their advanced internal reasoning processes. The multi-turn approach provides the most stable results across all depths, enhancing the performance of smaller models while causing minimal performance drops for larger models. Additionally, the multi-turn approach improves D 1 performance by providing context or domain information as part of the interaction history.
## 5 Conclusion
In this study, we explore the reasoning capabilities of LLMs by deconstructing real-world questions into a graph. We introduce DEPTHQA, a set of deconstructed D 3 questions mapped into a hierarchical graph, requiring the utilization of multiple layers of knowledge in sequence from D 1 and D 2 to D 3 . This hierarchical approach provides a comprehensive assessment of LLM performance by measuring forward and backward discrepancies between simpler and more complex questions. Our comparative analysis of LLMs with different capacities reveals an inverse relationship between model capacity and discrepancies. Memorization analysis suggests that the sources of forward and backward discrepancies in large models stem from different types of failures. Lastly, we demonstrate that guiding models from shallower to deeper questions through multi-turn interactions stabilizes performance across the majority of models. These findings emphasize the importance of intermediate knowledge extraction in understanding LLM reasoning capabilities.
## Limitations
Small sample size Our dataset, DEPTHQA, consists of 91 complex ( D 3 ) questions from the TutorEval dataset, along with 1,480 derived shallower ( D 2 , D 1 ) questions. Despite the diversity in reasoning types explored (Section 3.4) and the hierarchical structuring of subquestions, the limited number of complex questions and the narrow content scope restrict the generalizability of our findings. The selection of TutorEval as our primary source is based on the challenge of manually developing or even sourcing intricate questions that necessitate advanced reasoning skills; such questions require (1) maintaining real-world relevance, (2) eliciting long-form answers, and (3) having minimal risk of test set contamination. Within TutorEval, complex D 3 questions represent only 33.6% of its 834 questions, which further reduces to 10.9% when excluding questions that require external knowledge retrieval. We encourage future research to build larger, more diverse datasets to more robustly assess knowledge reasoning capabilities of LLMs.
GPT-4 data generation and evaluation All questions except for D 3 and reference answers in DEPTHQA are generated by GPT-4 Turbo. To ensure the quality of these questions, we have established strict decomposition criteria (Section 3.2) and implemented rigorous procedures including detailed instructions, question augmentation, manual rewriting and verification by human annotators (Section 3.3). The reliability of the answers is supported by findings from Chevalier et al. (2024), which demonstrate GPT-4's high accuracy of 92% on TutorEval problems as assessed by human evaluators. However, there may exist inaccuracies due to unseen errors in the decomposition process or unverified knowledge produced by the model.
Furthermore, we utilize GPT-4 Turbo to assess the correctness of model predictions. Following protocols from previous studies (Kim et al., 2024a,b) which highlight GPT-4's strong correlation with human judgments on long-form content, we provide detailed instructions and specific scoring rubrics to the evaluator to ensure that the evaluation process aligns closely with our objectives. In addition, we conduct human evaluations and compare with GPT-4 Turbo evaluations, and measure sufficiently high inter-annotator agreement (Appendix F). Still, the evaluation method is subject to bias inherent in LLM judges.
## Acknowledgement
We thank Hyeonbin Hwang, Sohee Yang, and Sungdong Kim for constructive feedback and discussions. This work was partly supported by the KAIST-NAVER Hypercreative AI Center and an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (RS-2024-00398115, Research on the reliability and coherence of outcomes produced by Generative AI, 30%).
## References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv .
AI@Meta. 2024. Llama 3 model card.
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2023. What learning algorithm is in-context learning? investigations with linear models. In ICLR .
Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.2, knowledge manipulation. arXiv .
Adam A Augustine, Matthias R Mehl, and Randy J Larsen. 2011. A positivity bias in written and spoken english and its moderation by personality and gender. Social Psychological and Personality Science , 2(5):508-515.
Santiago Castro. 2017. Fast Krippendorff: Fast computation of Krippendorff's alpha agreement measure. https://github.com/pln-fing-udelar/fast-krippendorff.
Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. In NeurIPS , pages 18878-18891.
Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, et al. 2024. Language models as science tutors. arXiv .
Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In NeurIPS . Curran Associates, Inc.
Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Findings of ACL .
- Peter Sheridan Dodds, Eric M Clark, Suma Desu, Morgan R Frank, Andrew J Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M Kloumann, James P Bagrow, et al. 2015. Human language reveals a universal positivity bias. Proceedings of the national academy of sciences , 112(8):2389-2394.
- Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. 2024. How to think stepby-step: A mechanistic understanding of chain-ofthought reasoning. TMLR .
- Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang (Lorraine) Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. 2023. Faith and fate: Limits of transformers on compositionality. In NeurIPS .
- Guhao Feng, Yuntian Gu, Bohang Zhang, Haotian Ye, Di He, and Liwei Wang. 2023. Towards revealing the mystery behind chain of thought: a theoretical perspective. NeurIPS .
- Jiahai Feng and Jacob Steinhardt. 2024. How do language models bind entities in context? In ICLR .
- Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. TACL .
- K Hess. 2006. Applying webb's depth-of-knowledge (dok) levels in science. Accessed November , 10.
- Karin Hess, Ben Jones, Dennis Carlock, and John R Walkup. 2009. Cognitive rigor: Blending the strengths of bloom's taxonomy and webb's depth of knowledge to enhance classroom-level processes. ERIC Document (Online Database) .
- Yifan Hou, Jiaoda Li, Yu Fei, Alessandro Stolfo, Wangchunshu Zhou, Guangtao Zeng, Antoine Bosselut, and Mrinmaya Sachan. 2023a. Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. In EMNLP .
- Yifan Hou, Jiaoda Li, Yu Fei, Alessandro Stolfo, Wangchunshu Zhou, Guangtao Zeng, Antoine Bosselut, and Mrinmaya Sachan. 2023b. Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. In EMNLP .
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv .
- Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv .
- Gurusha Juneja, Subhabrata Dutta, Soumen Chakrabarti, Sunny Manchanda, and Tanmoy Chakraborty. 2023. Small language models fine-tuned to coordinate larger language models improve complex reasoning. In EMNLP .
- Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed prompting: A modular approach for solving complex tasks. In ICLR .
- Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. 2024a. Prometheus: Inducing evaluation capability in language models. In ICLR .
- Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024b. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv .
- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS .
- Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology . Sage publications.
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In SOSP .
- Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. 2024. Prometheusvision: Vision-language model as a judge for finegrained evaluation. arXiv .
- Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. Chain of thought empowers transformers to solve inherently serial problems. In ICLR .
- Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, and Asli Celikyilmaz. 2023. Crystal: Introspective reasoners reinforced with selffeedback. In EMNLP .
- Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress measures for grokking via mechanistic interpretability. In ICLR .
- Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv .
- OpenAI. 2022. Chatgpt: Optimizing language models for dialogue.
- Henry Papadatos and Rachel Freedman. 2023. Your llm judge may be biased. https://www.lesswrong.com/posts/ S4aGGF2cWi5dHtJab/your-llm-judge-may-be-biased. Accessed: 2023-06-14.
- Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of EMNLP .
- Ben Prystawski, Michael Y. Li, and Noah Goodman. 2023. Why think step by step? reasoning emerges from the locality of experience. In NeurIPS .
- Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? In EMNLP .
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP .
- Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. TACL .
- Abulhair Saparov and He He. 2023. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In ICLR .
- Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv .
- Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. TMLR .
- Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2023. Recitation-augmented language models. In ICLR .
- Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics-on what language model pre-training captures. TACL .
- Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv .
- Johannes von Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. In ICML .
- Boshi Wang, Xiang Deng, and Huan Sun. 2022. Iteratively prompt pre-trained language models for chain of thought. In EMNLP .
- Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2023a. Towards understanding chain-of-thought prompting: An empirical study of what matters. In ACL .
- Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. 2024. Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization. arXiv .
- Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In ICLR .
- Norman L Webb. 1997. Criteria for alignment of expectations and assessments in mathematics and science education. Research Monograph No. 6.
- Norman L Webb. 1999. Alignment of science and mathematics standards and assessments in four states. Research Monograph No. 18.
- Norman L Webb. 2002. Depth-of-knowledge levels for four content areas. Language Arts , 28(March):1-9.
- Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. TMLR .
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS .
- Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. 2024. Do large language models latently perform multi-hop reasoning? arXiv .
- Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2024. FLASK: Fine-grained language model evaluation based on alignment skill sets. In ICLR .
- Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jianyun Nie, and Ji-Rong Wen. 2023. A survey of large language models. arXiv .
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS , 36.
- Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In ICLR .
## A Details in Dataset Construction
Classifying questions based on depth of knowledge To categorize questions from the TutorEval dataset (Chevalier et al., 2024), we use GPT-4 Turbo with a temperature of 0.7, following the prompt detailed in Table 19. We evaluate the model's classification accuracy on a validation set of 50 questions that we previously annotated with their respective depth-of-knowledge levels. Our optimal prompting strategy incorporates the key points provided with each question in the original dataset and instructs the model to explain its classification reasoning step by step. This approach achieves a precision of 0.67 and a recall of 0.77, with a low rate of false positives. Analysis of the entire set of 834 questions reveals the distribution of depth levels: 23.3% at D 1 , 43% at D 2 , 33.6% at D 3 , and only one question at D 4 .
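As a reference for how such validation figures can be computed, below is a minimal per-label precision/recall sketch; the depth labels are toy values, not the actual annotations:

```python
def per_label_precision_recall(gold, pred, label):
    """Precision and recall of predicted depth labels for one target label."""
    tp = sum(1 for g, p in zip(gold, pred) if p == label and g == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels standing in for the 50-question validation set
gold = ["D1", "D2", "D2", "D3", "D3", "D1"]
pred = ["D1", "D2", "D3", "D3", "D2", "D1"]
p, r = per_label_precision_recall(gold, pred, "D3")
# p = 0.5 (1 of 2 predicted D3 correct), r = 0.5 (1 of 2 gold D3 found)
```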
D 3 question filtering and disambiguation From the 280 D 3 questions initially identified, we manually exclude questions that are not self-contained, meaning they refer to specific contexts or excerpts in textbook passages that cannot be seamlessly integrated into our input. Examples include questions like, 'I don't understand the point of Theorems 4.3.2 and 4.3.3. Why do we care about these statements?' and 'Please tell me the common conceptual points between the Weinrich and Wise 1928 study and the Roland et al. 1980 paper.' Additionally, we disambiguate questions to ensure clarity and context accuracy. For example, the question 'Why is branching unstructured? And is it a bad design choice?' was initially vague about its reference to 'branching.' Upon review, we identify the context as computer programming rather than database systems and revise the question to: 'In the context of computer programming, why is branching considered unstructured, and is it considered a poor design choice?'.
Question deduplication and augmentation As explained in Section 3.3, we leverage the cosine similarity of question embeddings produced by a Sentence Transformers embedding model 10 (Reimers and Gurevych, 2019) to identify near-duplicate questions. Within the same depth (D 1 or D 2 ), we apply a similarity threshold of 0.9 to identify and eliminate duplicates. For questions across D 1 and D 2 , we remove D 2 questions with a similarity score ≥ 0.9 to D 1 questions and remove D 1 questions with a similarity score between 0.8 and 0.9 to D 2 questions, under the rationale that such a D 2 question is effectively at the D 1 level and vice versa. The similarity thresholds and the candidates to remove were determined through qualitative analysis. We find that the false positive rate of D 1 questions misclassified as D 2 decreases from 60% to 7% (a reduction rate of 88%). Table 4 compares top-1 similar question pairs between D 2 and D 1 before and after deduplication. We further use GPT-4 Turbo with temperature 0.7 (prompts in Tables 25 and 26) to augment questions filling gaps in knowledge. Table 5 illustrates how an augmented question successfully replaces a previously redundant question, enhancing comprehensiveness.

10 sentence-transformers/all-mpnet-base-v2

Top-1 before deduplication (similarity = 0.97)
D 2 : How do you calculate the determinant of a matrix?
D 1 : How do you find the determinant of a matrix?
Top-1 after deduplication (similarity = 0.93)
D 2 : What does it mean for two vectors to be orthogonal, and how can you verify this property?
D 1 : What does it mean for two vectors to be orthogonal?
Table 4: Top-1 similar question pairs between D 2 and D 1 before and after the deduplication and augmentation process. While the pair above shares essentially the same depth of knowledge, the pair below differs substantially in knowledge depth because the D 2 question asks about additional procedures.

Describe how division and remainders work when considering congruence modulo a number.
1. What is the result of a division called?
2. How is a remainder defined in division?
3. What does it mean for two numbers to be congruent modulo a number?
4. What does the term 'congruence modulo a number' mean? ⇒ What is the modulo operation in mathematics?
Table 5: The original 4th shallower question (red) asks for redundant knowledge already addressed in the 3rd question. We remove the duplicate question and replace it with a question asking about a different concept (blue).
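The cross-depth deduplication described above can be sketched with plain cosine similarity over precomputed embeddings; the toy vectors below stand in for the Sentence Transformers outputs, and the thresholds follow the text:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def cross_depth_dedup(d2_embs, d1_embs, hi=0.9, lo=0.8):
    """Indices of D2 questions to drop (sim >= hi to some D1 question)
    and of D1 questions to drop (lo <= sim < hi to some D2 question)."""
    sims = cosine_sim_matrix(d2_embs, d1_embs)
    drop_d2 = {i for i, j in zip(*np.where(sims >= hi))}
    drop_d1 = {j for i, j in zip(*np.where((sims >= lo) & (sims < hi)))}
    return drop_d2, drop_d1

# Toy embeddings: one D2 question vs. three D1 questions
d2_embs = np.array([[1.0, 0.0]])
d1_embs = np.array([[0.99, 0.01], [0.85, 0.53], [0.0, 1.0]])
drop_d2, drop_d1 = cross_depth_dedup(d2_embs, d1_embs)
# drop_d2 = {0} (near-duplicate of the first D1 question), drop_d1 = {1}
```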
Motivation of question debiasing In our preliminary study, we found that models tend to favor 'yes' over 'no' at the beginning of the response to a question that can be answered in binary format, as exemplified in Table 6. We recognize that the inherent positivity bias in models (Augustine et al., 2011; Dodds et al., 2015; Papadatos and Freedman, 2023) has the potential to skew the model's reasoning process and consequently obscure a true evaluation of its capability to reason and articulate nuanced thoughts. To mitigate this, we debias problematic questions by reframing them into more exploratory inquiries. Example transformations are in Table 7.

Original question:
Being congruent modulo an ideal sounds a bit like two numbers being congruent modulo another number. Is there a connection between these two notions? ⇒ Yes
Predicted answer:
Yes, there is a connection between the two notions. In mathematics, when we say that two integers are congruent modulo n, it means that they have the same remainder when divided by n. For example, 3 and 6 are congruent modulo 3 because when we divide 3 by 3, the remainder is 0, and when we divide 6 by 3, the remainder is also 0. [...]
Flipped question:
Are the concepts of being congruent modulo an ideal and two numbers being congruent modulo another number entirely distinct with no connection? ⇒ No
Predicted answer:
Yes, the concepts of being congruent modulo an ideal and two numbers being congruent modulo another number are distinct. When we say that two numbers a and b are congruent modulo n, we mean that their difference a - b is a multiple of n. This is written as a ≡ b (mod n). [...]
Table 6: Example of Mistral 7B Instruct v0.2 favoring affirmative responses over negative responses when the knowledge required is consistent but only the question format is flipped.

Are there problems that one can use standard induction to prove but cannot use strong induction to prove? ⇒ What kind of problems can be proven using standard induction but not strong induction?
If I understand correctly, adding sine functions always results in a new sine function? ⇒ Clarify my understanding that adding sine functions always results in a new sine function.
Can a linear transformation map all points of a vector space to a single point, and under what conditions does this occur? ⇒ Describe the possibility of a linear transformation mapping all points of a vector space to a single point. Under what conditions does this occur?
Table 7: Example conversions of a binary question into a non-binary question.
## B Human Verification on Data Quality
Two of the authors and one graduate student who volunteered annotate 27.5% of DEPTHQA, verifying the three criteria defined in Section 3.2: Comprehensiveness (C1), Implicitness (C2), and Non-binary questioning (C3). Comprehensiveness and Implicitness are especially crucial for sub-questions to preserve the hierarchy of the reasoning process: Comprehensiveness ensures no critical knowledge gaps arise with increasing depth, while Implicitness ensures that no straightforward clues are given, encouraging implicit reasoning between sub-questions.
To set up the procedure, we randomly sample 5 D 3 questions from each of the 5 domains in our dataset and use all questions derived from the selected D 3 questions, totaling 25 D 3 , 100 D 2 , and 396 D 1 questions. Given 25 D 3 → D 2 and 100 D 2 → D 1 relations, the relations are divided into sets of 40, 40, and 45 and assigned to the three workers. For each relation, the main question and its sub-questions (predecessors) are provided along with their gold answers. The labeler is then asked to check whether the relation is conceptually comprehensive and whether each question is implicit and non-binary. The labeler can choose from three degrees of comprehensiveness and implicitness due to the subjective nature of these criteria. The annotation interface is shown in Figure 7.

Table 8: Human annotation on Comprehensiveness of a subset of DEPTHQA question relations.
| C1. Comprehensiveness | D 3 → D 2 | D 3 → D 2 | D 2 → D 1 | D 2 → D 1 |
|-------------------------|-------------|-------------|-------------|-------------|
| | Count | % | Count | % |
| Comprehensive | 22 | 88.0 | 79 | 79.0 |
| Partially comprehensive | 3 | 12.0 | 18 | 18.0 |
| Insufficient | 0 | 0.0 | 3 | 3.0 |

Table 9: Human annotation on Implicitness of a subset of DEPTHQA sub-questions.
| C2. Implicitness | D 2 | D 2 | D 1 | D 1 |
|--------------------|-------|-------|-------|-------|
| | Count | % | Count | % |
| Fully implicit | 87 | 87.0 | 364 | 91.9 |
| Partially implicit | 13 | 13.0 | 31 | 7.8 |
| Explicit | 0 | 0.0 | 1 | 0.3 |
Tables 8, 9, and 10 report the annotation statistics. Table 8 shows that the decompositions into shallower questions are fully comprehensive (C1) in 88.0% of D 3 → D 2 relations and 79.0% of D 2 → D 1 relations, reaching 100% and 97%, respectively, when partially comprehensive relations are also counted. Also, Table 9 shows that 87.0% of D 2 and 91.9% of D 1 questions do not hint at solutions for more complex questions (C2), with similarly low failure rates. We also find in Table 10 that
Table 10: Human annotation on Non-binary questioning of a subset of DEPTHQA sub-questions.
| C3. Non-binary Questioning | D 3 | D 3 | D 2 | D 2 | D 1 | D 1 |
|------------------------------|-------|-------|-------|-------|-------|-------|
| | Count | % | Count | % | Count | % |
| Open-ended | 24 | 96.0 | 100 | 100.0 | 396 | 100.0 |
| Binary | 1 | 4.0 | 0 | 0.0 | 0 | 0.0 |
Table 11: Distribution of reasoning types for D 3 and D 2 in a subset of DEPTHQA. Multiple reasoning types can be included in one instance.
| Reasoning Type | Depth 3 | Depth 3 | Depth 2 | Depth 2 |
|----------------------|-----------|-----------|-----------|-----------|
| | Count | % | Count | % |
| Comparative | 12 | 21.1 | 19 | 11.6 |
| Relational | 10 | 17.5 | 37 | 22.6 |
| Causal | 6 | 10.5 | 19 | 11.6 |
| Inductive | 5 | 8.8 | 6 | 3.7 |
| Criteria Development | 5 | 8.8 | 13 | 7.9 |
| Procedural | 4 | 7.0 | 22 | 13.4 |
| Evaluative | 4 | 7.0 | 12 | 7.3 |
| Example | 2 | 3.5 | 8 | 4.9 |
| Quantitative | 2 | 3.5 | 6 | 3.7 |
| Application | 2 | 3.5 | 19 | 11.6 |
| Other | 5 | 8.8 | 3 | 1.8 |
| Total | 57 | 100 | 164 | 100 |
nearly all questions require open-ended answers (C3). The human verification data provides evidence that our synthetically generated edges in the graph adequately represent the reasoning process.
## C Dataset License
The TutorEval (Chevalier et al., 2024) dataset from which we source complex questions has not yet disclosed its license. Our DEPTHQA is subject to OpenAI's Terms of Use for the generated data. We will state that our dataset is intended for research use when releasing it to the public.
## D Reasoning Type Analysis
In Table 11, we report the distribution of reasoning types annotated by the authors on a sample of 20 D 3 questions and the D 2 and D 1 questions related to them. Table 15 outlines the definition of each reasoning type and a representative set of example questions that best elicit such reasoning. We provide question deconstruction examples in Table 16 and Table 17, each showcasing distinct reasoning types and knowledge.
## E Details in Main Experiments
## E.1 Model Inference
To run inference with the LLMs in our experimental setup (Section 4.1), we access models through a standardized API from OpenRouter 11 and call model generations through the complementary LiteLLM 12 interface. An exception is LLaMA 2 7B Chat, which is not hosted on OpenRouter; for this model we use the HuggingFace checkpoint and the vLLM (Kwon et al., 2023) inference engine, performing local inference with mixed precision on one NVIDIA A6000 40GB GPU. We use the default sampling parameters suited to each model. The specific prompt templates used to induce reasoning paths are organized in Appendix J.2. Inference on a full pass of DEPTHQA finishes within 10 minutes. We report single-run results.
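The OpenRouter-via-LiteLLM setup can be sketched as follows. The model slug and prompt wording below are illustrative placeholders, not the exact ones used in the paper (the real templates are in Appendix J.2):

```python
def build_zero_shot_messages(question):
    """Assemble a single-turn chat request for one DEPTHQA question."""
    return [
        {"role": "system",
         "content": "Answer the question accurately, explaining your reasoning."},
        {"role": "user", "content": question},
    ]

# Hedged sketch of the actual call (requires `pip install litellm` and an
# OPENROUTER_API_KEY); the model slug is a hypothetical example.
# from litellm import completion
# response = completion(
#     model="openrouter/mistralai/mistral-7b-instruct",
#     messages=build_zero_shot_messages(
#         "What is the modulo operation in mathematics?"),
# )
# answer = response.choices[0].message.content
```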
## E.2 LLM-as-a-Judge Evaluation
When prompting GPT-4 Turbo to evaluate model responses, we use a temperature of 1.0, nucleus sampling with top_p of 0.9, and a maximum of 1,024 generated tokens, following previous works (Ye et al., 2024; Kim et al., 2024a,b; Lee et al., 2024). The prompt template, including the score rubric, is in Table 30. We report single-run results. See Table 18 for an example output format. Unlike prior works that emphasize the use of instance-specific scoring rubrics (Kim et al., 2024a,b; Lee et al., 2024), our initial experiments comparing a common rubric against instance-specific rubrics showed that instance-specific rubrics increase noise and decrease the quality of evaluation. We speculate that this is because our evaluation focuses on a common factor of factual correctness, i.e., whether the model accurately uses knowledge in the reasoning process, unlike conventional benchmark evaluations.
## F Reliability of LLM-as-a-Judge
To assess the reliability of LLM evaluations in our analysis, we conduct a human evaluation of LLM responses and calculate the agreement between annotations. We randomly sample 20 model responses for each score level (1 to 5) as evaluated by GPT-4 Turbo, with the question and responding model also chosen at random. Two of the authors and one graduate student who volunteered evaluate 46, 46, and 48
11 openrouter.ai
12 litellm.vercel.app/docs/providers/openrouter
| Model | Average Forward Discrepancy | Average Forward Discrepancy | Average Forward Discrepancy | Value | Value | Value | Frequency (%) | Frequency (%) | Frequency (%) |
|----------------------------|-------------------------------|-------------------------------|-------------------------------|-----------|-----------|---------|-----------------|-----------------|-----------------|
| | D 2 → D 3 | D 1 → D 2 | Overall | D 2 → D 3 | D 1 → D 2 | Overall | D 2 → D 3 | D 1 → D 2 | Overall |
| LLaMA 2 7B Chat | 0.1304 | 0.1814 | 0.1756 | 0.2708 | 0.2683 | 0.2685 | 48.15 | 67.62 | 65.40 |
| LLaMA 2 13B Chat | 0.1524 | 0.1582 | 0.1573 | 0.2572 | 0.2720 | 0.2697 | 59.26 | 58.14 | 58.31 |
| LLAMA 2 70B Chat | 0.1259 | 0.1361 | 0.1344 | 0.2633 | 0.2490 | 0.2512 | 47.83 | 54.68 | 53.50 |
| Mistral 7B Instruct v0.2 | 0.0920 | 0.1569 | 0.1474 | 0.2031 | 0.2294 | 0.2267 | 45.28 | 68.39 | 65.01 |
| Mixtral 8x7B Instruct v0.1 | 0.0868 | 0.0791 | 0.0806 | 0.1844 | 0.2058 | 0.2009 | 47.06 | 38.46 | 40.14 |
| Llama 3 8B Instruct | 0.0831 | 0.0957 | 0.0934 | 0.2225 | 0.2258 | 0.2253 | 37.33 | 42.38 | 41.44 |
| Llama3 70B Instruct | 0.0653 | 0.0497 | 0.0528 | 0.2176 | 0.2211 | 0.2202 | 30.00 | 22.47 | 23.99 |
| GPT-3.5 Turbo | 0.1002 | 0.0722 | 0.0779 | 0.1608 | 0.1369 | 0.1424 | 62.35 | 52.73 | 54.70 |
Table 12: Average intensity and frequency of forward discrepancy.
| Model | Average Backward Discrepancy | Average Backward Discrepancy | Average Backward Discrepancy | Value | Value | Value | Frequency (%) | Frequency (%) | Frequency (%) |
|----------------------------|--------------------------------|--------------------------------|--------------------------------|-----------|-----------|---------|-----------------|-----------------|-----------------|
| | D 2 → D 3 | D 1 → D 2 | Overall | D 2 → D 3 | D 1 → D 2 | Overall | D 2 → D 3 | D 1 → D 2 | Overall |
| LLaMA 2 7B Chat | 0.2193 | 0.1104 | 0.1342 | 0.3827 | 0.3589 | 0.3671 | 57.31 | 30.77 | 36.57 |
| LLaMA 2 13B Chat | 0.1255 | 0.0782 | 0.0879 | 0.3846 | 0.3339 | 0.3473 | 32.64 | 23.43 | 25.32 |
| LLAMA 2 70B Chat | 0.1363 | 0.0632 | 0.0787 | 0.3811 | 0.3258 | 0.3442 | 35.76 | 19.40 | 22.88 |
| Mistral 7B Instruct v0.2 | 0.1442 | 0.0700 | 0.0881 | 0.3488 | 0.3071 | 0.3225 | 41.33 | 22.81 | 27.31 |
| Mixtral 8x7B Instruct v0.1 | 0.0627 | 0.0635 | 0.0633 | 0.2979 | 0.2728 | 0.2781 | 21.04 | 23.27 | 22.76 |
| Llama 3 8B Instruct | 0.0878 | 0.0717 | 0.0752 | 0.3500 | 0.3141 | 0.3227 | 25.08 | 22.82 | 23.32 |
| Llama3 70B Instruct | 0.0427 | 0.0442 | 0.0438 | 0.2778 | 0.2692 | 0.2710 | 15.38 | 16.41 | 16.18 |
| GPT-3.5 Turbo | 0.0457 | 0.0672 | 0.0626 | 0.2892 | 0.2602 | 0.2644 | 15.79 | 25.81 | 23.68 |
Table 13: Average intensity and frequency of backward discrepancy.
| Question depth | Human-Human | Human-GPT-4 |
|------------------|-----------------|------------------|
| D 3 | 0.4848 (n = 3) | 0.7064 (n = 13) |
| D 2 | 0.6464 (n = 6) | 0.7730 (n = 32) |
| D 1 | 0.5671 (n = 11) | 0.7969 (n = 55) |
| Overall | 0.5730 (n = 20) | 0.7797 (n = 100) |
Table 14: Krippendorff's Alpha between human-human and human-GPT-4 ratings on model responses to DEPTHQA questions. For human-GPT agreement, the scores of predictions rated by the three human raters are averaged. The number of responses in each measurement is reported next to each Krippendorff's Alpha value.
unique responses, respectively, and all three workers label the remaining 20 responses set aside for inter-annotator agreement. The human raters are shown one instance at a time and individually score it on a scale of 1 to 5, under the exact setting of our LLM-as-a-Judge experiments. The evaluation interface is shown in Figure 8. Following Ye et al. (2024), we measure Krippendorff's Alpha (Krippendorff, 2018; Castro, 2017) with an ordinal metric to assess reliability between the three human raters and between humans and GPT-4 Turbo.
Table 14 reports the agreement results. The human-GPT agreement is substantially high, approaching 0.80, the commonly accepted reliability threshold (Krippendorff, 2018). While the sample size is smaller, there is also moderate human-human agreement. This implies that the individual absolute rating scheme is effective and that GPT-4 Turbo evaluations are aligned with human judgments in our setting.
## G Discrepancy Results
To separately observe how frequently each discrepancy occurs and how intense it is when it happens, Table 12 and Table 13 show the average intensity and frequency of forward and backward discrepancies, respectively. Note that the average discrepancy is calculated as the product of the intensity value and the frequency. Overall, forward discrepancies appear more frequently, although their intensity is relatively low (between 0.14 and 0.26). In contrast, backward discrepancies appear in fewer than 25% of cases, except for LLaMA 2 7B Chat, but exhibit high intensity when they do occur (between 0.26 and 0.37).
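The value × frequency decomposition can be checked directly: the overall average discrepancy is the mean over all question pairs, which factors into the mean over nonzero pairs (intensity) times the fraction of nonzero pairs (frequency). For example, Table 12's LLaMA 2 7B Chat overall row satisfies 0.2685 × 0.6540 ≈ 0.1756. A minimal sketch with toy per-pair values:

```python
def summarize_discrepancies(per_pair):
    """Split the mean discrepancy into intensity (mean over nonzero pairs)
    and frequency (fraction of pairs where a discrepancy occurs)."""
    nonzero = [d for d in per_pair if d > 0]
    frequency = len(nonzero) / len(per_pair)
    intensity = sum(nonzero) / len(nonzero) if nonzero else 0.0
    average = intensity * frequency  # equals sum(per_pair) / len(per_pair)
    return intensity, frequency, average

# Toy example: 4 question pairs, a discrepancy on 2 of them
intensity, frequency, average = summarize_discrepancies([0.0, 0.4, 0.0, 0.2])
# intensity = 0.3, frequency = 0.5, average = 0.15 (the mean of the list)
```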
## H Overall Results with Min-K% Probability
## H.1 Depthwise Min-K% Prob.
In Figure 5, we plot the Min-K% probability of LLaMA 2 13B Chat, Mistral 7B Instruct, and Mixtral 8x7B Instruct. Similar to Figure 3, D 3 shows the highest average Min-K% probability across all three models, indicating the least memorization.
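Min-K% Prob (Shi et al., 2023) is computed from the k% lowest token log-probabilities of a sequence. The sketch below follows this paper's plotting convention of reporting the negative average, so that lower scores indicate more likely memorization; the token log-probabilities are toy values:

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Average negative log-probability of the k% least-likely tokens.
    Lower scores suggest the sequence was more likely seen in training."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% most "surprising" tokens
    return -sum(lowest) / n

# Toy log-probs for a 10-token sequence; k=0.2 keeps the 2 lowest
score = min_k_percent_score(
    [-0.1, -0.2, -3.0, -0.5, -0.4, -2.5, -0.3, -0.2, -0.1, -0.6])
# 2 lowest are -3.0 and -2.5, so score = 2.75
```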
## H.2 Score Gap within Neighboring Questions
Figure 6 presents the KDE plot of the factual accuracy gap between q 3 and q 2 for q 3 instances whose
Figure 5: Average Min-K% probability at each depth. Lower values indicate more memorization while higher values indicate less memorization.
<details>
<summary>Image 5 Details</summary>


### Visual Description
Box plots of depthwise average Min-K% probability (y-axis 0-8) at Depths 1-3 for (a) LLaMA 2 13B Chat, (b) Mistral 7B Instruct, and (c) Mixtral 8x7B Instruct. In all three models, the median rises from Depth 1 to Depth 3, with outliers present at every depth.
</details>
Figure 6: Factual accuracy difference between neighboring q 3 and q 2 in bottom 25% and top 75% quantiles. Positive gap indicates backward discrepancy and negative gap represents forward discrepancy.
<details>
<summary>Image 6 Details</summary>


### Visual Description
KDE plots of the score gap (D 3 - D 2), x-axis from -1.5 to 1.5, for instances in the bottom 25% (green) and top 75% (orange) Min-K% quantiles, shown for LLaMA 2 13B Chat, Llama 3 8B Instruct, and Mixtral 8x7B Instruct. The top-75% distributions are shifted slightly toward positive gaps relative to the bottom-25% distributions.
</details>
Min-K% probability is in the bottom 25% and top 75%. A positive gap represents higher factual accuracy for q 3 , indicating backward discrepancy. In contrast, a negative gap represents forward discrepancy.
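The quantile split behind this plot can be sketched with NumPy; the per-instance scores below are toy values:

```python
import numpy as np

def gaps_by_quantile(min_k_scores, q3_scores, q2_scores, q=0.25):
    """Score gaps (q3 - q2) split by a quantile cutoff on Min-K% probability.
    Positive gaps suggest backward discrepancy; negative gaps, forward."""
    min_k_scores = np.asarray(min_k_scores, dtype=float)
    gaps = np.asarray(q3_scores, dtype=float) - np.asarray(q2_scores, dtype=float)
    cutoff = np.quantile(min_k_scores, q)
    return gaps[min_k_scores <= cutoff], gaps[min_k_scores > cutoff]

# Toy per-instance Min-K% scores and factual accuracy at D3 vs. D2
bottom, top = gaps_by_quantile(
    [1, 2, 3, 4], [0.8, 0.6, 0.9, 0.5], [0.7, 0.7, 0.6, 0.9])
```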
## I Backward Discrepancy Examples
Different backward discrepancy examples are shown in Examples 1, 2, and 3.
## J Prompts
## J.1 Data construction
We provide the prompts used to classify TutorEval questions (Table 19), generate D 3 answers (Table 20), generate D 2 or D 1 answers (Table 21), generate questions at D 2 (Table 23) and D 1 (Table 24), and augment questions at D 2 (Table 25) and D 1 (Table 26). For generating or augmenting any question at D 2 or D 1 , we use the same system prompt (Table 22), which describes the definitions of the depths of knowledge.
## J.2 Inference
We provide the prompts used for zero-shot (Table 27), Prompt (Gold) and Prompt (Pred.) (Table 28), and multi-turn (Table 29) inference.
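Multi-turn inference feeds the model's answers to shallower questions back as context before asking the deeper question. A hedged sketch of the message assembly (the prompt wording is illustrative; the actual template is in Table 29):

```python
def build_multi_turn_messages(sub_questions, sub_answers, final_question):
    """Chat history walking from D1/D2 sub-questions up to the D3 question."""
    assert len(sub_questions) == len(sub_answers)
    messages = [{"role": "system",
                 "content": "Answer each question accurately and concisely."}]
    for q, a in zip(sub_questions, sub_answers):
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    # The deepest question is asked last, conditioned on the prior turns
    messages.append({"role": "user", "content": final_question})
    return messages
```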
## J.3 Evaluation
The prompt used for LLM-as-a-Judge evaluation is in Table 30.
Figure 7: Interface for human annotators to check if Comprehensiveness (C1), Implicitness (C2), and Non-binary questioning (C3) hold between a question and its sub-questions in DEPTHQA.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Evaluation Form: Anhedonia and Schizophrenia
### Overview
The image presents an evaluation form designed for assessing the quality of questions related to anhedonia and schizophrenia. It includes general instructions, criteria descriptions, a main question, sub-questions, and evaluation sections for comprehensiveness, implicitness, and non-binary questioning.
### Components/Axes
**Left Side:**
* **General Instructions:** Provides guidance to the judge/evaluator.
* **Criteria Descriptions:** Defines the evaluation criteria (Comprehensiveness, Implicitness, Non-binary Questioning).
* **Main Question (Higher complexity):** Presents the main question regarding anhedonia and its relation to depression and schizophrenia.
* **Sub-questions (Derived, lower complexity):** Lists sub-questions related to the main question.
**Right Side:**
* **Evaluation:** Provides instructions for evaluating each criterion independently.
* **C1. Comprehensiveness:** Assesses how well the sub-questions cover the foundational concepts necessary to answer the main question.
* Scale: Insufficient, Partial, Comprehensive (Comprehensive is selected)
* **C2. Implicitness:** Assesses how well each lower-level question encourages independent reasoning without providing obvious clues to the higher-level question's answer.
* Scale: Explicit, Partially Implicit, Fully Implicit (Fully Implicit is selected for all sub-questions)
* **C3. Non-binary Questioning:** Evaluates whether each question encourages detailed explanations and avoids binary (yes/no) responses.
* Scale: Binary, Open-ended (Open-ended is selected for all questions)
### Detailed Analysis
**General Instructions:**
* The evaluator is instructed to provide clear, objective feedback based on specific criteria.
* If a question field is named 'undefined', it does not need to be checked.
**Criteria Descriptions:**
* **C1. Comprehensiveness:** Assesses whether lower-level questions cover foundational concepts.
* **C2. Implicitness:** Evaluates whether lower-level questions avoid directly revealing answers or heavily hinting at solutions.
* **C3. Non-binary Questioning:** Assesses whether questions elicit detailed, exploratory responses instead of simple yes/no answers.
**Main Question (Higher complexity):**
* "[Main] Anhedonia seems to be a common feature of both depression and schizophrenia. Explain whether it is valid to state that schizophrenic people have depression."
* "See Answer"
**Sub-questions (Derived, lower complexity):**
* "[Sub-1] How can anhedonia be observed or identified in a clinical setting?"
* "[Sub-2] What are the primary diagnostic criteria for schizophrenia and how do they differ from those of depression?"
* "[Sub-3] Explain the presence of symptoms that are common to more than one mental health disorder, if possible, and how this is addressed in diagnosis."
* "[Sub-4] What role does symptom overlap play in the diagnosis and treatment of mental health disorders?"
* All sub-questions have a "See Answer" link.
**Evaluation:**
* **C1. Comprehensiveness:**
* "Evaluate how well the entire set of sub-questions covers all the foundational concepts necessary to answer the main question. Consider whether any crucial aspects are missing when looking at all sub-questions together."
* The "Comprehensive" option is selected.
* **C2. Implicitness:**
* "Assess how well each lower-level question encourages independent reasoning without providing obvious clues to the higher-level question's answer."
* For all sub-questions ([Sub-1] to [Sub-4]), the "Fully Implicit" option is selected.
* **C3. Non-binary Questioning:**
* "Evaluate whether each question encourages detailed explanations and avoids binary (yes/no) responses."
* For all questions (Main and [Sub-1] to [Sub-4]), the "Open-ended" option is selected.
### Key Observations
* The evaluation form is structured to assess the quality of questions related to anhedonia and schizophrenia.
* The evaluation criteria focus on comprehensiveness, implicitness, and the nature of the questions (non-binary).
* The evaluator has deemed the sub-questions comprehensive, fully implicit, and open-ended.
### Interpretation
The evaluation form suggests a structured approach to assessing the quality of questions related to complex topics like anhedonia and schizophrenia. The selection of "Comprehensive" for comprehensiveness indicates that the sub-questions are considered to adequately cover the foundational concepts. The selection of "Fully Implicit" for implicitness suggests that the sub-questions do not provide obvious clues, encouraging independent reasoning. The selection of "Open-ended" for non-binary questioning indicates that the questions are designed to elicit detailed, exploratory responses rather than simple yes/no answers. This suggests that the questions are well-designed to promote deeper understanding and critical thinking about the topic.
</details>
Figure 8: Interface for human evaluators to evaluate an LLM's response on a question from DEPTHQA. The rubric shown is a simplified form of the actual factual accuracy rubric used in LLM evaluations.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Form: Evaluation of Response
### Overview
The image presents a form for evaluating a response related to the concept of visual attention. It includes instructions for the evaluator, the question and the response to be evaluated, a reference answer, and a section for submitting the evaluation. The evaluation section asks the user to rate the correctness, accuracy, and factuality of the response on a scale of 1 to 5.
### Components/Axes
**General Instructions:**
* Instructs the user to provide clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.
**Question and Response:**
* **Question:** "What does it mean for attention to be visual?"
* **Response to Evaluate:** A detailed explanation of visual attention, including its focus on visual stimuli, its distinction from other types of attention, and examples of visual attention.
**Reference Answer (Score 5):**
* Provides a reference answer defining visual attention as the brain's ability to selectively process visual information while ignoring irrelevant details.
**Evaluation:**
* **Instructions:** "When evaluating, consider the criterion independently. Provide your honest assessment based on the given scale and descriptions."
* **Question:** "Is the response correct, accurate, and factual?"
* **Scale:**
1. Largely incorrect, inaccurate, and not factual
2. Partially correct but with significant inaccuracies
3. Generally correct with minor inaccuracies
4. Mostly correct, accurate, and factual
5. Consistently correct, accurate, and entirely factual
* A checkbox is selected for option 5.
* (Optional) A text box is provided for detailed feedback.
**Submit Evaluation:**
* Instructs the user to carefully consider their evaluation before submitting.
### Detailed Analysis
**Response to Evaluate:**
The response defines visual attention as being focused on visual stimuli and processed through the visual system of the brain. It involves detection, orientation, and processing of visual information. It distinguishes visual attention from other types of attention (auditory, tactile, cognitive). Examples of visual attention are provided: scanning a room, focusing on a detail, recognizing a face, following a moving object.
**Reference Answer (Score 5):**
The reference answer defines visual attention as the brain's ability to selectively process visual information while ignoring irrelevant details. It involves focusing awareness on specific elements in the visual field.
**Evaluation:**
The evaluator has selected option 5, indicating that the response is "Consistently correct, accurate, and entirely factual."
### Key Observations
* The response to be evaluated provides a comprehensive explanation of visual attention.
* The reference answer provides a concise definition of visual attention.
* The evaluator has judged the response to be consistently correct, accurate, and entirely factual.
### Interpretation
The form is designed to assess the quality of a response regarding the concept of visual attention. The evaluator, acting as a "fair judge assistant," is guided by specific criteria to provide objective feedback. The selection of option 5 suggests that the response aligns well with the reference answer and demonstrates a thorough understanding of visual attention. The form facilitates a structured evaluation process, ensuring that feedback is both informed and consistent.
</details>
Table 15: Reasoning type explanation and examples. D 3 , D 2 , and D 1 questions are denoted as D3, D2, and D1, respectively.
| Reasoning Type | Explanation | Example |
|------------------------|-------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Comparative | Compare two or more concepts, identifying similarities and differences. | D2 How do neutrinos differ from other subatomic particles, and why are they considered potential candidates for dark matter? D1 What are neutrinos? D1 What are subatomic particles? D1 What is dark matter? |
| Relational | Specify and explain relationships and understand how different concepts are connected organically. | D2 Describe how eco-efficient urban planning can address the challenges of rapid urbanization in developing countries. D1 What is urbanization? D1 What does eco-efficient mean? D1 What are common challenges faced by rapidly urbanizing cities in developing countries? D1 What is urban planning? |
| Causal | Identify cause-and-effect relationships. | D2 Explain how the bending of stereocilia on hair cells leads to the depolarization of these cells. D1 What are stereocilia? D1 What is depolarization? D1 Where are hair cells located? D1 What is the function of hair cells in the ear? |
| Inductive | Make broad generalizations from specific observations and/or formulate a hypothesis about a particular concept. | D3 Can you sum up the point of connecting finite sums to integrals? This concept is still a bit obscure to me. D2 How do you approximate the area under a curve using rectangles or trapezoids? D2 Explain the process of taking the limit of a sum as the number of rectangles increases to infinity. D2 What is a Riemann sum, and how is it related to the concept of an integral? D2 How can finite sums be used to estimate real-world quantities that change continuously over an interval? |
| Criteria Development | Understand when and why specific criteria apply, and know the conditions or assumptions required for different cases. | D2 Under what conditions does the ideal gas law provide accurate predictions, and when does it not? D1 What is the ideal gas law equation? D1 What are the standard conditions for temperature and pressure in experiments? D1 What is meant by 'ideal gas'? D1 How do real gases differ from ideal gases? |
| Procedural | Select a procedure according to task need and perform it. | D2 Describe the process by which hair cells transduce mechanical energy from sound waves into electrical signals. D1 What are hair cells? D1 What is mechanical energy? D1 What are sound waves? D1 What are electrical signals? |
| Evaluative | Verify reasonableness of results. | D3 How can I evaluate the suitability of the ideal gas equation for a given gas? D2 How do you calculate the properties such as pressure, temperature, and volume using the ideal gas law? D2 What methods can be used to obtain experimental data for gas properties under specific conditions? D2 How can deviations from ideal gas behavior be identified and measured? D2 Under what conditions does the ideal gas law provide accurate predictions, and when does it not? |
| Example | Provide an example of the given concept. | D2 Describe a scenario where energy is conserved but the process is thermodynamically impossible. D1 What does the law of conservation of energy state? D1 What is thermodynamic impossibility? D1 What is meant by energy conversion? D1 Can energy be created or destroyed? |
| Quantitative | Manipulate numerical data to make informed decisions. | D2 Explain the process and time complexity of deleting an element from a data structure like a linked list. D1 What is a linked list? D1 What is the definition of time complexity? D1 How is data stored in a linked list? D1 What does 'deleting an element' mean in the context of data structures? |
| Application | Apply concepts to practical situations. | D2 What policies can governments implement to encourage the transition towards a circular economy and sustainable business practices? D1 What is a circular economy? D1 What are sustainable business practices? D1 What is the role of government in regulating the economy? D1 What does the term 'policy' mean in the context of government regulation? |
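The Riemann-sum questions in the Inductive row above have a direct computational analogue; the following is a minimal sketch (the integrand and interval are chosen purely for illustration) of approximating an integral by a finite sum of rectangles:

```python
def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n left-endpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + i * width) * width for i in range(n))

# As n grows, the finite sum approaches the exact integral of x^2 over [0, 1],
# which is 1/3; with n = 100,000 the error is on the order of 1/(2n).
estimate = riemann_sum(lambda x: x * x, 0.0, 1.0, 100_000)
```

Taking the limit of this sum as n increases to infinity is precisely the definition of the Riemann integral that the D2 questions build toward.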
Table 16: Snippet of hierarchical question deconstruction for analyzing matrix diagonalizability. The topmost complex question, D 3 , requires developing criteria for whether the statement holds. The first D 2 question identifies the key property that helps determine the case. The D 1 child questions address relevant definitions, characteristics, and formulas in order to synthesize the foundational concepts.
<details>
<summary>Image 34 Details</summary>

### Visual Description
## List of Questions: Linear Algebra Concepts
### Overview
The image presents a list of questions related to linear algebra concepts, specifically focusing on matrices, eigenvectors, and eigenvalues. Each question is prefixed with a difficulty level indicator (D1, D2, or D3).
### Components/Axes
* **Difficulty Levels:** D1 (light purple), D2 (light blue), D3 (light green)
* **Questions:** A series of questions related to linear algebra.
### Detailed Analysis
Here's a breakdown of the questions and their associated difficulty levels:
* **D3 (light green):** "Does a matrix always have a basis of eigenvectors?"
* **D2 (light blue):** "How can you determine if a square matrix is diagonalizable?"
* **D1 (light purple):** "What is the definition of a square matrix?"
* **D1 (light purple):** "What are the characteristics of a diagonal matrix?"
* **D1 (light purple):** "What is meant by the eigenvalues of a matrix?"
* **D1 (light purple):** "How is the characteristic equation of a matrix defined?"
* **D2 (light blue):** "What is the process for finding the eigenvalues of a matrix?"
* **D2 (light blue):** "Explain how to compute eigenvectors from a given set of eigenvalues."
* **D2 (light blue):** "Describe the method to perform a similarity transformation on a matrix."
### Key Observations
* The questions cover fundamental concepts in linear algebra.
* Difficulty levels range from D1 (basic definitions) to D3 (more complex theoretical questions).
* The questions are ordered with D3 at the top, followed by D2, and then D1.
### Interpretation
The image likely represents a study guide, quiz, or a set of review questions for a linear algebra course. The difficulty levels provide a way to gauge understanding of the material, starting with basic definitions and progressing to more complex concepts and procedures. The questions cover key topics such as matrix diagonalization, eigenvalues/eigenvectors, and matrix transformations.
</details>
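The D 2 question on diagonalizability above admits a direct numerical check: a square matrix is diagonalizable exactly when its eigenvectors form a basis, i.e., when the matrix whose columns are eigenvectors has full rank. A minimal sketch (the tolerance is an illustrative choice):

```python
import numpy as np

def is_diagonalizable(matrix, tol=1e-10):
    """A square matrix is diagonalizable iff its eigenvectors span the space,
    i.e., the matrix of eigenvectors has full rank (is invertible)."""
    _, eigvecs = np.linalg.eig(matrix)
    return np.linalg.matrix_rank(eigvecs, tol=tol) == matrix.shape[0]

symmetric = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric: always diagonalizable
jordan = np.array([[1.0, 1.0], [0.0, 1.0]])     # classic non-diagonalizable Jordan block
```

For the Jordan block, the repeated eigenvalue 1 contributes only one independent eigenvector, so the eigenvector matrix is rank-deficient and the check fails, matching the theoretical criterion in the D 1 and D 2 questions.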
Table 17: Snippet of hierarchical question deconstruction for understanding species differentiation and hybrid viability. Key reasoning steps include identifying exceptions, elucidating causal relationships, and hypothesizing potential outcomes in species classification criteria, culminating in a robust evaluation of biological definitions and exceptions.
<details>
<summary>Image 35 Details</summary>

### Visual Description
## Text Snippets: Questions and Statements about Hybrid Offspring
### Overview
The image presents a series of questions and a statement related to the biological concept of viable offspring from different species, specifically focusing on horses, donkeys, and mules. The questions are categorized into three levels, D1, D2, and D3, indicated by colored labels.
### Components/Axes
* **Labels:** The questions and statements are labeled with "D1" (light purple), "D2" (light blue), and "D3" (light green). These labels likely indicate different levels of difficulty or depth of inquiry.
* **Text Content:** The image contains several questions and one statement.
### Detailed Analysis
Here's a breakdown of the text content, categorized by its label:
* **D3 (light green):** "I thought that animals from different species could not produce viable offspring. However, a horse and a donkey can produce viable offspring. Isn't this contradictory?"
* **D2 (light blue):**
* "How do horses and donkeys differ genetically and what makes them separate species?"
* "What is the definition of a viable offspring in biological terms?"
* "Can you explain the concept of hybrid vigor and how it applies to cross-species offspring such as mules?"
* "What are the implications of sterility in hybrids like mules for their classification as members of a species?"
* **D1 (light purple):**
* "What is the definition of hybrid vigor?"
* "What is a mule?"
* "How is a mule produced?"
* "What are the parent species of a mule?"
### Key Observations
* The questions range from basic definitions (D1) to more complex genetic and classification implications (D2).
* The statement (D3) introduces a perceived contradiction that prompts further inquiry.
* The focus is on the biological concepts of viable offspring, hybrid vigor, and species classification, using the example of horses, donkeys, and mules.
### Interpretation
The image appears to be a set of prompts or questions designed to explore the complexities of species, hybridization, and reproductive viability. The progression from D1 to D3 suggests a learning path, starting with fundamental definitions and culminating in a deeper understanding of the apparent contradiction in the case of horse-donkey hybrids (mules). The questions highlight the importance of understanding genetic differences, hybrid vigor, and the implications of sterility in defining species boundaries.
</details>
## Question:
I'm studying ideal and real gasses in engineering thermodynamics. Why can't we measure temperature in Fahrenheit?
## Predicted answer:
In engineering thermodynamics, it's common to work with ideal and real gases, and temperature plays a crucial role in characterizing their behavior.
Actually, you can measure temperature in Fahrenheit, but not in the context of the ideal gas law or ideal gas equations. The issue lies in the definition and scaling of the Fahrenheit and Celsius temperature scales.
Celsius is the preferred choice for scientific and engineering applications because it's based on the concept of absolute zero, which is the theoretically lowest possible temperature. It's defined as 0 K (kelvin). This allows for a coherent and consistent framework for discussing temperature and thermodynamic quantities.
Fahrenheit, on the other hand, has a different zero point (32°F) and relies on arbitrary reference points to define its scale. Although it's still used for everyday applications, it doesn't provide the same theoretical foundation as Celsius.
The ideal gas law and the equations of state for gases are typically expressed in terms of absolute temperature (K or °R), and Celsius is the most compatible scale for these applications.
So, while you could measure temperature in Fahrenheit if needed, it's essential to understand the underlying science and choose the most relevant and consistent temperature scale when working with ideal and real gases in engineering thermodynamics.
## Feedback:
The response provides an accurate explanation regarding the use of Fahrenheit versus Celsius (and indirectly Kelvin) in the context of thermodynamics and ideal gases. It correctly identifies the reasons why Celsius and Kelvin are preferred for scientific studies, focusing on the relevance of absolute zero and the uniformity in scientific communication. However, it slightly misleads by stating that Celsius is based on the concept of absolute zero, when in fact, it is Kelvin that is based on this concept. Celsius is indeed more aligned with Kelvin compared to Fahrenheit but it is not based on absolute zero; rather, it is defined by the melting point of ice and the boiling point of water at one atmosphere of pressure, which are then used to extrapolate its relation to Kelvin. The response demonstrates a strong understanding but includes a minor inaccuracy regarding the basis of the Celsius scale. This inaccuracy does not significantly detract from the overall quality and factual correctness of the explanation but is worth noting for precision.
## Score:
Table 18: Example LLaMA 3 8B Instruct zero-shot inference on a D 3 question followed by GPT-4 Turbo evaluation.
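The evaluator's point above, that ideal-gas calculations require an absolute temperature scale, can be made concrete: a Fahrenheit reading is usable in PV = nRT only after conversion to Kelvin (or Rankine). A minimal sketch (the function names and example values are illustrative, not from the paper):

```python
R = 8.314  # universal gas constant, J/(mol*K)

def fahrenheit_to_kelvin(temp_f):
    """Convert Fahrenheit to Kelvin; absolute temperature is required for PV = nRT."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

def ideal_gas_pressure(n_mol, volume_m3, temp_f):
    """Pressure in pascals from the ideal gas law, given a Fahrenheit reading."""
    return n_mol * R * fahrenheit_to_kelvin(temp_f) / volume_m3

# 1 mol in 22.4 L at 32 F (the ice point) gives roughly atmospheric pressure.
pressure = ideal_gas_pressure(1.0, 0.0224, 32.0)
```

Plugging 32 °F directly into the law, rather than 273.15 K, would be off by nearly an order of magnitude, which is the substance of the feedback's distinction between relative and absolute scales.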
## Example 1: Common Backward Discrepancy
D3 question: I understand the point of the tangle number. But why go to the extra lengths of considering the continued fraction representation of the tangle number.
## LLaMA 2 7B Chat Predicted Answer (Score: 4)
(...) The continued fraction representation of a tangle number allows us to represent the tangle as a rational number , which can be easily manipulated and analyzed. This can be particularly useful when studying the properties of tangles, such as their convergence properties, their behavior under certain operations, and their relationship to other mathematical structures. (...)
## LLaMA 3 70B Instruct Predicted Answer (Score: 4)
(...) By looking at the continued fraction representation, we can gain insight into the tangle's underlying topology and geometry. In particular, the continued fraction representation can reveal information about the tangle's symmetry, its decomposition into simpler components, and even its relationship to other tangles. (...)
D2 question: What is the process of constructing a tangle for a given rational number?
## LLaMA 2 7B Chat Predicted Answer (Score: 1)
(...) Choose a base: There are many different bases that you can use to construct a tangle for a rational number. Some common bases include the golden ratio, the Fibonacci sequence, and the square root of 2 . Choose a base that you feel comfortable working with and that will help you visualize the tangle. (...)
Feedback: Introduces unrelated concepts that are not part of the accepted method for constructing tangles based on rational numbers.
## LLaMA 3 70B Instruct Predicted Answer (Score: 2)
- (...) Draw p arcs: Draw p arcs from the top of the circle to the bottom, each crossing the circle at a single point. These arcs will form the strands of the tangle.
- Twist the arcs: Twist the arcs around each other according to the following rule: for each of the q crossings, take the upper strand and pass it over or under the lower strand (it doesn't matter which) to create a twist. (...)
Feedback: Describes a process that diverges from the correct method of using continued fractions to construct a tangle. It also omits the critical step of expressing the rational number as a continued fraction and constructing the tangle diagram based on the continued fraction's terms.
<details>
<summary>Image 36 Details</summary>

### Visual Description
## Text Analysis: Backward Discrepancy by Small Model
### Overview
The image presents an example of "Backward Discrepancy by Small Model," showcasing the performance of the LLaMA 2 7B Chat model on two questions related to data structures and algorithm complexity. The model's answers are evaluated and scored, with feedback provided on the accuracy and terminology used.
### Components/Axes
The image is structured into the following components:
1. **Title:** "Example 2: Backward Discrepancy by Small Model"
2. **Question 1 (D3):**
* **Label:** "D3 question:"
* **Question Text:** "Given that node operations are O(1), explain whether this implies that all operations in data structures using nodes will also have O(1) complexity, and why or why not."
* **Model Answer:** "LLaMA 2 7B Chat Predicted Answer (Score: 5)"
* **Answer Text:** "(\...) just because node operations have a constant time complexity doesn't necessarily mean that all operations in data structures using nodes will also have O(1) complexity. The reason for this is that the time complexity of an operation can depend on many factors beyond just the size of the input, such as the data structure's internal organization, the algorithms used, and the hardware on which the data structure is running. (\...)"
3. **Question 2 (D2):**
* **Label:** "D2 question:"
* **Question Text:** "How does the insertion operation work in a linked list, and what factors affect its time complexity?"
* **Model Answer:** "LLaMA 2 7B Chat Predicted Answer (Score: 2)"
* **Answer Text:** "(\...) Here are the common ways to perform insertion in a linked list: 1. *Linear Insertion*: (\...) This approach has a time complexity of O(1) because we only need to traverse the list once to find the correct position. 2. *Shift-Insertion*: (\...) This approach has a time complexity of O(n) because we need to traverse the entire list to shift all the nodes. 3. *Rotate-Insertion*: (\...) This approach has a time complexity of O(n) because we need to traverse the entire list to rotate all the nodes. (\...)"
4. **Feedback:**
* **Text:** "Feedback: Introduces non-standard terms ("Linear Insertion," "Shift-Insertion," "Rotate-Insertion") that are not typically used in describing linked list operations. Incorrectly states that insertion at a specific position has O(1) time complexity, when it should be O(n) in the worst case."
### Detailed Analysis
* **Question D3:** The LLaMA 2 7B model received a score of 5 for its answer. The answer correctly identifies that O(1) node operations do not necessarily imply O(1) complexity for all operations on data structures using those nodes. It explains that the time complexity depends on factors beyond input size, such as internal organization, algorithms, and hardware.
* **Question D2:** The LLaMA 2 7B model received a score of 2 for its answer. The answer lists three insertion methods: Linear Insertion (O(1)), Shift-Insertion (O(n)), and Rotate-Insertion (O(n)).
* **Feedback:** The feedback highlights two issues: the introduction of non-standard terminology and an incorrect statement about the time complexity of insertion at a specific position.
### Key Observations
* The model performs better on the more conceptual question (D3) than on the question requiring specific knowledge of data structure operations (D2).
* The model's answer to D2 includes inaccuracies, as pointed out in the feedback.
### Interpretation
The image demonstrates a scenario where a small language model (LLaMA 2 7B) exhibits "backward discrepancy," meaning it performs better on a higher-level, more abstract question (D3) than on a more specific, knowledge-based question (D2). This suggests that the model may have a better grasp of general principles than specific details. The feedback highlights the model's tendency to use non-standard terminology and make factual errors, indicating potential limitations in its training data or reasoning capabilities. The discrepancy in scores (5 vs. 2) underscores the importance of evaluating language models on a variety of tasks to identify their strengths and weaknesses.
</details>
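The feedback in the figure above notes that inserting at a specific position in a singly linked list is O(n) in the worst case: the link update itself is O(1), but reaching the position requires traversal. A minimal sketch (class and function names are illustrative):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def insert_at(head, index, value):
    """Insert value at position index in a singly linked list.
    The pointer rewiring is O(1), but the traversal to reach the
    position is O(index), hence O(n) in the worst case."""
    new_node = Node(value)
    if index == 0:
        new_node.next = head
        return new_node
    prev = head
    for _ in range(index - 1):  # traversal dominates the cost
        prev = prev.next
    new_node.next = prev.next
    prev.next = new_node
    return head

# Build 1 -> 3, then insert 2 at index 1 to get 1 -> 2 -> 3.
head = Node(1)
head.next = Node(3)
head = insert_at(head, 1, 2)
```

This separation between the O(1) node operation and the O(n) traversal is exactly the distinction the D 3 question probes and the model's D 2 answer gets wrong.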
## Example 3: Backward Discrepancy by Large Model
D3 question: The quadratic formula is pretty cool. What similar formulas, if any, exist for higher degree equations?
## LLaMA 3 70B Instruct Predicted Answer (Score: 5)
(...) For cubic equations of the form ax^3 + bx^2 + cx + d = 0, there is a formula known as Cardano's Formula (or Cardano's Method). (...) For quartic equations of the form ax^4 + bx^3 + cx^2 + dx + e = 0, there is a formula known as Ferrari's Method (or Ferrari's Solution). (...) For equations of degree 5 or higher, there is no general formula that can be written using only addition, subtraction, multiplication, division, and nth roots (like the quadratic formula). This was proven by Niels Henrik Abel in 1824, and later independently by Évariste Galois in 1832. (...)
D2 question: Explain the steps to solve quartic equations using the quartic formula.
## LLaMA 3 70B Instruct Predicted Answer (Score: 1)
The quartic formula! While it's not as well-known as the quadratic formula, it's a powerful tool for solving quartic equations of the form ax^4 + bx^3 + cx^2 + dx + e = 0. Here are the steps to solve quartic equations using the quartic formula:
## Steps to Solve Quartic Equations:
1. Write down the quartic equation in standard form: ax^4 + bx^3 + cx^2 + dx + e = 0, where a, b, c, d, and e are constants.
2. Rearrange the equation to have a = 1: Divide both sides of the equation by a, so that the coefficient of x^4 is 1. This simplifies the formula.
3. Calculate the values of p, q, and r:
   - p = b^2 - 3c
   - q = bc - 4d - a
   - r = c^2 - 3bd + 12ae
4. Calculate the value of u: u = (r/2 ± √((r/2)^2 + (q/3)^3))^(1/3)
5. Calculate the value of v: v = √(q + u - b^2/3) / (3u)
6. Calculate the value of w: w = (-b ± √(b^2 - 4v)) / 2
7. Solve for x: x = (-b ± 2w ± √(w^2 - 4u)) / 2
Feedback: Introduces an incorrect formula for solving quartic equations, which does not align with established methods like Ferrari's solution. The provided steps and calculations do not correspond to the correct procedure for solving quartic equations using the quartic formula or any other recognized method.
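Whatever closed-form procedure a model proposes, its candidate roots can be sanity-checked by direct substitution into the original equation, a check that would immediately expose the incorrect steps above. A minimal sketch (the example quartic and its roots are illustrative and not drawn from the dataset):

```python
def poly(coeffs, x):
    """Evaluate a polynomial with coefficients [a, b, c, d, e] via Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# x^4 - 10x^2 + 9 factors as (x^2 - 1)(x^2 - 9), so its exact roots are ±1 and ±3.
coeffs = [1, 0, -10, 0, 9]
for root in (-3.0, -1.0, 1.0, 3.0):
    assert abs(poly(coeffs, root)) < 1e-9  # each proposed root must satisfy the equation
```

Running the same substitution check on roots produced by the model's fabricated u, v, w formulas would yield nonzero residuals, consistent with the score of 1.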
## System prompt:
You are an excellent question classifier. You will be given (1) a question and (2) key points that a good response would address when answering the question. You have to classify at which Depth of Knowledge (DOK) level the question is located. DOK is a framework that focuses on the context in which knowledge will be demonstrated. Here is the definition of each DOK level:
1. DOK-1 (Basic Knowledge and Recall): This level addresses 'What is the knowledge?'. It evaluates the ability to remember, explain, or pinpoint fundamental facts, terms, principles, and procedures. It's about recognizing or recollecting basic information and performing simple, direct tasks.
2. DOK-2 (Application of Knowledge and Skills): This level explores 'How can the knowledge be used?'. It tests the ability to employ knowledge and concepts in practical situations, which involves choosing appropriate methods, solving straightforward problems, or interpreting data. This level acts as an intermediary step between fundamental understanding and more advanced reasoning.
3. DOK-3 (Analytical and Strategic Thinking): This level questions 'Why can the knowledge be used?'. It challenges one to use strategic thought, logic, and problem-solving in intricate, abstract situations that might have more than one solution. This stage demands critical thinking, rationale, and conceptualization of theoretical scenarios.
4. DOK-4 (Extended and Integrative Knowledge): This level examines 'How else can the knowledge be applied?'. It assesses the ability to conduct thorough research, apply concepts and skills in real-world scenarios, and integrate knowledge across different disciplines or sources. It involves developing original ideas, conducting experiments, and synthesizing information from various fields. Note that in the science domain, this level may be constrained to designing studies, experiments, and projects and is thus rare or even absent in most standardized assessment.
## User prompt:
Please classify the following question into DOK-1, 2, 3, or 4. Refer to the key points to help your judgment. Think step-by-step and provide an explanation of your judgment. After providing your explanation, output the DOK level that is an integer of 1, 2, 3, or 4. The output format should look as follows: {explanation for reaching the DOK decision} [RESULT] {DOK level that is an integer in the range 1 to 4}.
## System prompt:
You are an excellent assistant that effectively answers complex questions. You are given a passage, question, and key points to answer the question. Read the instruction and give an appropriate answer.
## User prompt:
## Instruction
Answer the question below.
- You may refer to the contents in the chapter text above if necessary, but do NOT expose in your answer that you are referring to the provided source.
- Ensure that the answer is complete, fully satisfying the key points to answer the question.
- The answer must also match the level of cognitive complexity required, incorporating the context in which the depth of knowledge will be demonstrated.
Table 19: Prompt for classifying TutorEval questions.
Table 20: Prompt for generating reference answer for a D3 question.
## System prompt:
You are a helpful assistant that accurately answers complex questions. Ensure that your answer is focused and compact.
## User prompt:
{question}
## System prompt:
You are an excellent question generator. You will be given a question and a gold answer to the question. You have to generate shallower questions from the given question. Here is the definition of the depth of knowledge a question requires:
1. Depth-1 (Basic Knowledge and Recall): This level addresses 'What is the knowledge?'. It evaluates the ability to remember, explain, or pinpoint fundamental facts, terms, principles, and procedures. It's about recognizing or recollecting basic information and performing simple, direct tasks.
2. Depth-2 (Application of Knowledge and Skills): This level explores 'How can the knowledge be used?'. It tests the ability to employ knowledge and concepts in practical situations, which involves choosing appropriate methods, solving straightforward problems, or interpreting data. This level acts as an intermediary step between fundamental understanding and more advanced reasoning.
3. Depth-3 (Analytical and Strategic Thinking): This level questions 'Why can the knowledge be used?'. It challenges one to use strategic thought, logic, and problem-solving in intricate, abstract situations that might have more than one solution. This stage demands critical thinking, rationale, and conceptualization of theoretical scenarios.
Table 21: Prompt for generating reference answer for a D1 or D2 question.
Table 22: System prompt for generating or augmenting D1 or D2 questions.
Table 23: User prompt for generating D2 questions.
<details>
<summary>Image 37 Details</summary>

### Visual Description
## Question Generation Examples
### Overview
The image presents instructions and examples for generating "Depth-2" questions that appropriately address a given "Depth-3" question. It outlines the criteria for these questions and provides four examples, each containing a Depth-3 question and a set of generated Depth-2 questions in JSON format.
### Components/Axes
* **Instruction:** A set of guidelines for creating Depth-2 questions.
* **Examples:** Four distinct examples, each consisting of:
* **Depth-3 question:** The primary question to be addressed.
* **Generated Depth-2 questions:** A list of questions designed to help answer the Depth-3 question. These are formatted as a JSON object with the key "Depth-2_questions" and a list of question strings as the value.
* **JSON Format:** The Depth-2 questions are structured in a JSON format: `{"Depth-2_questions": [list of Depth-2 question strings]}`.
### Content Details
**Instruction:**
* Create a maximum of 4 Depth-2 questions.
* Depth-2 questions should focus on the application of procedural knowledge and skills, while Depth-3 questions focus on analysis and strategic knowledge.
* Consider the cognitive complexity required to solve the Depth-3 question.
* Ensure the Depth-2 questions adequately and comprehensively cover the necessary factual or conceptual knowledge.
* Ensure the Depth-2 questions do not directly answer the Depth-3 question.
* The number of generated Depth-2 questions should not exceed 4.
**Example 1:**
* **Depth-3 question:** "What is the intuition behind the Gram-Schmidt procedure?"
* **Generated Depth-2 questions:**
* "How do you project one vector onto another vector?"
* "What does it mean for two vectors to be orthogonal, and how can you verify this property?"
* "Describe the process of normalizing a vector."
* "Explain how subtracting the projection of one vector from another results in orthogonality."
* "Given a set of vectors, how can you determine if they are linearly independent?"
* "How can the concept of linear independence be used to form a basis for a vector space?"
**Example 2:**
* **Depth-3 question:** "Why couldn't we test general relativity effects using the Eotvos experiment?"
* **Generated Depth-2 questions:**
* "How does the Eötvös experiment determine the equivalence between inertial mass and gravitational mass?"
* "Describe the Equivalence Principle and its significance in the theory of General Relativity."
* "Identify experiments or observations that could directly test the predictions of General Relativity, such as time dilation or the bending of light."
* "How do experiments measuring time dilation differ in design and scope from those measuring mass equivalence?"
**Example 3:**
* **Depth-3 question:** "Why are aldehydes more readily oxidized to carboxylic acids compared to ketones, and how does this difference in reactivity influence their identification in the laboratory?"
* **Generated Depth-2 questions:**
* "How can you identify an aldehyde using Tollens' reagent?"
* "Why does the carbonyl carbon in aldehydes have a significant partial positive charge?"
* "How does the structure of ketones differ from that of aldehydes, and how does this affect their reactivity towards oxidation?"
**Example 4:**
* **Depth-3 question:** "In the context of computer programming, why is branching unstructured? And is it a bad design choice?"
* **Generated Depth-2 questions:**
* "What are the key differences between structured and unstructured branching in programming?"
* "How does the 'goto' statement work in computer programming?"
* "What are the potential risks involved with using unstructured branching in large software projects?"
* "How does the structure of a program affect its maintainability?"
* "How can the flow of execution in a program influence its debuggability?"
### Key Observations
* The Depth-3 questions require a higher level of understanding and analysis.
* The Depth-2 questions are designed to break down the Depth-3 question into smaller, more manageable parts.
* The JSON format provides a structured way to represent the generated questions.
* Each example provides a different subject area, showcasing the versatility of the question generation process.
### Interpretation
The image illustrates a method for generating supporting questions (Depth-2) to aid in answering a more complex question (Depth-3). The instructions emphasize the importance of focusing on procedural knowledge for Depth-2 questions and analytical knowledge for Depth-3 questions. The examples demonstrate how to break down complex questions into smaller, more specific inquiries that, when answered, contribute to a comprehensive understanding of the original question. The use of JSON format suggests a potential application in automated question generation or knowledge representation systems. The examples cover diverse topics, indicating the broad applicability of this approach.
</details>
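The procedural skills targeted by the Depth-2 questions in Example 1 above (projecting one vector onto another, checking orthogonality, normalizing) are exactly the steps of the Gram-Schmidt procedure that the Depth-3 question asks about. A minimal pure-Python illustration (written for this exposition; it is not part of the dataset or its pipeline):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors: subtract projections, then normalize."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # Subtract the projection of w onto the already-orthonormal vector b.
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))  # normalize to unit length
        basis.append([wi / norm for wi in w])
    return basis

basis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
# The resulting vectors are unit length and mutually orthogonal.
```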
Table 24: User prompt for generating D1 questions.
## User prompt:
## Instruction
Create {count} Depth-2 question(s) that complement current Depth-2 questions, which are necessary to correctly answer the provided Depth-3 question.
- Remember that Depth-2 questions are centered on application of procedural knowledge and skills and Depth-3 questions are centered on analysis and strategic knowledge.
- Take into consideration the level of cognitive complexity required to solve the Depth-3 question, so that your generated questions fall under the description of Depth-2 appropriately.
- Complement the existing Depth-2 questions with additional questions to ensure they collectively cover all necessary procedural knowledge and skills required to answer the Depth-3 question effectively.
- Ensure that all of your generated Depth-2 questions do not directly answer the given Depth-3 question.
- The number of all Depth-2 questions should not exceed 4.
- The generated Depth-2 questions should be in JSON format: {'Depth-2_questions': [list of Depth-2 question strings]}
## Example 1
### Depth-3 question and current Depth-2 questions
What is the intuition behind the Gram-Schmidt procedure?
{'current_Depth-2_questions': ['How do you project one vector onto another vector?', 'What does it mean for two vectors to be orthogonal, and how can you verify this property?', 'Describe the process of normalizing a vector.', 'Explain how subtracting the projection of one vector from another results in orthogonality.', 'Given a set of vectors, how can you determine if they are linearly independent?']}
### Generated complementary Depth-2 questions
{'complementary_Depth-2_questions': ['How can the concept of linear independence be used to form a basis for a vector space?']}
## Example 2
### Depth-3 question and current Depth-2 questions
Why couldn't we test general relativity effects using the Eotvos experiment?
{'current_Depth-2_questions': ['How does the Eötvös experiment determine the equivalence between inertial mass and gravitational mass?', 'Describe the Equivalence Principle and its significance in the theory of General Relativity.', 'Identify experiments or observations that could directly test the predictions of General Relativity, such as time dilation or the bending of light.']}
### Generated complementary Depth-2 questions
{'complementary_Depth-2_questions': ['How do experiments measuring time dilation differ in design and scope from those measuring mass equivalence?']}
## Example 3
### Depth-3 question and current Depth-2 questions
Why are aldehydes more readily oxidized to carboxylic acids compared to ketones, and how does this difference in reactivity influence their identification in the laboratory?
{'current_Depth-2_questions': ['How can you identify an aldehyde using Tollens' reagent?', 'Why does the carbonyl carbon in aldehydes have a significant partial positive charge?']}
### Generated complementary Depth-2 questions
{'complementary_Depth-2_questions': ['How does the structure of ketones differ from that of aldehydes, and how does this affect their reactivity towards oxidation?']}
## Example 4
### Depth-3 question and current Depth-2 questions
In the context of computer programming, why is branching unstructured? And is it a bad design choice?
{'current_Depth-2_questions': ['What are the key differences between structured and unstructured branching in programming?', 'How does the 'goto' statement work in computer programming?']}
### Generated complementary Depth-2 questions
{'complementary_Depth-2_questions': ['What are the potential risks involved with using unstructured branching in large software projects?', 'How does the structure of a program affect its maintainability?', 'How can the flow of execution in a program influence its debuggability?']}
## Depth-3 question
{question}
## Answer to the Depth-3 question
{answer}
## Current Depth-2 questions
{'current_Depth-2_questions': {current_questions}}
## Generated {count} complementary Depth-2 questions
Table 25: User prompt for augmenting D2 questions.
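The generation prompts above require the model to emit its questions as a JSON-like object with a fixed key and at most four entries. A hypothetical validator for such outputs, assuming the model returns strict JSON with double quotes (the helper name and this strictness assumption are ours; the actual pipeline may parse more leniently):

```python
import json

def validate_depth2(output_text, max_questions=4):
    """Check that a generated output is a JSON object holding a list of at most
    `max_questions` question strings under one of the expected keys."""
    data = json.loads(output_text)
    questions = data.get("Depth-2_questions",
                         data.get("complementary_Depth-2_questions"))
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("expected a JSON list of question strings")
    if len(questions) > max_questions:
        raise ValueError(f"at most {max_questions} Depth-2 questions are allowed")
    return questions

qs = validate_depth2('{"Depth-2_questions": ["How does the goto statement work?"]}')
```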
Table 26: User prompt for augmenting D1 questions.
## System prompt:
You are a helpful, respectful and honest assistant. Answer the question.
Table 27: Prompt for zero-shot inference.
Table 28: Prompt for inference given reference answers or self-predictions on shallower questions.
Table 29: Prompt for multi-turn inference.
## System prompt:
You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.
## User prompt:
###Task Description:
An instruction (which might include an input inside it), a response to evaluate, and a score rubric representing the evaluation criteria are given.
1. Write detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing the feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: 'Feedback: (write feedback for the criteria) [RESULT] (an integer number between 1 and 5)'
4. Please do not generate any other opening, closing, and explanations.
###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
[Is the response correct, accurate, and factual?]
Score 1: The response is largely incorrect, inaccurate, and not factual. It demonstrates a fundamental misunderstanding of the query or topic, leading to irrelevant or completely erroneous information.
Score 2: The response is partially correct but contains significant inaccuracies or factual errors. It shows some understanding of the query or topic but fails to provide a fully accurate or reliable answer.
Score 3: The response is generally correct and factual but may include minor inaccuracies or lack of detail. It shows a good understanding of the query or topic but may miss some nuances or specific information.
Score 4: The response is mostly correct, accurate, and factual. It demonstrates a strong understanding of the query or topic, with only minimal inaccuracies or omissions that do not significantly detract from the overall quality of the response.
Score 5: The response is consistently correct, accurate, and entirely factual. It reflects a comprehensive understanding of the query or topic, providing detailed, precise, and fully reliable information without any inaccuracies or omissions.
###Feedback:
Table 30: Prompt for LLM-as-a-Judge evaluation with an accuracy score rubric.
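Judge responses in the format above can be consumed mechanically by splitting the feedback text from the final 1-5 score. A hypothetical parsing sketch (the function name and its tolerance for minor spacing variants around the result tag are our assumptions, not code from the paper):

```python
import re

def parse_judge_output(text):
    """Extract (feedback, score) from a judge response of the form
    'Feedback: ... [RESULT] <1-5>', tolerating stray whitespace in the tag."""
    match = re.search(r"Feedback:\s*(.*?)\s*\[RESULT\s*\]\s*([1-5])\b", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not follow the expected format")
    return match.group(1).strip(), int(match.group(2))

feedback, score = parse_judge_output(
    "Feedback: Mostly accurate, with minor omissions. [RESULT] 4"
)
# score == 4
```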