# Knowledge Graph-extended Retrieval Augmented Generation for Question Answering
**Authors**: Jasper Linders [1], Jakub M. Tomczak [1]

[1] Department of Mathematics and Computer Science, Eindhoven University of Technology, De Zaale, Eindhoven, 5600 MB, the Netherlands
## Abstract
Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts as well as easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of a Retrieval Augmented Generation (RAG) with a KG, thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.
keywords: Knowledge Graphs, Large Language Models, Retrieval-Augmented Generation, Question Answering
## 1 Introduction
As our world becomes increasingly digital and information is more widely available than ever before, technologies that enable information retrieval and processing have become indispensable in both our personal and professional lives. The advent of Large Language Models (LLMs) has had a great impact by changing the way many internet users interact with information, through models like ChatGPT (https://chatgpt.com/). This has arguably played a large role in sparking an immense interest in solutions that build on artificial intelligence.
The rapid adoption of LLMs has transformed the fields of natural language processing (NLP) and information retrieval (IR). Their understanding of natural language, with its long-range dependencies and contextual meanings, as well as their human-like text generation capabilities, allows these models to be applied to a wide variety of tasks. Additionally, LLMs have proven to be few-shot learners, meaning that they have the ability to perform unseen tasks with only a couple of examples [1]. Unfortunately, the benefits of LLMs come at the cost of characteristic downsides, which are important to consider.
LLMs can hallucinate [2], generating untruthful or incoherent outputs. They also miss knowledge not present during training, leading to knowledge cutoff, and cannot guarantee that certain training data is remembered [3]. Because of their massive size and data requirements, LLMs are expensive to train, deploy, and maintain [4]. Thus, smaller models or those needing only fine-tuning can be more practical for many use cases.
By contrast, Knowledge Graphs (KGs) store information explicitly as entities and relationships, allowing symbolic reasoning and accurate answers [5]. Even if a direct link between entities is missing, inferences can be drawn from their shared associations. KGs may also recall underrepresented knowledge better than LLMs [3]. However, they are costly to build, specialized to a domain, and typically require querying languages rather than natural language [6]. They also do not easily generalize to other domains [5].
Retrieval-Augmented Generation (RAG) [7] addresses LLMs' lack of external knowledge by augmenting them with a text document database. Text documents are split into chunks, embedded, and stored in a vector database; the most similar chunks to an input query are retrieved and added to a prompt so the LLM can generate an answer based on this external information [8] (see Figure 1). However, relying on unstructured text can miss comprehensive entity data and even introduce distracting misinformation [8].
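The chunking and prompt-augmentation steps above can be sketched as follows; the function names and prompt wording are illustrative, not taken from any particular library.

```python
def chunk_text(text, size=50):
    """Split a document into fixed-size character chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_rag_prompt(question, retrieved_chunks):
    """Prepend the retrieved chunks to the question as context for the LLM."""
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Where is the Eiffel Tower?",
    ["The Eiffel Tower is located in Paris."],
)
```

In a full system, `retrieved_chunks` would come from a nearest-neighbor search over the vector database rather than being passed in directly.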
Figure 1: An example of a Retrieval-Augmented Generation (RAG) system, which combines information retrieval and text generation techniques. The red block indicates processing by a text embedding model, whereas the blue block depicts processing by an LLM. The yellow block shows a selector of nearest text chunks in the database.
To overcome these limitations, RAGs can utilize KGs. The resulting system integrates structured data from Knowledge Graphs in a RAG, enabling precise retrieval and complex reasoning. For example, KAPING [9] performs Knowledge Graph Question Answering (KGQA) without requiring any training. When training is needed for KG-enhanced LLMs, issues arise such as limited training data, domain specificity, and the need for frequent retraining as KGs evolve [10, 11]. In short, while RAG enhances LLMs by providing explainable, natural language outputs, incorporating structured Knowledge Graphs may offer improved reasoning and domain adaptability.
In this paper, we propose the Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG) system, which combines the reliability of Retrieval Augmented Generation (RAG) with the high precision of Knowledge Graphs (KGs) and operates without any training or fine-tuning. We focus on the task of Knowledge Graph Question Answering; although this focus is narrow, our findings may have broader implications. For instance, certain insights could be applied to the development of other systems that utilize KG-based information retrieval, such as chatbots. The primary objective of this work is to investigate how LLMs can be enhanced through the integration of KGs. Since the term "enhance" can encompass various improvements, we define it as follows. First, we aim to enable LLMs to be more readily applied across different domains requiring specialized or proprietary knowledge. Second, we seek to improve answer explainability, thereby assisting end users in validating LLM outputs. Ultimately, we aim to answer the following research questions:
1. How can Large Language Models be enhanced with Knowledge Graphs without requiring any training?
2. How can answer explainability be improved with the use of Knowledge Graph-extended Retrieval Augmented Generation systems?
## 2 Related Work
Knowledge Graphs
Knowledge Graphs (KGs) are structured databases that model real-world entities and their relationships as graphs, which makes them highly amenable to machine processing. They enable efficient querying to retrieve all entities related to a given entity, a task that would be significantly more challenging with unstructured text databases. Complex queries are executed using specialized languages such as SPARQL [12]. As noted in recent research, "the success of KGs can largely be attributed to their ability to provide factual information about entities with high accuracy" [3]. Typically, the information in KGs is stored as triples, i.e. $(subject,relation,object)$ .
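A minimal illustration of triple storage and neighborhood retrieval; the entity and relation names below are invented for the example.

```python
# A toy KG stored as a set of (subject, relation, object) triples.
KG = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Christopher Nolan", "born_in", "London"),
}

def one_hop(entity, kg):
    """Return all triples in which the entity appears as subject or object."""
    return {(s, r, o) for (s, r, o) in kg if s == entity or o == entity}

inception_facts = one_hop("Inception", KG)
```

A production KG would of course sit behind a graph database and a query language such as SPARQL, but the triple abstraction is the same.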
Large Language Models
Large Language Models (LLMs) learn natural language patterns from extensive text data, enabling various NLP tasks such as text generation and sentiment classification. Their emergence was enabled by the Transformer architecture, introduced in Attention Is All You Need [13], which efficiently models sequential data via attention mechanisms. Scaling these models, by increasing compute, dataset size, and parameter count, yields performance improvements following a power law [14], with LLMs typically comprising hundreds of millions to hundreds of billions of parameters.
LLMs generate text in an autoregressive manner. Given a sequence $x_{1:t}$ , the model produces a probability distribution $p(x_{t+1}|x_{1:t})=\mathrm{softmax}(z/T)$ over its vocabulary, where $z$ are the raw logits and $T$ is a temperature parameter that controls randomness. Instead of selecting tokens via simple $\mathrm{argmax}$ , more sophisticated sampling methods are employed (see Section 3.5) to generate coherent and diverse output consistent with the input context [15].
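The tempered softmax and a simple sampling step can be sketched as follows; the function names are illustrative, and the logits are made-up values.

```python
import math
import random

def softmax_with_temperature(logits, T=1.0):
    """p(x_{t+1} | x_{1:t}) = softmax(z / T): lower T sharpens the
    distribution toward argmax, higher T flattens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, T=1.0, rng=random):
    """Sample a next-token index from the tempered distribution."""
    probs = softmax_with_temperature(logits, T)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.5)
flat = softmax_with_temperature(logits, T=2.0)
```

Comparing `sharp` and `flat` shows the effect of $T$: the most likely token receives a larger share of the probability mass at low temperature.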
In-Context Learning & Chain-of-Thought
In-Context Learning (ICL) improves LLM performance by providing few-shot examples instead of zero-shot queries. This method boosts task performance through prompt engineering without altering model parameters [16]. It is often combined with Chain-of-Thought (CoT) that can significantly enhance performance without modifying the model's parameters or incurring the high cost of fine-tuning [17]. A CoT prompt instructs the model to generate intermediate reasoning steps that culminate in the final answer, rather than directly mapping a query to an answer [17]. This approach naturally decomposes complex queries into simpler steps, yielding more interpretable results.
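A few-shot CoT prompt can be assembled as plain text; the example question, reasoning, and answer below are invented placeholders, not from any benchmark.

```python
# Hypothetical in-context examples; real ones would be curated per task.
examples = [
    {
        "q": "Who directed the films that share an actor with [Film A]?",
        "cot": "First find the actors of [Film A], then the films they "
               "starred in, then the directors of those films.",
        "a": "[Director D]",
    },
]

def build_cot_prompt(question, examples):
    """Concatenate few-shot (question, reasoning, answer) demonstrations,
    then leave the reasoning slot open for the new question."""
    parts = [
        f"Question: {ex['q']}\nReasoning: {ex['cot']}\nAnswer: {ex['a']}"
        for ex in examples
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)

cot_prompt = build_cot_prompt("Who wrote [Book B]?", examples)
```

Because the prompt ends with `Reasoning:`, the model is nudged to emit its intermediate steps before the final answer.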
Knowledge Graph Question Answering
Knowledge Graph Question Answering (KGQA) is the task of answering questions using a specific knowledge graph (KG). Benchmarks such as Mintaka [18], WebQuestionsSP [19], and MetaQA [20] provide datasets where each row includes a question, its associated entity/entities, and the answer entity/entities, along with the corresponding KG (provided as a file of triples or accessible via an API). In these benchmarks, the question entity is pre-identified (avoiding the need for entity matching or linking), and performance is evaluated using the binary Hit@1 metric.
KGQA systems are typically classified into three categories [9]:
- Neural Semantic Parsing-Based Methods: These map a question to a KG query (e.g., in SPARQL), reducing the search space between question and answer entities. Although effective [19], they require labor-intensive semantic parse labels.
- Differentiable KG-Based Methods: These employ differentiable representations of the KG (using sparse matrices for subjects, objects, and relations) to perform query execution in the embedding space. They enable end-to-end training on question-answer pairs [21, 22], but necessitate ample training data and may not generalize across different KGs.
- Information Retrieval-Based Methods: These combine KGs with LLMs by retrieving relevant factsâwhich are then injected into the promptâto generate answers [9]. Although they leverage off-the-shelf components, they often require fine-tuning on KG-specific datasets [11].
Knowledge Graph-extended Retrieval Augmented Generation
Information retrieval-based KGQA (IR-KGQA) systems differ from neural semantic parsing and differentiable KG methods by delegating part of the reasoning over triples to the LLM. The process is split into retrieving candidate triples and then having the LLM reason over them to formulate an answer, whereas the other methods map directly from the question to the answer entities [21, 23].
KG-RAG is defined as an IR-KGQA system that employs a similarity-based retrieval mechanism using off-the-shelf text embedding models, akin to the original RAG system [7]. In KG-RAG (exemplified by the KAPING system [9]), candidate triples are retrieved up to $N$ hops from the question entity/entities, verbalized, and embedded alongside the question. Their similarity is computed via dot product or cosine similarity, and the Top- $K$ similar triples are passed to an answer generation LLM, which then outputs the answer.
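The verbalize-embed-select loop can be sketched as follows. The token-count "embedding" is a toy stand-in for an off-the-shelf sentence embedding model, and the triples are invented examples.

```python
import math
import re
from collections import Counter

def verbalize(triple):
    """Turn (s, r, o) into a plain-text fragment for embedding."""
    s, r, o = triple
    return f"{s} {r.replace('_', ' ')} {o}"

def embed(text):
    # Toy token-count vector; a real system would call an embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_triples(question, triples, k):
    """Rank candidate triples by similarity to the question, keep Top-K."""
    q = embed(question)
    ranked = sorted(triples, key=lambda t: cosine(q, embed(verbalize(t))),
                    reverse=True)
    return ranked[:k]

triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Titanic", "directed_by", "James Cameron"),
]
best = top_k_triples("Who directed Inception?", triples, k=1)
```

Even this crude similarity ranks the directing triple above the release-year triple, which is the behavior the retriever relies on.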
## 3 Methodology
### 3.1 Problem Statement
Let $G$ be a knowledge graph, defined as a set of triples of the form $(s,r,o)$ where:
- Each triple $(s,r,o)\in G\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ represents a fact;
- $s,o\in\mathcal{E}$ are entities from the set of all entities $\mathcal{E}$ ;
- $r\in\mathcal{R}$ is a relation from the set of all relations $\mathcal{R}$ .
We assume that the following objects are given:
- A question $q$ that can be answered using facts from $G$
- The question entity (or entities) $e_{q}\in\mathcal{E}$ appearing in $q$
Our objective is to develop a function $f$ that maps the given objects to both an answer and a reasoning chain, namely:
$$
f:q\times e_{q}\times G\rightarrow(a,c)
$$
where:
- $a$ is a natural language answer that can be derived from the facts in $G$
- $c$ is a reasoning chain in natural language, explaining the logical steps from $q$ and $e_{q}$ to $a$
Additionally, we aim for the following:
- Answer Accuracy: The function $f$ should have high answer accuracy, as evaluated by the Hit@1 metric.
- Answer Explainability: For each answer $a$ generated by the function $f$ , the reasoning chain $c$ must provide a clear logical explanation of how the answer was derived, so that it is more easily verifiable by the user.
- Application Generalizability: The function $f$ must operate without training or fine-tuning on specific Knowledge Graphs, using only In-Context Learning examples. The Knowledge Graphs must include sufficient amounts of natural language information, as the system relies on natural language-based methods.
The degree to which the function $f$ achieves the objectives is evaluated using both quantitative and qualitative methods, based on experiments with a KGQA benchmark, namely:
- Quantitative evaluation of answer accuracy, based on the Hit@1 metric.
- Qualitative analysis of reasoning chain clarity and logical soundness, as judged by a human evaluator on a sample of results.
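The quantitative criterion can be made concrete as follows; this is a minimal sketch of Hit@1 scoring under the assumption that each prediction is a single entity string.

```python
def hit_at_1(prediction, gold_answers):
    """Hit@1 for one question: 1 if the predicted entity matches any gold
    answer entity (case-insensitive), else 0."""
    return int(prediction.strip().lower() in {a.lower() for a in gold_answers})

def hit_at_1_accuracy(predictions, gold_answer_sets):
    """Mean Hit@1 over a benchmark split."""
    scores = [hit_at_1(p, g) for p, g in zip(predictions, gold_answer_sets)]
    return sum(scores) / len(scores)
```

Benchmarks with multiple valid answer entities per question are handled naturally, since a hit against any gold entity counts.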
### 3.2 State-of-the-Art
Recent advances in question answering have seen the development of several state-of-the-art methods that leverage a diverse array of Large Language Models alongside innovative baseline strategies. For instance, one method employs multiple scales of models such as T5, T0, OPT, and GPT-3, while experimenting with baselines ranging from no knowledge to generated knowledge on datasets like WebQSP [19] and Mintaka [18]. Another approach expands this exploration by integrating Llama-2, Flan-T5, and ChatGPT, and introducing baselines that utilize triple-form knowledge and alternative KG-to-Text techniques, evaluated on datasets that include WebQSP, MetaQA [20], and even a Chinese benchmark, ZJQA [11]. Additionally, methods centered on ChatGPT are further compared with systems like StructGPT and KB-BINDER across varying complexities of MetaQA and WebQSP. The overview of the SOTA methods is presented in Table 1.
Table 1: Comparison of the question answering LLMs, baselines and benchmark datasets that were used for the different models. The full set of QA LMs is as follows: T0 [24], T5 [25], Flan-T5 [26], OPT [27], GPT-3 [1], ChatGPT, AlexaTM [28], and Llama-2 [29]. The full set of datasets is as follows: WebQuestions [30], WebQSP [19], ComplexWebQuestions [31], MetaQA [20], Mintaka [18], LC-QuAD [32], and ZJQA [11].
| Method | QA LLMs | Baselines | Datasets |
| --- | --- | --- | --- |
| KAPING [9] | T5 (0.8B, 3B, 11B), T0 (3B, 11B), OPT (2.7B, 6.7B), GPT-3 (6.7B, 175B) | No knowledge, Random knowledge, Popular knowledge, Generated knowledge | WebQSP (w/ 2 KGs), Mintaka |
| Retrieve-Rewrite-Answer [11] | Llama-2 (7B, 13B), T5 (0.8B, 3B, 11B), Flan-T5 (80M, 3B, 11B), T0 (3B, 11B), ChatGPT | No knowledge, Triple-form knowledge, 2x Alternative KG-to-Text, 2x Rival model | WebQSP, WebQ, MetaQA, ZJQA (Chinese) |
| Keqing [10] | ChatGPT | ChatGPT, StructGPT, KB-BINDER | WebQSP, MetaQA-1hop, MetaQA-2hop, MetaQA-3hop |
#### 3.2.1 KAPING
KAPING [9] is one of the best IR-KGQA models that requires no training. Because the number of candidate triples can be large (27% of entities in WebQSP [19] have more than 1000 triples), a text embedding-based selection mechanism is employed, typically using cosine similarity [33], instead of appending all triples directly to the prompt. KAPING outperforms many baselines presented in Table 1 in terms of Hit@1, especially those with smaller LLMs, suggesting that external knowledge compensates for the limited parameter space. Notably, using 2-hop triples degrades performance, so only 1-hop triples are selected; when retrieval fails to fetch relevant triples, performance drops below a no-knowledge baseline. An additional finding is that triple-form text outperforms free-form text for retrieval, as converting triples to free-form via a KG-to-Text model often leads to semantic incoherence, and using free-form text in prompts does not improve answer generation.
#### 3.2.2 Retrieve-Rewrite-Answer
Motivated by KAPINGâs limitations, the Retrieve-Rewrite-Answer (RRA) architecture was developed for KGQA [11]. Unlike KAPING, which overlooked the impact of triple formatting, RRA introduces a novel triple verbalization module, among other changes. Specifically, question entities are extracted from annotated datasets (with entity matching deferred). The retrieval process consists of three steps: (i) a hop number is predicted via a classification task on the question embedding; (ii) relation pathsâsequences of KG relationshipsâare predicted by sampling and selecting the top- $K$ candidates based on total probability; (iii) selected relation paths are transformed into free-form text using a fine-tuned LLM. This verbalized output, together with the question, is fed to a QA LLM via a prompt template.
For training, the hop number and relation path classifiers, as well as the KG-to-Text LLM, are tuned on each benchmark. Due to the lack of relation path labels and subgraph-text pairs in most benchmarks, the authors employ various data construction techniques, limiting the modelâs generalizability across domains and KGs.
As detailed in Table 1, evaluations were carried out using several QA LLMs, baselines (no knowledge, triple-form knowledge, and two standard KG-to-Text models), and benchmark datasets, and compared with models from [9] and [22] on WebQ [30] and WebQSP [19] using the Hit@1 metric. The main results show that RRA significantly outperforms rival models, achieving an improvement of 1-8% over triple-form text and 1-5% over the best standard KG-to-Text model. Moreover, RRA is about 100 $\times$ more likely to produce a correct answer when the no-knowledge baseline fails, confirming the added value of IR-based KGQA models over vanilla LLMs.
#### 3.2.3 Keqing
Keqing, proposed in [10], is the third SOTA model that is positioned as an alternative to SQL-based retrieval systems. Its key innovation is a question decomposition module that uses a fine-tuned LLM to break a question into sub-questions. These sub-questions are matched to predefined templates via cosine similarity, with each template linked to specific KG relation paths. Candidate triples are retrieved based on these relation paths, and sub-questions are answered sequentially: the answer to one sub-question seeds the next. The triples obtained are verbalized and processed through a prompt template by a QA LLM, ultimately generating a final answer that reflects the model's reasoning chain.
In this approach, only the question decomposition LLM is trained using LoRA [34], which adds only a small fraction of trainable weights. However, the construction of sub-question templates and the acquisition of relation path labels are not clearly detailed, which may limit the systemâs scalability.
According to Table 1, Keqing outperforms vanilla ChatGPT and two rival models, achieving Hit@1 scores of 98.4% to 99.9% on the MetaQA benchmark and superior performance on the WebQSP benchmark. Its ability to clearly explain its reasoning through sub-question chains further underscores its contribution to answer explainability.
#### 3.2.4 Research Gap
After KAPING was introduced as the first KG-Augmented LLM for KGQA, RRA [11] and Keqing [10] followed, each employing different triple retrieval methods. Although all three use an LLM for question answering, KAPING relies on an untrained similarity-based retriever, while RRA and Keqing develop trainable retrieval modules, improving performance at the cost of significant engineering. Specifically, RRA trains separate modules (hop number classifier, relation path classifier, and KG-to-Text LLM) for each benchmark, requiring two custom training datasets (one for questions with relation path labels and one for triples with free-form text labels). The need for KG-specific techniques limits generalizability and raises concerns about the extra labor required when no Q&A dataset is available. Keqing fine-tunes an LLM for question decomposition to enhance answer interpretability and triple retrieval. This approach also demands a training dataset with sub-question templates and relation path labels, though the methods for constructing these remain unclear. Consequently, it is debatable whether the performance gains justify the additional engineering effort.
In summary, these shortcomings reveal a gap for models that are both as generalizable as KAPING and as explainable as Keqing. KAPINGâs training-free design allows minimal human intervention across diverse KGs and domains, even in the absence of benchmark datasets. For this reason, we propose an improvement to the KAPING model by introducing a question decomposition module.
### 3.3 Our Approach
KAPING, a SOTA method combining KGs and LLMs, outperforms many zero-shot baselines. However, its retrieval process, vital for accurate answer generation, can benefit from reducing irrelevant triple inclusion [9]. Therefore, we build on top of the KAPING model and propose to enhance it by integrating a question decomposition module to improve triple retrieval, answer accuracy, and explainability while maintaining application generalizability.
The proposed question decomposition module decomposes complex, multi-hop questions into simpler sub-questions. This allows the similarity-based retriever to focus on smaller, manageable pieces of information, thereby improving retrieval precision and yielding a more interpretable reasoning chain. Unlike conventional Chain-of-Thought prompting, which may induce hallucinated reasoning [35], decomposing the question forces the LLM to independently resolve each sub-question, ensuring fidelity to the stated reasoning. Our question decomposition module uses manually curated in-context learning examples for the KGQA benchmark, obviating the need for additional training and minimizing human labor. As a result, our approach aligns well with the goals of enhanced generalizability and answer explainability while potentially outperforming KAPING for multi-hop questions. The following section details the overall system architecture and the roles of its individual components.
### 3.4 System Architecture
Our system comprises multiple components, each executing a specific role in answering KG-based questions. The overall process involves four primary steps, with the first two being non-sequential:
1. Question Decomposition: The decomposition module splits the question into sub-questions. For simple queries, it avoids unnecessary decomposition.
2. Candidate Triple Retrieval: Given the question entity, the system retrieves all triples up to $N$ hops from the KG. Each triple is verbalized into text for subsequent selection via a sentence embedding model.
3. Sub-Question Answering: This sequential step answers each sub-question using the candidate triples. The process involves embedding the candidate triples to form a vector database, selecting the Top- $K$ similar triples for the sub-question, and reformulating subsequent sub-questions based on prior sub-answers.
4. Answer Synthesis: Finally, the system synthesizes the final answer from the sub-questions and their corresponding answers. The output also includes the chain-of-thought from the decomposition stage, enhancing interpretability.
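The four steps above can be sketched as an orchestrator over pluggable modules. In the actual system the decompose, reformulate, answer, and synthesize modules are LLM calls and select is the embedding-based Top-K retriever; all toy modules below are invented stand-ins.

```python
def kg_rag_answer(question, entity, kg, modules):
    """End-to-end sketch of the four-step KG-RAG pipeline."""
    sub_questions = modules["decompose"](question)           # step 1
    candidates = modules["retrieve"](entity, kg)             # step 2
    sub_answers = []
    for sq in sub_questions:                                 # step 3
        if sub_answers:
            # Reformulate using previously obtained sub-answers.
            sq = modules["reformulate"](sq, sub_answers)
        triples = modules["select"](sq, candidates)
        sub_answers.append(modules["answer"](sq, triples))
    # Step 4: synthesize the final answer from sub-questions and sub-answers.
    return modules["synthesize"](question, sub_questions, sub_answers)

kg = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]
toy_modules = {
    "decompose": lambda q: ["Who directed Inception?", "Where was [X] born?"],
    "retrieve": lambda e, graph: graph,  # stand-in for the N-hop retriever
    "reformulate": lambda sq, subs: sq.replace("[X]", subs[-1]),
    "select": lambda sq, cands: [t for t in cands if t[0] in sq],
    "answer": lambda sq, triples: triples[0][2],
    "synthesize": lambda q, sqs, sas: sas[-1],
}
final = kg_rag_answer("Where was the director of Inception born?",
                      "Inception", kg, toy_modules)
```

On this toy 2-hop example the first sub-answer ("Christopher Nolan") is substituted into the second sub-question before retrieval, mirroring the sequential resolution described above.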
Figure 2: The architecture of the proposed system. An example of a 2-hop question is included, to give an idea of the data structures that are involved in the end-to-end process. The green color indicates processing with the KG; the red block shows the embedding model and the blue modules utilize an LLM.
Figure 2 illustrates the system architecture, highlighting the data structures and interactions between components. The diagram shows how the question reformulation module, which processes all previous sub-answers, enables the sequential resolution of sub-questions until the final answer is generated by the answer synthesis module.
Different components utilize distinct data sources and models. The candidate triple retriever directly accesses the KG, while the similarity-based triple selection leverages an off-the-shelf sentence embedding model trained on question-answer pairs. The remaining modules (the decomposition module, sub-answer generator, question reformulator, and final answer generator) are implemented using an LLM.
### 3.5 System Components
#### 3.5.1 Question Decomposition
Overview
The question decomposition module splits a complex question into simpler sub-questions while generating an explicit reasoning chain, thereby enhancing both triple retrieval and answer explainability (Section 3.3). Inspired by Chain-of-Thought and In-Context Learning techniques [35], the module uses manually constructed ICL examples from the benchmark (Section 4.1). The prompt is designed to first elicit the reasoning chain (CoT) followed by the sub-questions, aligning with the natural text-based reasoning of LLMs.
Inputs and Outputs
As illustrated in Figure 2, the module takes a natural language question as input and outputs a string containing the reasoning chain and sub-questions. This output is post-processed to extract the CoT and store the sub-questions in a list.
Techniques
The decomposition prompt instructs the LLM to decide if a question requires decomposition. If so, it generates a CoT followed by sub-questions, strictly adhering to a specified format and avoiding irrelevant content. In-context examples, covering three question types from the MetaQA benchmark, guide the LLM, with the stop token `<END>` marking completion.
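The post-processing of the decomposition output can be sketched as follows. The exact textual layout assumed here (a reasoning block, a `Sub-questions:` header, numbered sub-questions, and the `<END>` stop token) is illustrative; the paper specifies only that the CoT precedes the sub-questions and that `<END>` marks completion.

```python
import re

def parse_decomposition(output: str):
    """Split raw decomposition output into a CoT string and a list of
    sub-questions. The 'Sub-questions:' header and numbered-line layout
    are assumptions made for illustration."""
    # Drop everything from the stop token onward.
    output = output.split("<END>")[0]
    # Assumed layout: reasoning chain first, then a 'Sub-questions:' header.
    cot, _, rest = output.partition("Sub-questions:")
    # Collect numbered lines such as '1. Which movies ...?'
    sub_questions = re.findall(r"^\s*\d+\.\s*(.+)$", rest, flags=re.MULTILINE)
    return cot.strip(), [q.strip() for q in sub_questions]
```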
Implementation Details
Here, we use a 4-bit quantized version of Mistral-7B-Instruct-v0.2 [29, 36], originally a 7.24B-parameter model that outperforms Llama 2 and Llama 1 in reasoning, mathematics, and code generation. The quantized model, sized at 4.37 GB https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF, is compatible with consumer-grade hardware (e.g., NVIDIA RTX 3060 12GB https://www.msi.com/Graphics-Card/GeForce-RTX-3060-VENTUS-2X-12G-OC). Fast inference is achieved using the llama.cpp package https://github.com/ggerganov/llama.cpp, and prompts are designed with LM Studio https://lmstudio.ai/.
Inference parameters (see Table 2) include a max tokens limit (256) to prevent runaway generation, a temperature of 0.3 to reduce randomness, and top-k (40) and min-p (0.05) settings to ensure controlled token sampling [37].
Table 2: The inference parameters that were used for the question decomposition LLM.
| Parameter | Value |
| --- | --- |
| Max Tokens | 256 |
| Temperature | 0.3 |
| Top-k | 40 |
| Min-p | 0.05 |
#### 3.5.2 Candidate Triple Retrieval
Overview
Candidate triple retrieval collects all triples up to $N$ hops from a given question entity in the KG, converting each triple into a text string of the form $(subject, relation, object)$. Although the worst-case complexity is exponential in the number of hops (approximately $\Theta(d^{N})$ for an undirected KG with average degree $d$), real-world KGs are sparse, making the average or median complexity more relevant (Section 4.1). The value of $N$ is treated as a hyperparameter.
Inputs and Outputs
This component accepts the question entity/entities as a natural language string and retrieves candidate triples from the KG. The output is a list of lists, where each sub-list corresponds to the candidate triples for each hop up to $N$. Each triple is stored as a formatted text string, with underscores replaced by spaces (e.g., "acted_in" becomes "acted in").
Techniques
Candidate triple retrieval employs a breadth-first search strategy. In the MetaQA benchmark, which uses a directed KG, retrieval can be unidirectional (considering only outgoing edges) or bidirectional (including both outgoing and incoming edges). For example, as illustrated in Figure 3, unidirectional retrieval from the Inception entity would only yield entities like 2010, Christopher Nolan, and Tom Hardy, whereas bidirectional retrieval expands the search across successive hops. This example underscores the impact of retrieval direction on both the candidate set and computational load.
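As a rough sketch of the retrieval strategy described above, the following breadth-first traversal collects candidate triples hop by hop, supporting both unidirectional and bidirectional retrieval. The adjacency-building and formatting details are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def retrieve_candidate_triples(triples, start_entity, n_hops, bidirectional=True):
    """Breadth-first candidate triple retrieval up to n_hops from a question
    entity. `triples` is an iterable of (subject, relation, object) tuples;
    the return value is a list with one sub-list of formatted triple strings
    per hop. Illustrative sketch only."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for s, r, o in triples:
        out_edges[s].append((s, r, o))
        in_edges[o].append((s, r, o))

    def fmt(triple):
        s, r, o = triple
        # Replace underscores in the relation, e.g. acted_in -> acted in.
        return f"({s}, {r.replace('_', ' ')}, {o})"

    frontier, visited = {start_entity}, {start_entity}
    hops = []
    for _ in range(n_hops):
        hop_triples, next_frontier = [], set()
        for entity in frontier:
            edges = out_edges[entity] + (in_edges[entity] if bidirectional else [])
            for s, r, o in edges:
                hop_triples.append(fmt((s, r, o)))
                for neighbor in (s, o):
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.add(neighbor)
        hops.append(hop_triples)
        frontier = next_frontier
    return hops
```

Running this on the Figure 3 subgraph with Inception as the question entity shows the difference in direction: unidirectional retrieval stops at Inception's outgoing edges, while bidirectional retrieval reaches Warrior on the second hop via Tom Hardy.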
<details>
<summary>extracted/6354852/Figs/KG_Example.png Details</summary>

### Visual Description
## Knowledge Graph Diagram: Movie Relationships
### Overview
The image displays a directed knowledge graph (or semantic network) illustrating relationships between movies, people, genres, and a release year. The graph consists of oval-shaped nodes containing text, connected by labeled, directed edges (arrows). The background is a uniform light gray.
### Components/Axes
**Nodes (Entities):**
1. **The Town** (Dark gray oval, top-left)
2. **2010** (Light gray oval, top-right)
3. **Inception** (White oval with black border, center-left)
4. **Christopher Nolan** (Light gray oval, center-right)
5. **Tom Hardy** (Light gray oval, lower-right)
6. **Warrior** (Dark gray oval, bottom-left)
7. **Drama** (Very dark gray/black oval, bottom-right)
**Edges (Relationships):**
- `release_year` (from "The Town" to "2010")
- `release_year` (from "Inception" to "2010")
- `directed_by` (from "Inception" to "Christopher Nolan")
- `starred_actors` (from "Inception" to "Tom Hardy")
- `starred_actors` (from "Warrior" to "Tom Hardy")
- `has_genre` (from "Warrior" to "Drama")
### Detailed Analysis
The graph defines the following factual relationships:
1. **Movie "The Town":**
- Has a `release_year` relationship pointing to the node "2010".
- **Interpretation:** The movie "The Town" was released in 2010.
2. **Movie "Inception":**
- Has a `release_year` relationship pointing to the node "2010".
- Has a `directed_by` relationship pointing to the node "Christopher Nolan".
- Has a `starred_actors` relationship pointing to the node "Tom Hardy".
- **Interpretation:** The movie "Inception" was released in 2010, was directed by Christopher Nolan, and starred actor Tom Hardy.
3. **Movie "Warrior":**
- Has a `starred_actors` relationship pointing to the node "Tom Hardy".
- Has a `has_genre` relationship pointing to the node "Drama".
- **Interpretation:** The movie "Warrior" starred actor Tom Hardy and belongs to the Drama genre.
### Key Observations
- **Shared Entity:** The node "Tom Hardy" is a shared entity, connected to two different movies ("Inception" and "Warrior") via the same relationship type (`starred_actors`).
- **Shared Attribute:** The node "2010" is a shared attribute, connected to two different movies ("The Town" and "Inception") via the same relationship type (`release_year`).
- **Node Styling:** Nodes are styled with different shades of gray. "Inception" is distinct with a white fill and black border, while "Drama" has the darkest fill. This may indicate different entity types (e.g., movies vs. people vs. attributes) or simply be a visual design choice.
- **Graph Structure:** The graph is not fully connected. "The Town" and "Warrior" are only connected to the rest of the graph through shared attribute nodes ("2010") or shared entity nodes ("Tom Hardy"), not directly to each other.
### Interpretation
This diagram is a visual representation of a **knowledge base** or **graph database** schema populated with specific instances related to films. It demonstrates how structured data can model complex, real-world relationships.
- **What it represents:** It models a small subset of a filmography database, capturing entities (movies, people, genres, years) and the predicates that link them. This structure is fundamental for semantic search, recommendation engines, and data integration.
- **Relationships:** The graph explicitly shows that relationships are directional (e.g., "Inception" `directed_by` "Christopher Nolan" is not the same as "Christopher Nolan" `directed` "Inception," though the inverse relationship is implied).
- **Data Utility:** Such a graph allows for complex queries. For example, one could infer: "Find all Drama movies released in 2010 that star Tom Hardy." While "Warrior" is a Drama and stars Tom Hardy, its release year is not shown. "Inception" stars Tom Hardy and was released in 2010, but its genre is not shown. The graph, as presented, does not contain enough information to answer that specific query, highlighting the importance of complete data in knowledge graphs.
- **Underlying System:** The use of underscored relationship labels (`release_year`, `starred_actors`) suggests this may be a visualization from a system using a formal ontology or a specific graph database query language.
</details>
Figure 3: A simple subgraph of triples from MetaQA [20]. As indicated by the arrows, this KG is a directed graph, which has implications for candidate triple retrieval. If Inception were the entity we were retrieving for, each darker tint of gray shows the entities that would be reached one hop deeper.
Implementation Details
The MetaQA benchmark provides the KG as a text file with one triple per row. This file is pre-processed into a compressed KG with indexed entities and relationships to streamline retrieval and minimize memory usage. Each triple is embedded using a sentence embedding model (introduced in Section 3.5.3), forming a dictionary of embeddings that enhances retrieval efficiency by avoiding redundant computations. Retrieval is performed bidirectionally up to 3 hops, i.e., $N\in\{1,2,3\}$ .
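The pre-processing into an indexed KG can be sketched as below. Mapping entities and relations to integer ids keeps the in-memory representation compact; the `|` delimiter follows the MetaQA kb.txt layout, and the function name is hypothetical.

```python
def index_kg(lines, delimiter="|"):
    """Build a compact, integer-indexed KG from a MetaQA-style text file in
    which each row is 'subject|relation|object'. Illustrative sketch only."""
    entity_id, relation_id, triples = {}, {}, []

    def get_id(table, key):
        # Assign the next free integer id on first sight of a key.
        if key not in table:
            table[key] = len(table)
        return table[key]

    for line in lines:
        s, r, o = line.strip().split(delimiter)
        triples.append((get_id(entity_id, s),
                        get_id(relation_id, r),
                        get_id(entity_id, o)))
    return entity_id, relation_id, triples
```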
#### 3.5.3 Sub-Question Answering
Overview
Once the question is decomposed into sub-questions and candidate triples are retrieved for the given entity/entities, the sub-question answering process begins. Iteratively, the sub-question and candidate triples are embedded using a sentence embedding model, and the top- $K$ similar triples are selected to generate a sub-answer via an LLM. This sub-answer is then used to reformulate subsequent sub-questions if needed (see Figure 2), continuing until all sub-questions are answered.
Inputs and Outputs
Inputs include candidate triples (a list of strings, pre-embedded from the MetaQA KG) and a list of sub-questions. The output comprises two lists of strings: one containing the sub-answers and another with the reformulated sub-questions, both of which contribute to the final answer synthesis.
Techniques
The process employs similarity-based retrieval where both the sub-question and candidate triples are embedded with the same model, and their dot-product similarity is computed. The top- $K$ triples are then passed to a zero-shot LLM answer generator along with the sub-question. Unlike Keqing's multiple-choice approach [10] (Section 3.2.3), this method allows the LLM to reason over the context. A zero-shot LLM also performs question reformulation.
Implementation Details
The similarity-based triple selection uses the multi-qa-mpnet-base-dot-v1 https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1 model from the sentence_transformers https://www.sbert.net/ package, which embeds text into 768-dimensional vectors. Similarity is computed as the dot product between these vectors, and the model is run locally on the GPU. Both the sub-question answering and question reformulation LLMs use parameters from Table 2 with minor adjustments: the sub-question answering LLM employs a repeat_penalty of 1.1 to mitigate repetitive output, while the reformulation module uses "?" as the stop token to restrict its output to a properly reformulated question.
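The similarity-based top- $K$ selection reduces to a dot product followed by a sort. The sketch below uses NumPy with random stand-in embeddings; in the actual system the 768-dimensional vectors come from multi-qa-mpnet-base-dot-v1.

```python
import numpy as np

def top_k_triples(question_emb, triple_embs, triples, k):
    """Select the K candidate triples whose embeddings have the highest
    dot-product similarity with the (reformulated) sub-question embedding.
    Random vectors stand in for sentence embeddings in this sketch."""
    scores = triple_embs @ question_emb   # one dot product per triple
    order = np.argsort(-scores)[:k]       # indices of the K highest scores
    return [triples[i] for i in order]
```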
#### 3.5.4 Answer Synthesis
Overview
The final step synthesizes an answer to the original question using the generated reasoning chain, sub-questions, and sub-answers. This output, which includes the reasoning chain, provides transparency into the system's decision-making process.
Inputs and Outputs
Inputs comprise the main question, reasoning chain, sub-questions (reformulated if applicable), and sub-answers, all as strings. The output is a single natural language string that integrates both the final answer and the reasoning chain.
Techniques
A custom zero-shot prompt instructs the LLM to formulate the final answer from the provided context. The prompt template merges the main question, sub-questions, and sub-answers, and subsequently incorporates the reasoning chain into the final output. This straightforward zero-shot approach was preferred over ICL due to the simplicity of the final synthesis task compared to the more complex decomposition step.
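A minimal sketch of such a prompt template is given below; the exact wording is an assumption, since the text above specifies only which fields the prompt merges.

```python
def build_synthesis_prompt(question, sub_questions, sub_answers, cot):
    """Assemble a zero-shot answer-synthesis prompt from the main question,
    the sub-question/sub-answer pairs, and the reasoning chain. The template
    phrasing is illustrative, not the authors' prompt."""
    qa_context = "\n".join(
        f"Sub-question: {q}\nSub-answer: {a}"
        for q, a in zip(sub_questions, sub_answers)
    )
    return (
        f"Question: {question}\n"
        f"Reasoning chain: {cot}\n"
        f"{qa_context}\n"
        "Using only the context above, give the final answer to the question."
    )
```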
Implementation Details
The LLM parameters mirror those in Table 2, with the exception of max_tokens, which is increased to 512 to accommodate the typically more complex final answers.
## 4 Experiments
The goal of our experiments is to check the usefulness of a KG in question answering and whether our approach, i.e., using an additional question decomposition module, results in a better performance. For this purpose, we use a widely-used Knowledge Graph Question Answering (KGQA) benchmark called MetaQA [20]. In order to verify whether we achieved our objectives, we assess three baselines: a stand-alone LLM, an LLM with an LLM-based question decomposition module, and an LLM with a KG (i.e., KAPING). Finally, the experimental results are presented and discussed.
### 4.1 Dataset
The MetaQA benchmark, introduced in 2017, addresses the need for KGQA benchmarks featuring multi-hop questions over large-scale KGs, extending the original WikiMovies benchmark with movie-domain questions of varying hop counts [20].
Several factors motivated the selection of MetaQA for this research. First, its questions are categorized by hop count, enabling detailed analysis of multi-hop performance, a key area for improvement via question decomposition. Second, each question includes an entity label, avoiding the complexities of entity linking; many benchmarks, which focus on neural semantic parsing for SPARQL query generation, lack such labels [38]. Third, MetaQA's simplicity and locally processable KG make it ideal for studies with limited resources, in contrast to highly complex KGs like Wikidata (over 130 GB, 1.57 billion triples, 12,175 relation types https://www.wikidata.org/wiki/Wikidata:Main_Page).
Data
MetaQA consists of three datasets (1-hop, 2-hop, and 3-hop), each split into train, validation, and test sets, and further divided into three components: vanilla, NTM text data, and audio data [20]. This research utilizes only the vanilla data, where the 1-hop dataset contains original WikiMovies questions and the 2-hop and 3-hop datasets are generated using predefined templates. Each dataset row includes a question, its associated entity, and answer entities.
Knowledge Graph
The MetaQA benchmark provides a KG as a text file with each row representing a triple. The KG comprises 43,234 entities and 9 relation types, with movie titles as subjects. Figure 4 illustrates the degree distribution: most entities have few associated triples (median of 4), while the long-tailed distribution includes entities with up to 4431 triples.
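The per-entity triple counts underlying Figure 4 can be computed with a single pass over the triples. This sketch counts both subject and object occurrences, consistent with reading the degree as "triples per entity" in an undirected sense.

```python
from collections import Counter

def degree_distribution(triples):
    """Count the number of triples incident to each entity (subject or
    object position). Returns a Counter mapping entity -> triple count.
    Illustrative sketch of the statistic shown in Figure 4."""
    degrees = Counter()
    for s, _, o in triples:
        degrees[s] += 1
        degrees[o] += 1
    return degrees
```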
<details>
<summary>extracted/6354852/Figs/MetaQA_KG.png Details</summary>

### Visual Description
## Bar Chart: Distribution of Triples per Entity in MetaQA
### Overview
This image is a bar chart (histogram) titled "Distribution of Triples per Entity in MetaQA." It visualizes the frequency distribution of how many knowledge graph triples are associated with each entity within the MetaQA dataset. The chart shows a classic right-skewed (long-tail) distribution, where a vast majority of entities are associated with a small number of triples, and a progressively smaller number of entities are associated with a larger number of triples.
### Components/Axes
* **Title:** "Distribution of Triples per Entity in MetaQA" (centered at the top).
* **X-Axis:** Labeled "Number of triples per entity." It is a linear scale with major tick marks at 0, 5, 10, 15, 20, 25, and 30. The axis represents discrete counts, with bars centered on integer values from 1 to 30.
* **Y-Axis:** Labeled "Frequency." It is a linear scale with major tick marks at 0, 2000, 4000, 6000, 8000, 10000, and 12000. This represents the count of entities.
* **Data Series:** A single series represented by blue vertical bars. There is no legend, as the chart displays one dataset.
* **Spatial Layout:** The chart area is bounded by a black frame. The title is positioned above the frame. The axis labels are centered below the x-axis and to the left of the y-axis (rotated 90 degrees).
### Detailed Analysis
The chart displays the frequency (y-axis) for each discrete number of triples per entity (x-axis). Below are the approximate values extracted from the bar heights. **Note:** Values are approximate, read from the visual scale.
* **1 triple:** ~11,800 entities (The tallest bar, dominating the chart).
* **2 triples:** ~5,100 entities.
* **3 triples:** ~2,500 entities.
* **4 triples:** ~3,000 entities (A slight increase from 3 triples).
* **5 triples:** ~2,700 entities.
* **6 triples:** ~2,800 entities.
* **7 triples:** ~2,500 entities.
* **8 triples:** ~2,300 entities.
* **9 triples:** ~1,800 entities.
* **10 triples:** ~1,400 entities.
* **11 triples:** ~1,100 entities.
* **12 triples:** ~900 entities.
* **13 triples:** ~700 entities.
* **14 triples:** ~600 entities.
* **15 triples:** ~500 entities.
* **16 triples:** ~400 entities.
* **17 triples:** ~350 entities.
* **18 triples:** ~300 entities.
* **19 triples:** ~250 entities.
* **20 triples:** ~200 entities.
* **21-30 triples:** The frequencies continue to decline steadily, with each subsequent bar being slightly shorter than the last. By 30 triples, the frequency is very low, appearing to be less than 100 entities.
**Trend Verification:** The visual trend is a steep, exponential-like decay from 1 to 3 triples, followed by a more gradual, roughly linear decline from 4 triples onward. There is a minor local peak at 4 triples.
### Key Observations
1. **Extreme Right Skew:** The distribution is heavily skewed to the right. The single category of entities with only 1 triple accounts for the largest proportion of all entities.
2. **Dominance of Low-Connectivity Entities:** The vast majority of entities in the MetaQA dataset have a low number of associated triples (fewer than 10).
3. **Long Tail:** A significant "long tail" exists, showing that while rare, some entities are highly connected, with up to 30 or more triples.
4. **Minor Anomaly at 4 Triples:** There is a small but noticeable increase in frequency at 4 triples compared to 3 triples, breaking the otherwise smooth decline. This could be a dataset-specific characteristic.
### Interpretation
This distribution is characteristic of many real-world networks and knowledge graphs, often following a power-law or scale-free pattern. It suggests that the MetaQA knowledge graph is structured with a core of highly connected "hub" entities (those in the long tail) and a periphery of many sparsely connected entities.
* **Data Implication:** The high frequency of entities with only 1 triple indicates that many concepts in the dataset are only mentioned in a single relational context. This could pose challenges for machine learning models that rely on multi-hop reasoning or require rich contextual information about an entity.
* **Structural Insight:** The presence of entities with 20-30 triples suggests the existence of central, well-defined concepts (e.g., major characters, key locations, or core events in a narrative domain) around which many facts are organized.
* **Anomaly Consideration:** The slight bump at 4 triples might indicate a common pattern or a specific subset of entities that naturally participate in four types of relationships within the dataset's domain. Further investigation into the dataset's schema would be needed to confirm this.
In summary, the chart reveals a knowledge graph where connectivity is highly unequal, dominated by many weakly connected entities and a few strongly connected ones, which is a fundamental property to consider when using MetaQA for tasks like question answering or link prediction.
</details>
Figure 4: The distribution of degrees (triples per entity) in the MetaQA KG. (Note that the distribution is long-tailed, so the cut-off at the value of 30 is for the purpose of visualization.)
### 4.2 Experimental design
In this study, we carry out two experiments:
1. The goal of Experiment 1 is to determine how the model parameters impact performance, in order to select a parameter configuration that yields consistent performance across the different question types. The chosen configuration is then used to compare the system to the baselines in the second experiment.
2. The main goal of Experiment 2 is to determine how the different components of the system impact performance and overall behavior. This is achieved by comparing the performance of the full system with specific baselines, which are essentially combinations of system components.
#### 4.2.1 Experiment 1: Model selection
Experiment 1 investigates the effect of model parameters on performance to determine a configuration that yields consistent results across different question types. The parameters under examination are the number of hops $N$ for candidate triple retrieval (tested with values 1, 2, 3) and the number of top triples $K$ selected for each sub-question (tested with values 10, 20, 30), consistent with values reported in the literature (Section 2).
For each MetaQA test dataset, 100 questions are sampled using a fixed seed, and the system is evaluated across all parameter combinations. This process is repeated with 10 different seeds (0–9) to capture performance variability, and all LLM components use the same seed for inference to ensure reproducibility.
Performance is measured using the Hit@1 metric, which checks if the generated answer exactly matches any of the label answer entities (after lowercasing and stripping). For example, if the label is "Brad Pitt" and the generated answer is "Pitt is the actor in question," the response is deemed incorrect. The final score for each dataset sample is the average Hit@1.
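The Hit@1 check described above amounts to a normalized exact-match comparison:

```python
def hit_at_1(generated_answer: str, label_answers) -> bool:
    """Hit@1 as described above: the generated answer counts as a hit only
    if, after lowercasing and stripping, it exactly matches one of the
    label answer entities."""
    normalized = generated_answer.strip().lower()
    return any(normalized == label.strip().lower() for label in label_answers)
```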
#### 4.2.2 Experiment 2: A Comparative Analysis with Baselines
Experiment 2 assesses how individual system components influence overall performance by comparing the full system to three baselines:
1. LLM: Uses only an LLM with a simple zero-shot prompt to directly answer the question.
2. LLM+QD: Incorporates the question decomposition module to split questions and reformulate sub-questions before answering with the same zero-shot prompt as the LLM baseline.
3. LLM+KG: Functions as the full system without the question decomposition component, making it equivalent to KAPING [9]; it employs candidate triple retrieval, top- $K$ triple selection, and the sub-question answering module.
Both the full system and the LLM+KG baseline use the parameter configuration selected in Section 4.3.1. Following the sampling procedure of Experiment 1, 500 questions are sampled per MetaQA dataset using 8 different seeds (0–7) to ensure consistency. Performance is quantitatively evaluated using the Hit@1 metric to determine the impact of different components, and results are qualitatively analyzed for error insights and to assess accuracy, explainability, and generalizability as outlined in Section 3.1.
### 4.3 Results and Discussion
#### 4.3.1 Experiment 1: Quantitative analysis
The results of Experiment 1 (Figure 5) indicate high overall performance that decreases with increasing question complexity, with standard deviations remaining low ( $\leq 0.063$ ) across samples.
<details>
<summary>extracted/6354852/Figs/exp1_metaqa_1.png Details</summary>

### Visual Description
## Bar Chart: MetaQA 1-Hop Hit@1 Scores (Mean ± Std) for Different N and K
### Overview
This is a grouped bar chart displaying the performance of a system on the MetaQA dataset. The performance metric is the "Hit@1 Score," presented as a mean value with error bars representing the standard deviation. The chart compares performance across two variables: the number of hops for candidate retrieval (N) and a parameter labeled "K".
### Components/Axes
* **Chart Title:** "MetaQA 1-Hop Hit@1 Scores (Mean ± Std) for Different N and K"
* **Y-Axis:**
* **Label:** "Hit@1 Score"
* **Scale:** Linear, from 0.0 to 1.0, with major gridlines at intervals of 0.2.
* **X-Axis:**
* **Label:** "Number of Hops for Candidate Retrieval (N)"
* **Categories:** Three discrete groups labeled "1", "2", and "3".
* **Legend:**
* **Title:** "K"
* **Location:** Bottom-right corner of the plot area.
* **Categories & Colors:**
* `K=10`: Light blue bar.
* `K=20`: Medium blue bar.
* `K=30`: Dark blue bar.
* **Data Representation:** For each category of N (1, 2, 3), there are three adjacent bars corresponding to K=10, K=20, and K=30. Each bar has a black error bar (whisker) extending vertically from its top, indicating the standard deviation (Std) of the mean score.
### Detailed Analysis
The chart presents the mean Hit@1 score for nine distinct conditions (3 values of N × 3 values of K). All scores are high, clustered between approximately 0.90 and 0.96.
**Trend Verification & Data Points (Approximate Values):**
The general visual trend is a slight decrease in the mean score as N increases from 1 to 3. Within each N group, the score tends to increase slightly as K increases.
* **For N = 1 (Leftmost group):**
* The bars show the highest overall performance.
* `K=10` (light blue): Mean ≈ 0.95. Error bar spans ≈ 0.93 to 0.97.
* `K=20` (medium blue): Mean ≈ 0.955. Error bar spans ≈ 0.94 to 0.97.
* `K=30` (dark blue): Mean ≈ 0.96. Error bar spans ≈ 0.94 to 0.98.
* **Observation:** Scores are very close, with a very slight upward trend from K=10 to K=30. Variability (error bar length) is similar across all three.
* **For N = 2 (Middle group):**
* Performance is slightly lower than for N=1.
* `K=10` (light blue): Mean ≈ 0.94. Error bar spans ≈ 0.92 to 0.96.
* `K=20` (medium blue): Mean ≈ 0.945. Error bar spans ≈ 0.93 to 0.96.
* `K=30` (dark blue): Mean ≈ 0.95. Error bar spans ≈ 0.93 to 0.97.
* **Observation:** The pattern mirrors N=1, with a minor increase in mean score as K increases. The absolute values are roughly 0.01-0.02 points lower than their N=1 counterparts.
* **For N = 3 (Rightmost group):**
* This group shows the lowest mean scores and the most noticeable separation between K values.
* `K=10` (light blue): Mean ≈ 0.91. Error bar spans ≈ 0.88 to 0.94. This is the lowest mean score on the chart.
* `K=20` (medium blue): Mean ≈ 0.92. Error bar spans ≈ 0.89 to 0.95.
* `K=30` (dark blue): Mean ≈ 0.93. Error bar spans ≈ 0.91 to 0.95.
* **Observation:** The downward trend with increasing N is most pronounced here. The benefit of a higher K (30 vs. 10) is also most visually apparent in this group. The error bars, especially for K=10 and K=20, appear slightly longer, suggesting potentially higher variance in results at N=3.
### Key Observations
1. **Dominant Trend (N):** There is a consistent, monotonic decrease in the mean Hit@1 score as the number of hops (N) increases from 1 to 3. This suggests the task becomes more difficult for the system as more retrieval hops are required.
2. **Secondary Trend (K):** Within each N group, a higher K value (K=30) consistently yields a slightly higher mean score than a lower K value (K=10). This positive effect of K is subtle at N=1 and N=2 but becomes more distinct at N=3.
3. **Performance Ceiling:** All reported mean scores are above 0.90, indicating very high system performance on this 1-hop task across all tested configurations.
4. **Variance:** The standard deviations (error bars) are relatively small and consistent across most conditions, indicating stable results. The variance appears to increase slightly for the more challenging condition (N=3, K=10).
### Interpretation
The data demonstrates a clear relationship between retrieval complexity (N), a system parameter (K), and performance (Hit@1 Score) on the MetaQA benchmark.
* **What the data suggests:** The system's ability to correctly identify the top candidate (Hit@1) degrades gracefully as the retrieval chain lengthens (N increases). The parameter K acts as a mitigating factor; a larger K (e.g., 30) provides a performance buffer, especially when the task is harder (N=3). This could imply that considering more candidates (higher K) helps compensate for the increased difficulty or potential error propagation in multi-hop retrieval.
* **How elements relate:** The x-axis (N) represents task difficulty, the legend (K) represents a model or retrieval hyperparameter, and the y-axis is the success metric. The chart effectively shows their interaction: the negative impact of increasing N is partially offset by increasing K.
* **Notable patterns/anomalies:** The most notable pattern is the non-uniform impact of K. Its benefit is minimal at low N but becomes significant at high N. This suggests K is a more critical hyperparameter for complex, multi-step reasoning tasks. There are no apparent anomalies; the trends are smooth and consistent. The high baseline performance (>0.90) indicates the 1-hop MetaQA task may be relatively straightforward for the evaluated system.
</details>
<details>
<summary>extracted/6354852/Figs/exp1_metaqa_2.png Details</summary>

### Visual Description
## Bar Chart: MetaQA 2-Hop Hit@1 Scores (Mean ± Std) for Different N and K
### Overview
This is a grouped bar chart displaying the performance of a system (likely a question-answering model) on the MetaQA dataset. The performance metric is the "Hit@1 Score," which measures the accuracy of the top retrieved answer. The chart compares performance across different numbers of hops for candidate retrieval (N) and different values of a parameter K. The data is presented as mean scores with error bars representing standard deviation.
### Components/Axes
* **Chart Title:** "MetaQA 2-Hop Hit@1 Scores (Mean ± Std) for Different N and K" (Position: Top center).
* **Y-Axis:**
* **Label:** "Hit@1 Score" (Position: Left side, rotated vertically).
* **Scale:** Linear scale from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Label:** "Number of Hops for Candidate Retrieval (N)" (Position: Bottom center).
* **Categories:** Three discrete values: 1, 2, and 3.
* **Legend:**
* **Title:** "K" (Position: Inside the plot area, bottom-right corner).
* **Categories & Colors:**
* `K=10`: Light blue bar.
* `K=20`: Medium blue bar.
* `K=30`: Dark blue bar.
* **Data Representation:** For each value of N (1, 2, 3), there is a group of three bars, one for each K value (10, 20, 30), ordered from left to right as K=10, K=20, K=30 within each group. Each bar has a black error bar extending vertically from its top.
### Detailed Analysis
**Data Points (Approximate Mean Values with Visual Uncertainty from Error Bars):**
* **For N = 1:**
* **Trend:** All scores are very low, clustered near the bottom of the chart.
* **K=10 (Light Blue):** Mean ≈ 0.06. Error bar spans approximately 0.02 to 0.10.
* **K=20 (Medium Blue):** Mean ≈ 0.06. Error bar spans approximately 0.03 to 0.09.
* **K=30 (Dark Blue):** Mean ≈ 0.06. Error bar spans approximately 0.03 to 0.09.
* **Observation:** Performance is uniformly poor and nearly identical across all K values for a single hop.
* **For N = 2:**
* **Trend:** A dramatic increase in performance compared to N=1. Scores are the highest on the chart. There is a slight positive trend with increasing K.
* **K=10 (Light Blue):** Mean ≈ 0.85. Error bar spans approximately 0.81 to 0.89.
* **K=20 (Medium Blue):** Mean ≈ 0.89. Error bar spans approximately 0.86 to 0.92.
* **K=30 (Dark Blue):** Mean ≈ 0.91. Error bar spans approximately 0.88 to 0.94.
* **Observation:** This is the peak performance configuration. The mean score increases monotonically with K, and the error bars are relatively small, indicating consistent high performance.
* **For N = 3:**
* **Trend:** Performance decreases compared to N=2 but remains significantly higher than N=1. The positive trend with increasing K persists.
* **K=10 (Light Blue):** Mean ≈ 0.73. Error bar spans approximately 0.68 to 0.78.
* **K=20 (Medium Blue):** Mean ≈ 0.77. Error bar spans approximately 0.74 to 0.80.
* **K=30 (Dark Blue):** Mean ≈ 0.79. Error bar spans approximately 0.75 to 0.83.
* **Observation:** Performance drops from the N=2 peak but is still robust. The standard deviation (error bar length) appears slightly larger for K=10 compared to K=20 and K=30 at this N.
### Key Observations
1. **Dominant Effect of N:** The number of hops (N) has the most significant impact on performance. There is a sharp, non-linear increase from N=1 to N=2, followed by a moderate decrease from N=2 to N=3.
2. **Secondary Effect of K:** For a fixed N (especially N=2 and N=3), increasing the parameter K leads to a consistent, though smaller, improvement in the mean Hit@1 score.
3. **Optimal Configuration:** The highest mean score (≈0.91) is achieved with N=2 and K=30.
4. **Low Performance at N=1:** The system performs very poorly (scores < 0.1) when only one hop is used for candidate retrieval, regardless of the K value.
5. **Error Bar Consistency:** The standard deviations (error bars) are generally proportional to the mean scores, being very small for low scores (N=1) and larger for higher scores (N=2, N=3). This suggests the variance in performance scales with the mean.
### Interpretation
The data suggests a clear narrative about the retrieval mechanism in this 2-hop QA task:
* **The Critical Role of Multi-Hop Retrieval:** The catastrophic failure at N=1 indicates that single-hop retrieval is fundamentally insufficient for this task. The system likely requires at least two retrieval steps (N=2) to gather the necessary evidence to answer 2-hop questions effectively. The drop at N=3 might indicate that introducing a third hop adds noise or irrelevant information, slightly degrading performance compared to the optimal two-hop process.
* **The Value of a Larger Candidate Pool (K):** Increasing K (the number of candidates considered at each retrieval step) consistently improves accuracy. This implies that casting a wider net during retrieval increases the likelihood of capturing the correct supporting facts. However, the gains from K show diminishing returns, as the improvement from K=20 to K=30 is smaller than from K=10 to K=20.
* **System Behavior:** The system appears to be well-calibrated for the core challenge (N=2), achieving high and stable accuracy. The performance profile is logical: too little retrieval (N=1) fails, optimal retrieval (N=2) succeeds, and excessive retrieval (N=3) introduces slight inefficiency. The consistent benefit of larger K suggests the retrieval module's recall is a key performance driver.
**In summary, the chart demonstrates that for MetaQA 2-hop questions, an optimal retrieval strategy involves two hops (N=2) with a sufficiently large candidate pool (K=30), yielding a high mean Hit@1 score of approximately 0.91.**
</details>
<details>
<summary>extracted/6354852/Figs/exp1_metaqa_3.png Details</summary>

### Visual Description
## Bar Chart: MetaQA 3-Hop Hit@1 Scores (Mean ± Std) for Different N and K
### Overview
This is a grouped bar chart displaying the performance of a system on the MetaQA 3-Hop question answering task. The performance metric is the Hit@1 Score, presented as a mean value with error bars representing the standard deviation. The chart compares performance across two variables: the Number of Hops for Candidate Retrieval (N) and a parameter labeled K.
### Components/Axes
* **Chart Title:** "MetaQA 3-Hop Hit@1 Scores (Mean ± Std) for Different N and K"
* **Y-Axis:**
* **Label:** "Hit@1 Score"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Ticks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Grid Lines:** Horizontal dashed lines at each major tick.
* **X-Axis:**
* **Label:** "Number of Hops for Candidate Retrieval (N)"
* **Categories:** Three discrete values: 1, 2, and 3.
* **Legend:**
* **Title:** "K"
* **Location:** Bottom-right corner of the plot area.
* **Categories & Colors:**
* `K=10`: Light blue (leftmost bar in each group).
* `K=20`: Medium blue (middle bar in each group).
* `K=30`: Dark blue (rightmost bar in each group).
* **Data Representation:** Grouped bars with error bars. Each group on the x-axis (N=1, N=2, N=3) contains three bars corresponding to K=10, K=20, and K=30.
### Detailed Analysis
**Data Points (Approximate Mean Values & Standard Deviation Ranges):**
* **N = 1:**
* **K=10 (Light Blue):** Mean ≈ 0.43. Error bar spans approximately 0.39 to 0.47.
* **K=20 (Medium Blue):** Mean ≈ 0.43. Error bar spans approximately 0.39 to 0.47.
* **K=30 (Dark Blue):** Mean ≈ 0.43. Error bar spans approximately 0.39 to 0.47.
* *Trend:* All three K values yield nearly identical performance at N=1.
* **N = 2:**
* **K=10 (Light Blue):** Mean ≈ 0.43. Error bar spans approximately 0.40 to 0.46.
* **K=20 (Medium Blue):** Mean ≈ 0.53. Error bar spans approximately 0.48 to 0.58.
* **K=30 (Dark Blue):** Mean ≈ 0.54. Error bar spans approximately 0.49 to 0.59.
* *Trend:* Performance for K=20 and K=30 increases notably compared to N=1, while K=10 remains flat. K=20 and K=30 are very close.
* **N = 3:**
* **K=10 (Light Blue):** Mean ≈ 0.51. Error bar spans approximately 0.45 to 0.57.
* **K=20 (Medium Blue):** Mean ≈ 0.62. Error bar spans approximately 0.58 to 0.66.
* **K=30 (Dark Blue):** Mean ≈ 0.62. Error bar spans approximately 0.57 to 0.67.
* *Trend:* Performance for all K values increases compared to N=2. K=20 and K=30 again show very similar, higher performance than K=10.
### Key Observations
1. **Positive Correlation with N:** For a fixed K (especially K=20 and K=30), the Hit@1 Score generally increases as the Number of Hops for Candidate Retrieval (N) increases from 1 to 3.
2. **Impact of K:** At N=1, the parameter K has no discernible effect on performance. At N=2 and N=3, higher K values (20 and 30) lead to significantly better performance than K=10. The difference between K=20 and K=30 is minimal across all N.
3. **Performance Plateau for K:** There appears to be a diminishing return or plateau in performance when increasing K from 20 to 30, as their mean scores and error bars overlap substantially at N=2 and N=3.
4. **Variability:** The standard deviation (error bars) is relatively consistent across most data points, suggesting similar levels of variance in the results, though it appears slightly larger for the N=3, K=10 data point.
### Interpretation
The data suggests that for the MetaQA 3-Hop task, increasing the depth of candidate retrieval (N) is beneficial for improving the accuracy of the top-ranked answer (Hit@1). This benefit is most pronounced when the system is allowed to consider a larger set of candidates (higher K).
The lack of difference between K values at N=1 implies that with only a single retrieval hop, the system's performance is bottlenecked by the initial retrieval step, and simply retrieving more candidates (increasing K) does not help. However, as the retrieval process becomes more complex (N=2 or 3), having a larger candidate pool (K=20 or 30) becomes crucial for achieving higher accuracy, likely because it provides more material for the multi-hop reasoning process to work with.
The near-identical performance of K=20 and K=30 indicates that beyond a certain point (K=20), adding more candidates does not yield further significant gains for this specific task and metric. This could point to an optimal resource-accuracy trade-off, where K=20 might be sufficient. The overall trend highlights the importance of multi-hop retrieval (N>1) combined with an adequately sized candidate set (K≥20) for effective performance on complex, multi-hop question answering.
</details>
Figure 5: MetaQA performance results for Experiment 1, over 10 samples of 100 questions for each of the three datasets. The bars show the mean Hit@1 for different parameter configurations; the error bars show the standard deviation.
Performance is highest when the parameter $N$ equals the actual number of hops in the questions. As expected, for the 2-hop dataset, $N=1$ yields poor results; however, for the 3-hop dataset, performance with $N<3$ is unexpectedly high due to MetaQA's question templates: for instance, some 3-hop questions (e.g., "Who are the directors of the films written by the writer of Blue Collar?") can be answered with $N=1$ triples. This represents a limitation of the MetaQA benchmark.
When holding the dataset and $N$ constant, increasing $K$ (the number of top triples selected) from 10 to 30 shows minimal effect on the 1-hop dataset, with slight improvements observed for the 2-hop and 3-hop datasets. Given that a higher $K$ is unlikely to reduce performance and is more likely to include the necessary triples, $K=30$ is chosen.
Considering the trade-offs across datasets, a balanced configuration is selected. Since $N=1$ is unacceptable for 2-hop questions and improved performance on 3-hop questions likely requires all candidate triples up to 3 hops, $N=3$ is deemed the best choice despite a minor reduction in 2-hop performance (0.787 $\pm$ 0.046). Consequently, the optimal parameter configuration for MetaQA is $N=3$ and $K=30$ .
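To make the roles of $N$ and $K$ concrete, the retrieval step can be sketched as: collect all triples within $N$ hops of the question entity, then keep the $K$ triples whose embeddings are most similar to the question embedding. The graph traversal, the toy hashing embedding, and all function names below are illustrative assumptions, not the system's actual implementation.

```python
import math
import zlib
from collections import defaultdict

def n_hop_triples(triples, seed_entity, n):
    """Collect all triples reachable within n hops of the seed entity (undirected BFS)."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((s, p, o))
        adj[o].append((s, p, o))
    frontier, seen_entities, found = {seed_entity}, {seed_entity}, []
    for _ in range(n):
        next_frontier = set()
        for entity in frontier:
            for s, p, o in adj[entity]:
                if (s, p, o) not in found:
                    found.append((s, p, o))
                for neighbor in (s, o):
                    if neighbor not in seen_entities:
                        seen_entities.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return found

def toy_embed(text, dim=64):
    """Deterministic bag-of-words hashing embedding (stand-in for a real text encoder)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    return vec

def top_k_triples(question, candidates, k, embed=toy_embed):
    """Rank candidate triples by cosine similarity to the question and keep the top K."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0
    q = embed(question)
    return sorted(candidates, key=lambda t: cosine(q, embed(" ".join(t))), reverse=True)[:k]
```

The sketch makes the trade-off visible: a larger $N$ widens the candidate neighborhood (more recall, more noise), while $K$ caps how many of those candidates are passed on to the LLM.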
#### 4.3.2 Experiment 2: Quantitative analysis
Figure 6 presents the performance results for Experiment 2 across 8 samples of 500 questions per MetaQA dataset. Our system significantly outperforms the baselines on 2-hop and 3-hop questions with minimal variance, while the LLM+KG baseline slightly outperforms ours on 1-hop questions. This is expected, as question decomposition adds unnecessary overhead for simple queries.
<details>
<summary>extracted/6354852/Figs/results_exp2_MetaQA.png Details</summary>

### Visual Description
## Bar Chart: MetaQA Hit@1 Scores (Mean ± Std) for Our Method and Baselines
### Overview
This is a grouped bar chart comparing the performance of four different models on the MetaQA dataset across three levels of question complexity (1-hop, 2-hop, 3-hop). The performance metric is the Hit@1 Score, presented as the mean with standard deviation error bars. The chart demonstrates how model performance degrades as the reasoning complexity (number of hops) increases.
### Components/Axes
* **Chart Title:** "MetaQA Hit@1 Scores (Mean ± Std) for Our Method and Baselines"
* **Y-Axis:**
* **Label:** "Hit@1 Score"
* **Scale:** Linear, ranging from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Label:** "MetaQA Dataset"
* **Categories (from left to right):** "1-hop", "2-hop", "3-hop". These represent the complexity of the question-answer task.
* **Legend:**
* **Title:** "Models"
* **Position:** Top-right corner of the plot area.
* **Items (with associated colors):**
1. **Our method** (Red bar)
2. **LLM+KG** (Orange bar)
3. **LLM+QD** (Magenta/Pink bar)
4. **LLM** (Dark Purple bar)
* **Data Representation:** For each X-axis category (hop level), there is a group of four bars, one for each model in the legend order. Each bar has a black error bar extending vertically from its top, representing the standard deviation (Std).
### Detailed Analysis
**1-hop Category (Leftmost Group):**
* **Our method (Red):** Second highest in this group. Bar height is approximately **0.91**. Error bar extends from ~0.89 to ~0.93.
* **LLM+KG (Orange):** Highest score in this group. Bar height is approximately **0.95**. Error bar extends from ~0.93 to ~0.97.
* **LLM+QD (Magenta):** Significantly lower. Bar height is approximately **0.45**. Error bar extends from ~0.42 to ~0.48.
* **LLM (Purple):** Similar to LLM+QD. Bar height is approximately **0.46**. Error bar extends from ~0.43 to ~0.49.
**2-hop Category (Middle Group):**
* **Our method (Red):** Remains the highest. Bar height is approximately **0.78**. Error bar extends from ~0.76 to ~0.80.
* **LLM+KG (Orange):** Second highest, but with a larger drop from 1-hop. Bar height is approximately **0.58**. Error bar extends from ~0.56 to ~0.60.
* **LLM+QD (Magenta):** Lowest in this group. Bar height is approximately **0.37**. Error bar extends from ~0.34 to ~0.40.
* **LLM (Purple):** Slightly higher than LLM+QD. Bar height is approximately **0.38**. Error bar extends from ~0.35 to ~0.41.
**3-hop Category (Rightmost Group):**
* **Our method (Red):** Still the highest, but with a further decline. Bar height is approximately **0.63**. Error bar extends from ~0.61 to ~0.65.
* **LLM+KG (Orange):** Lowest in this group. Bar height is approximately **0.49**. Error bar extends from ~0.47 to ~0.51.
* **LLM+QD (Magenta):** Second highest. Bar height is approximately **0.54**. Error bar extends from ~0.52 to ~0.56.
* **LLM (Purple):** Third highest. Bar height is approximately **0.52**. Error bar extends from ~0.50 to ~0.54.
**Trend Verification per Model:**
* **Our method (Red):** Shows a clear, consistent downward trend as hop count increases (0.91 -> 0.78 -> 0.63). It is the top performer at 2-hop and 3-hop, and a close second at 1-hop.
* **LLM+KG (Orange):** Also shows a consistent downward trend (0.95 -> 0.58 -> 0.49). It leads at 1-hop but falls to last place at 3-hop.
* **LLM+QD (Magenta):** Performance dips at 2-hop (0.45 -> 0.37) but recovers at 3-hop (0.54), where it overtakes LLM+KG. It is not consistently ranked across categories.
* **LLM (Purple):** Performance dips at 2-hop (0.46 -> 0.38) and recovers slightly at 3-hop (0.52). It tracks LLM+QD closely throughout.
### Key Observations
1. **Performance Hierarchy:** At 1-hop and 2-hop, the top two models are clearly separated from the bottom two: **Our method** and **LLM+KG** far outperform **LLM+QD ≈ LLM**. LLM+KG narrowly leads at 1-hop, while "Our method" leads at 2-hop and 3-hop.
2. **Impact of Complexity:** All models experience a decline in Hit@1 Score as the number of reasoning hops increases from 1 to 3. This indicates that multi-hop reasoning is a more challenging task for all evaluated systems.
3. **Error Bar Overlap:** The error bars for "Our method" and "LLM+KG" do not overlap with each other or with the other two models in the 1-hop and 2-hop categories, suggesting the performance differences are statistically significant. At 3-hop, the error bars for LLM+QD and LLM overlap, indicating their performance is not significantly different at this complexity level.
4. **Relative Resilience:** While all models decline, "Our method" shows the most resilience in absolute terms, maintaining a score above 0.6 even at 3-hop complexity. LLM+KG experiences the sharpest drop between 1-hop and 2-hop.
### Interpretation
The data strongly suggests that the proposed model ("Our method") is the most effective for multi-hop MetaQA questions, outperforming all baselines at 2-hop and 3-hop while remaining competitive at 1-hop. The inclusion of a Knowledge Graph (LLM+KG) provides a significant boost over the base LLM, particularly for simpler (1-hop) questions, but its advantage diminishes with increased complexity. The LLM+QD and base LLM models perform poorly in comparison, struggling even with 1-hop questions.
The universal downward trend across all models underscores the inherent difficulty of scaling question-answering systems to handle longer reasoning chains. The fact that "Our method" degrades more gracefully suggests its architecture or training better captures the dependencies required for multi-hop inference. The chart serves as evidence that the authors' method advances the state-of-the-art for this specific benchmark, addressing a key challenge in knowledge-intensive QA systems.
</details>
Figure 6: MetaQA performance results for Experiment 2, over 8 samples of 500 questions for each of the three datasets. The bars show the mean Hit@1, and the error bars show the standard deviation. The results for both the system and the baselines are shown.
Comparing the baselines, the advantage of the KG retrieval module is most pronounced for 1-hop questions, but diminishes for 2-hop questions and disappears for 3-hop questions, likely because complex queries increase the difficulty of retrieving relevant triples. The integration of question decomposition in our system, however, maintains the benefits of KG retrieval for multi-hop questions while also enhancing answer explainability.
In summary, our system achieves improved performance on multi-hop questions with only a minor loss for 1-hop queries compared to the LLM+KG baseline. Although the relative and absolute advantage decreases as the number of hops increases, these quantitative results, combined with a forthcoming qualitative analysis (Section 4.4), support the effectiveness of our approach.
### 4.4 Qualitative Analysis
This section examines the model outputs to identify recurring behaviors, strengths, and weaknesses, and to suggest directions for future improvements. Given the inherent limitations of a small, quantized LLM, our focus is on common patterns rather than isolated errors.
Table 3: The datasets that were analyzed for the qualitative analysis.
| MetaQA 1-hop | KG-RAG | 1 | 0 | N=3, K=30 |
| --- | --- | --- | --- | --- |
| MetaQA 2-hop | KG-RAG | 1 | 0 | N=3, K=30 |
| MetaQA 3-hop | KG-RAG | 1 | 0 | N=3, K=30 |
Table 3 lists the generated outputs used in this analysis. First, we present an example output from the MetaQA 2-hop dataset (Figure 7). This example demonstrates how the system correctly decomposes a question into coherent sub-questions, substitutes sub-answers appropriately, and synthesizes a final answer that accurately addresses the main question while maintaining a logical chain-of-thought.
<details>
<summary>extracted/6354852/Figs/output_example.png Details</summary>

### Visual Description
## Document Screenshot: MetaQA 2-Hop Correct Example
### Overview
The image displays a structured, table-like example of a multi-hop question-answering process. It is presented as a technical document or educational figure, likely from a research paper or tutorial on reasoning systems. The content is entirely textual, organized into labeled sections within a bordered layout with a green header.
### Components/Axes
The document is structured as a single column of bordered sections, each with a bolded label. The visual hierarchy is as follows:
* **Header (Top):** A solid green bar containing the title "MetaQA 2-Hop Correct Example" in white text.
* **Main Content Area:** A series of rectangular boxes with light green borders, stacked vertically. Each box contains a label in bold followed by the corresponding text content.
* **Spatial Layout:** All text is left-aligned within its respective box. The sections flow sequentially from top to bottom, illustrating a step-by-step reasoning chain.
### Detailed Analysis / Content Details
The document contains the following text, transcribed precisely from top to bottom:
**Header:**
MetaQA 2-Hop Correct Example
**Section 1:**
**Question:**
What genres are the movies written by John Shirley in?
**Section 2:**
**Chain-of-Thought:**
To find out what genres the movies written by John Shirley are in, I first need to identify what movies were written by John Shirley. I can then look up the genre of each of these movies.
**Section 3:**
**Sub-Question 1:**
1. What movies were written by John Shirley?
**Section 4:**
**Triples Sub-Question 1:**
(the crow, written by, john shirley)
(john carter, has genre, action)
(dick tracy, has genre, action)
...
**Section 5:**
**Sub-Answer 1:**
John Shirley wrote the following movies based on the provided information: The Crow.
**Section 6:**
**Sub-Question 2:**
What is the genre of The Crow movie?
**Section 7:**
**Triples Sub-Question 2:**
(the crow, has genre, action)
(the crow, directed by, alex proyas)
(the crow, starred actors, brandon lee)
...
**Section 8:**
**Sub-Answer 2:**
Based on the provided information, The Crow is a movie with the genre "action".
**Section 9:**
**Final Answer:**
Based on the context provided, John Shirley has written at least one movie, "The Crow," which falls under the genre of "action."
### Key Observations
1. **Structured Reasoning:** The document explicitly breaks down a complex question ("2-Hop") into a sequence of simpler sub-questions and answers.
2. **Use of Knowledge Triples:** The "Triples" sections present data in a (subject, predicate, object) format, which is common in knowledge graphs. The ellipses (`...`) indicate that the listed triples are a subset of a larger available dataset.
3. **Logical Flow:** The Chain-of-Thought section outlines the plan, which is then executed step-by-step. Sub-Answer 1 provides the necessary entity ("The Crow") to address Sub-Question 2.
4. **Data Consistency:** The triples and answers are consistent. The triple `(the crow, written by, john shirley)` supports Sub-Answer 1, and the triple `(the crow, has genre, action)` directly supports Sub-Answer 2 and the Final Answer.
### Interpretation
This image serves as a pedagogical example of **multi-hop reasoning** in artificial intelligence, specifically for question answering over a knowledge base. It demonstrates how a system can decompose a query requiring multiple pieces of connected information (hop 1: find movies by an author; hop 2: find the genre of a specific movie) into a solvable sequence.
The "Triples" sections are crucial; they represent the raw, structured data the reasoning system accesses. The example shows successful **information chaining**, where the output of the first reasoning step (identifying "The Crow") becomes the input for the second step. The final answer is a synthesized conclusion derived from this chain, not a direct lookup. This illustrates a core capability for AI systems to answer complex queries by navigating relational data, moving beyond simple keyword matching. The document's clean, segmented layout is designed to make this logical process transparent and easy to follow for a human reader.
</details>
Figure 7: An example of the system's intermediate outputs, which lead to the final answer. The example was taken from the MetaQA 2-hop sample that was analyzed for the qualitative analysis.
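The decomposition-and-substitution loop behind examples like this can be sketched as follows. The prompt wording and the `llm` and `retrieve_triples` stand-ins are assumptions for illustration, not the system's actual prompts or interfaces.

```python
def answer_question(question, llm, retrieve_triples):
    """Sketch of the KG-RAG loop: decompose, answer sub-questions in order, synthesize."""
    # 1. Chain-of-thought decomposition: the model lists sub-questions, one per line.
    plan = llm(f"Break this question into sub-questions, one per line:\n{question}")
    sub_questions = [line.strip() for line in plan.splitlines()
                     if line.strip().endswith("?")] or [question]
    chain = []  # (sub-question, sub-answer) pairs form the explicit reasoning chain
    for sq in sub_questions:
        # 2. Expose earlier sub-answers so follow-ups can refer to concrete entities.
        prior = "\n".join(f"{q} -> {a}" for q, a in chain)
        # 3. Retrieve top-K triples for this sub-question; answer from them alone.
        triples = "\n".join(f"({s}, {p}, {o})" for s, p, o in retrieve_triples(sq))
        chain.append((sq, llm(f"Answer using only these triples:\n{triples}\n"
                              f"Earlier steps:\n{prior}\nQ: {sq}")))
    # 4. Synthesize the final answer from the full reasoning chain.
    steps = "\n".join(f"{q} -> {a}" for q, a in chain)
    return llm(f"Give the final answer to: {question}\nReasoning chain:\n{steps}"), chain
```

Returning the chain alongside the final answer is what makes the reasoning traceable: a user can check each sub-answer against its retrieved triples, as in Figure 7.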
#### 4.4.1 Question Decomposition
By analyzing the distribution of the number of generated sub-questions per dataset (Figure 8), we observe that the model generally recognizes the appropriate complexity of MetaQA multi-hop questions. For 1-hop questions, the model typically avoids decomposition, though ambiguous queries (e.g. asking for a movie description) sometimes lead to unnecessary sub-questions. For 2-hop and 3-hop questions, the model usually generates the expected number of sub-questions, although there are occasional cases of under-decomposition.
<details>
<summary>extracted/6354852/Figs/subq_distribution.png Details</summary>

### Visual Description
## Bar Chart: Distribution of Generated Sub-questions per MetaQA Dataset
### Overview
The image displays a set of three bar charts arranged horizontally, collectively titled "Distribution of the Number of Generated Sub-questions per Dataset (MetaQA)". Each chart represents a different subset of the MetaQA dataset, categorized by the number of reasoning "hops" required: 1-hop, 2-hop, and 3-hop. The charts show the frequency distribution of how many sub-questions were generated for questions within each subset.
### Components/Axes
* **Main Title:** "Distribution of the Number of Generated Sub-questions per Dataset (MetaQA)"
* **Subplot Titles (from left to right):**
* "MetaQA 1-hop"
* "MetaQA 2-hop"
* "MetaQA 3-hop"
* **X-Axis (All Subplots):** Labeled "Number of Sub-questions". The axis markers are integers from 0 to 6.
* **Y-Axis (Leftmost Subplot Only):** Labeled "Count". The axis markers are 0, 20, 40, 60, 80. The scale is consistent across all three subplots.
* **Data Series:** Each subplot contains a single data series represented by blue vertical bars. There is no separate legend, as the title of each subplot defines the data category.
### Detailed Analysis
**MetaQA 1-hop Chart (Left):**
* **Trend:** The distribution is heavily right-skewed, with a dominant peak at 1 sub-question and a long, low tail extending to higher numbers.
* **Data Points (Approximate Counts):**
* 0 sub-questions: ~0
* 1 sub-question: ~83
* 2 sub-questions: ~5
* 3 sub-questions: ~6
* 4 sub-questions: ~4
* 5 sub-questions: ~1
* 6 sub-questions: ~1
**MetaQA 2-hop Chart (Center):**
* **Trend:** The distribution is centered and peaked at 2 sub-questions, with smaller counts for 1 and 3, and very few for other values.
* **Data Points (Approximate Counts):**
* 0 sub-questions: ~1
* 1 sub-questions: ~9
* 2 sub-questions: ~87
* 3 sub-questions: ~2
* 4 sub-questions: ~1
* 5 sub-questions: ~0
* 6 sub-questions: ~0
**MetaQA 3-hop Chart (Right):**
* **Trend:** The distribution is centered and peaked at 3 sub-questions, with a notable secondary count at 2, and very few for other values.
* **Data Points (Approximate Counts):**
* 0 sub-questions: ~0
* 1 sub-questions: ~0
* 2 sub-questions: ~19
* 3 sub-questions: ~78
* 4 sub-questions: ~3
* 5 sub-questions: ~0
* 6 sub-questions: ~0
### Key Observations
1. **Clear Modal Shift:** The peak (mode) of the distribution shifts directly with the hop count: 1 sub-question for 1-hop, 2 for 2-hop, and 3 for 3-hop. This is the most prominent pattern.
2. **Distribution Shape:** All distributions are unimodal and right-skewed, but the skew is most extreme for the 1-hop dataset. The 2-hop and 3-hop distributions are more symmetric around their peaks.
3. **Low Variance for Higher Hops:** The 2-hop and 3-hop datasets show very little variance; the vast majority of questions are generated with exactly 2 or 3 sub-questions, respectively. The 1-hop dataset has a more noticeable tail.
4. **Presence of Zero:** A very small number of 2-hop questions appear to have been generated with 0 sub-questions, which is an outlier compared to the other datasets.
### Interpretation
The data strongly suggests a direct, near-linear relationship between the inherent complexity of a question (as defined by its "hop" count in the MetaQA dataset) and the number of sub-questions a model generates to answer it. This indicates the model's question decomposition strategy is well-calibrated to the dataset's structure.
* **1-hop questions** are treated as largely atomic, requiring only one sub-question in most cases (~83% of the time), with occasional minor decomposition.
* **2-hop and 3-hop questions** are systematically broken down into a number of sub-questions that closely matches their hop count. This implies the model is successfully identifying the necessary reasoning steps.
* The **outlier of 0 sub-questions for a 2-hop question** could indicate a failure case where the model attempted to answer directly without decomposition, or a data anomaly.
* The **secondary peak at 2 sub-questions for the 3-hop dataset** suggests that for some 3-hop questions, the model found a more efficient reasoning path requiring only two intermediate steps, or that the decomposition was not perfectly aligned with the ground-truth hops.
Overall, the charts demonstrate that the sub-question generation process is not random but is systematically influenced by the logical complexity of the input question, as defined by the MetaQA benchmark.
</details>
Figure 8: The distribution of the number of sub-questions that were generated, for each of the MetaQA samples analyzed for the qualitative analysis (see Table 3 for more details).
#### 4.4.2 Qualitative Performance
Overall, the system effectively distinguishes question complexity, but several systematic errors were identified:
- Over-decomposition: In approximately 16% of 1-hop cases, ambiguous questions lead to extra sub-questions, resulting in longer, sometimes overcomplicated answers.
- Under-decomposition: For 2-hop and 3-hop datasets, the system occasionally fails to generate enough sub-questions, sometimes producing only a 1-hop and a 2-hop question instead of the full decomposition.
- Sub-answer Inconsistencies: The LLM sometimes produces sub-answers that do not align with the provided triples, either by overlooking relevant data or by incorporating its own external knowledge.
- Final Answer Synthesis: While the final synthesis step generally succeeds, it occasionally yields overly long answers that may exceed token limits or include unwarranted information.
Despite these issues, the generated reasoning chains remain logical and coherent, allowing users to trace and verify the main answer. Many of the observed errors can be attributed to the limitations of the quantized LLM, and it is expected that a more sophisticated model or refined prompting strategies (potentially using ICL) could mitigate these problems.
In conclusion, while triple selection remains robust when question decomposition is successful, the identified issues in decomposition, sub-answer generation, and answer synthesis indicate clear avenues for future research and improvements.
### 4.5 Discussion: Limitations
Here, we outline key limitations of the research carried out, which subsequently allow us to formulate future work.
First, constrained computational resources forced the use of a quantized, relatively small LLM, significantly impacting absolute performance, despite potentially preserving relative improvements over baselines. These constraints also necessitated random sampling of test subsets rather than evaluating on full datasets.
Second, the MetaQA benchmark is relatively simple, with a narrow domain and exclusively multi-hop questions. As noted in Section 4.3.1, some 3-hop questions are answerable using only 1-hop triples, which may skew performance evaluations compared to more complex benchmarks.
Additionally, earlier experiments with another dataset, the Mintaka benchmark, revealed that the Hit@1 metric can be inaccurate, particularly for comparative questions where the system generates full answers instead of single answer entities. This limitation, highlighted in related work [9, 11, 10], underscores the need for more sophisticated evaluation methods. Recent research on automated evaluation of natural language outputs [39] may offer promising alternatives.
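To illustrate why Hit@1 becomes unreliable for full-sentence outputs, here is a minimal sketch of a lenient Hit@1 check based on normalized containment; the normalization rule is our assumption, not the benchmark's official scorer. A verbose but correct answer is counted, yet a comparative answer that merely mentions a gold entity is over-credited.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def hit_at_1(generated_answer, gold_entities):
    """Lenient Hit@1: count a hit if any gold entity appears in the generated answer."""
    answer = normalize(generated_answer)
    return any(normalize(entity) in answer for entity in gold_entities)
```

The failure mode for comparative questions is then easy to see: an answer like "Avatar is longer than Alien" contains both candidate entities, so containment-based matching cannot tell a correct comparison from an incorrect one.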
In summary, key limitations include the restricted LLM size, the simplicity and flaws of the MetaQA benchmark, and the inadequacy of the Hit@1 metric for modern KGQA systems.
## 5 Conclusion
### 5.1 Contributions
Our study addressed two primary research questions. First, we investigated enhancing LLMs with knowledge graphs (KGs) without requiring any training. By leveraging the synergy between LLM reasoning and the structured knowledge in KGs, we identified a gap in creating KGQA models that are both generalizable (as in KAPING) and explainable (as in Keqing). To bridge this gap, we developed an improved, training-free version of KAPING.
Second, we explored methods to improve answer explainability using a KG-RAG system. Inspired by Keqing [10] and the work of [35], we designed a question decomposition module that first generates a chain-of-thought (CoT) followed by coherent sub-questions. This approach not only improved performance on multi-hop questions but also provided transparent reasoning chains, thereby enhancing answer explainability. Overall, the proposed solution achieved higher answer accuracy (as measured by the Hit@1 metric) and improved transparency, though further validation is needed to confirm its generalizability across different domains, KGs, and question types.
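The decompose-retrieve-answer-synthesize loop summarized above can be sketched as follows. This is a minimal illustration of the control flow only: `llm` and `retrieve_triples` are stand-in callables, and the prompt wording is hypothetical, not the paper's actual prompts.

```python
from typing import Callable

def kg_rag_answer(question: str,
                  llm: Callable[[str], str],
                  retrieve_triples: Callable[[str], list[str]]) -> str:
    # 1. CoT-style decomposition of the question into sub-questions
    #    (here assumed to be returned one per line).
    sub_questions = llm(
        f"Think step by step, then list the sub-questions needed to "
        f"answer:\n{question}"
    ).splitlines()

    # 2. Answer each sub-question against KG triples retrieved for it.
    sub_answers = []
    for sq in sub_questions:
        triples = retrieve_triples(sq)
        sub_answers.append(llm(f"Facts: {triples}\nQuestion: {sq}\nAnswer:"))

    # 3. Synthesize the final answer from the explicit reasoning chain,
    #    which also serves as the user-facing explanation.
    chain = "\n".join(f"{q} -> {a}" for q, a in zip(sub_questions, sub_answers))
    return llm(f"Reasoning chain:\n{chain}\nFinal answer to '{question}':")
```

Because each sub-answer is generated against its own retrieved triples, the intermediate chain can be inspected independently of the final answer, which is the source of the explainability gain.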
### 5.2 Future Work
Future research should focus on deepening the investigation into application generalizability by employing benchmarks with KGs composed largely of natural language, ensuring the triple selection mechanism via text embeddings functions effectively. Given some limitations of the MetaQA benchmark, exploring alternative benchmarks with diverse question domains may yield more robust conclusions.
Improved evaluation methods are also necessary. Automated techniques, such as entity matching, coherence assessment of reasoning chains via LLM prompting, and verification of sub-answer validity, could offer more reliable metrics than the currently used Hit@1 [39].
Furthermore, employing more sophisticated LLMs and tuning inference parameters could mitigate many of the systematic errors observed. Future work may also explore advanced, training-free triple retrieval methods or fine-tuning strategies for text embedding models, thereby enhancing performance and efficiency. Finally, addressing persistent research gaps in question entity identification and entity matching is crucial for real-world KGQA applications.
## References
- Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020)
- Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of Hallucination in Natural Language Generation. ACM Computing Surveys 55 (12) (2023) https://doi.org/10.1145/3571730
- Pan et al. [2023] Pan, J.Z., Razniewski, S., Kalo, J.-C., Singhania, S., Chen, J., Dietze, S., Jabeen, H., Omeliyanenko, J., Zhang, W., Lissandrini, M., Biswas, R., Melo, G., Bonifati, A., Vakaj, E., Dragoni, M., Graux, D.: Large Language Models and Knowledge Graphs: Opportunities and Challenges, 1–30 (2023)
- Bender et al. [2021] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? FAccT 2021 - Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623 (2021) https://doi.org/10.1145/3442188.3445922
- Pan et al. [2023] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., Wu, X.: Unifying Large Language Models and Knowledge Graphs: A Roadmap 14 (8), 1–29 (2023)
- Yang et al. [2023] Yang, L., Chen, H., Li, Z., Ding, X., Wu, X.: ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling 14 (8), 1–20 (2023)
- Lewis et al. [2020] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.T., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020)
- Gao et al. [2023] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., Wang, H.: Retrieval-Augmented Generation for Large Language Models: A Survey (2023)
- Baek et al. [2023] Baek, J., Aji, A.F., Saffari, A.: Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 70–98 (2023) https://doi.org/10.18653/v1/2023.nlrse-1.7
- Wang et al. [2023] Wang, C., Xu, Y., Peng, Z., Zhang, C., Chen, B., Wang, X., Feng, L., An, B.: keqing: knowledge-based question answering is a nature chain-of-thought mentor of LLM (2023)
- Wu et al. [2023] Wu, Y., Hu, N., Bi, S., Qi, G., Ren, J., Xie, A., Song, W.: Retrieve-Rewrite-Answer: A KG-to-Text Enhanced LLMs Framework for Knowledge Graph Question Answering (2023)
- Arenas and Perez [2013] Arenas, M., Perez, J.: Querying Semantic Web Data with SPARQL. ACM (2013)
- Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
- Kaplan et al. [2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling Laws for Neural Language Models (2020)
- Zhao et al. [2023] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.-Y., Wen, J.-R.: A Survey of Large Language Models, 1–97 (2023)
- Chen et al. [2023] Chen, J., Chen, L., Zhu, C., Zhou, T.: How Many Demonstrations Do You Need for In-context Learning? Technical report (2023)
- Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)
- Sen et al. [2022] Sen, P., Aji, A.F., Saffari, A.: Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering. Proceedings - International Conference on Computational Linguistics, COLING 29 (1), 1604–1619 (2022)
- Yih et al. [2016] Yih, W.T., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Short Papers, 201–206 (2016) https://doi.org/10.18653/v1/p16-2033
- Zhang et al. [2018] Zhang, Y., Dai, H., Kozareva, Z., Smola, A.J., Song, L.: Variational Reasoning for Question Answering with Knowledge Graph, 1–22 (2018)
- Oliya et al. [2021] Oliya, A., Saffari, A., Sen, P., Ayoola, T.: End-to-End Entity Resolution and Question Answering Using Differentiable Knowledge Graphs. Technical report (2021)
- Sen et al. [2023] Sen, P., Mavadia, S., Saffari, A.: Knowledge Graph-augmented Language Models for Complex Question Answering. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1–8 (2023) https://doi.org/10.18653/v1/2023.nlrse-1.1
- Gu et al. [2022] Gu, Y., Pahuja, V., Cheng, G., Su, Y.: Knowledge Base Question Answering: A Semantic Parsing Perspective (2022)
- Sanh et al. [2022] Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Le Scao, T., Raja, A., Dey, M., Bari, M.S., Xu, C., Thakker, U., Sharma, S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N.V., Datta, D., Chang, J., Jiang, M.T.J., Wang, H., Manica, M., Shen, S., Yong, Z.X., Pandey, H., McKenna, M., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Fevry, T., Fries, J.A., Teehan, R., Bers, T., Biderman, S., Gao, L., Wolf, T., Rush, A.M.: Multitask Prompted Training Enables Zero-Shot Task Generalization. ICLR 2022 - 10th International Conference on Learning Representations (2022)
- Raffel et al. [2020] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1–67 (2020)
- Chung et al. [2022] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E.H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q.V., Wei, J.: Scaling Instruction-Finetuned Language Models, 1–54 (2022)
- Zhang et al. [2022] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P.S., Sridhar, A., Wang, T., Zettlemoyer, L.: OPT: Open Pre-trained Transformer Language Models (2022)
- Fitzgerald et al. [2022] Fitzgerald, J., Ananthakrishnan, S., Arkoudas, K., Bernardi, D., Bhagia, A., Delli Bovi, C., Cao, J., Chada, R., Chauhan, A., Chen, L., Dwarakanath, A., Dwivedi, S., Gojayev, T., Gopalakrishnan, K., Gueudre, T., Hakkani-Tur, D., Hamza, W., Hueser, J.J., Jose, K.M., Khan, H., Liu, B., Lu, J., Manzotti, A., Natarajan, P., Owczarzak, K., Oz, G., Palumbo, E., Peris, C., Prakash, C.S., Rawls, S., Rosenbaum, A., Shenoy, A., Soltan, S., Sridhar, M.H., Tan, L., Triefenbach, F., Wei, P., Yu, H., Zheng, S., Tur, G., Natarajan, P.: Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2893–2902 (2022) https://doi.org/10.1145/3534678.3539173
- Touvron et al. [2023] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)
- Berant [2013] Berant, J.: Semantic Parsing on Freebase from Question-Answer Pairs, 1533–1544 (2013)
- Talmor and Berant [2018] Talmor, A., Berant, J.: The Web as a Knowledge-base for Answering Complex Questions (2018)
- Dubey et al. [2019] Dubey, M., Banerjee, D., Abdelkawi, A.: LC-QuAD 2.0: A Large Dataset for Complex Question Answering over Wikidata and DBpedia (2019)
- Pedersen et al. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity - Measuring the Relatedness of Concepts. Technical report. http://search.cpan.org/dist/WordNet-Similarity http://wn-similarity.sourceforge.net
- Hu et al. [2021] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models (2021)
- Radhakrishnan et al. [2023] Radhakrishnan, A., Nguyen, K., Chen, A., Chen, C., Denison, C., Hernandez, D., Durmus, E., Hubinger, E., Kernion, J., LukoĆĄiĆ«tÄ, K., Cheng, N., Joseph, N., Schiefer, N., Rausch, O., McCandlish, S., Showk, S.E., Lanham, T., Maxwell, T., Chandrasekaran, V., Hatfield-Dodds, Z., Kaplan, J., Brauner, J., Bowman, S.R., Perez, E.: Question Decomposition Improves the Faithfulness of Model-Generated Reasoning (2023)
- Jiang et al. [2023] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.-A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7B (2023)
- Nguyen et al. [2024] Nguyen, M., Baker, A., Neo, C., Roush, A., Kirsch, A., Shwartz-Ziv, R.: Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs (2024)
- Steinmetz and Sattler [2021] Steinmetz, N., Sattler, K.U.: What is in the KGQA Benchmark Datasets? Survey on Challenges in Datasets for Question Answering on Knowledge Graphs. Journal on Data Semantics 10 (3-4), 241–265 (2021) https://doi.org/10.1007/s13740-021-00128-9
- Guo et al. [2023] Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., Xiong, D.: Evaluating Large Language Models: A Comprehensive Survey (2023)