2305.01157
# Complex Logical Reasoning over Knowledge Graphs using Large Language Models
**Authors**:
- Nurendra Choudhary (Department of Computer Science)
- Chandan K. Reddy (Department of Computer Science)
Abstract
Reasoning over knowledge graphs (KGs) is a challenging task that requires a deep understanding of the complex relationships between entities and the underlying logic of their relations. Current approaches rely on learning geometries to embed entities in vector space for logical query operations, but they suffer from subpar performance on complex queries and dataset-specific representations. In this paper, we propose a novel decoupled approach, Language-guided Abstract Reasoning over Knowledge graphs (LARK), that formulates complex KG reasoning as a combination of contextual KG search and logical query reasoning, to leverage the strengths of graph extraction algorithms and large language models (LLMs), respectively. Our experiments demonstrate that the proposed approach outperforms state-of-the-art KG reasoning methods on standard benchmark datasets across several logical query constructs, with significant performance gain for queries of higher complexity. Furthermore, we show that the performance of our approach improves proportionally to the increase in size of the underlying LLM, enabling the integration of the latest advancements in LLMs for logical reasoning over KGs. Our work presents a new direction for addressing the challenges of complex KG reasoning and paves the way for future research in this area.
1 Introduction
Knowledge graphs (KGs) encode knowledge in a flexible triplet schema where two entity nodes are connected by relational edges. However, several real-world KGs, such as Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and NELL (Carlson et al., 2010), are often large-scale, noisy, and incomplete. Thus, reasoning over such KGs is a fundamental and challenging problem in AI research. The over-arching goal of logical reasoning is to develop answering mechanisms for first-order logic (FOL) queries over KGs using the operators of existential quantification ( $\exists$ ), conjunction ( $\wedge$ ), disjunction ( $\vee$ ), and negation ( $\neg$ ). Current research on this topic primarily focuses on the creation of diverse latent space geometries, such as vectors (Hamilton et al., 2018), boxes (Ren et al., 2020), hyperboloids (Choudhary et al., 2021b), and probabilistic distributions (Ren & Leskovec, 2020), in order to effectively capture the semantic position and logical coverage of knowledge graph entities. Despite their success, these approaches are limited in their performance due to the following: (i) Complex queries: they rely on constrained formulations of FOL queries that lose information on complex queries that require chain reasoning (Choudhary et al., 2021a) and involve multiple relationships between entities in the KG; (ii) Generalizability: optimization for a particular KG may not generalize to other KGs, which limits the applicability of these approaches in real-world scenarios where KGs can vary widely in terms of their structure and content; and (iii) Scalability: intensive training times limit the scalability of these approaches to larger KGs and the incorporation of new data into existing KGs. To address these limitations, we aim to leverage the reasoning abilities of large language models (LLMs) in a novel framework, shown in Figure 1, called Language-guided Abstract Reasoning over Knowledge graphs (LARK).
<details>
<summary>extracted/2305.01157v3/images/example_logical_query.png Details</summary>

### Visual Description
## Diagram: Nobel Prize Winners Outside Europe and North America
### Overview
The image is a diagram illustrating the logical query for identifying Nobel Prize winners who are not from Europe or North America. It uses nodes and directed edges to represent entities and relationships, respectively. The diagram visually represents a logical expression.
### Components/Axes
* **Nodes:** Represent entities or sets of entities.
* Blue nodes: "Nobel Prize", "Europe", "North America"
* Green nodes: "Nobel Prize Winners", "Europeans", "North Americans"
* Purple node: V (OR operator)
* Pink node: ¬ (NOT operator)
* Yellow node: "Non-Europeans and Non-North Americans"
* Light Blue node with ^ (AND operator)
* **Edges:** Represent relationships between entities.
* Solid blue edges: "winner", "citizen"
* Dashed green edges: connecting "Europeans" and "North Americans" to the OR operator
* Dashed purple edge: connecting the OR operator to the NOT operator
* Solid red edge: connecting the NOT operator to the "Non-Europeans and Non-North Americans" node
* Solid green edge: connecting "Nobel Prize Winners" to the AND operator
* Solid yellow edge: connecting "Non-Europeans and Non-North Americans" to the AND operator
* Solid blue edge: connecting the AND operator to "names"
### Detailed Analysis
* **Top-Left:** "Nobel Prize" (blue node) has a "winner" (blue edge) relationship to "Nobel Prize Winners" (green node).
* **Middle-Left:** "Europe" (blue node) has a "citizen" (blue edge) relationship to "Europeans" (green node).
* **Bottom-Left:** "North America" (blue node) has a "citizen" (blue edge) relationship to "North Americans" (green node).
* **Center:** "Europeans" and "North Americans" are connected to a purple node labeled "V" (OR operator) via dashed green edges.
* **Right of Center:** The output of the OR operator is connected to a pink node labeled "¬" (NOT operator) via a dashed purple edge.
* **Right of the NOT operator:** The output of the NOT operator is connected to a yellow node labeled "Non-Europeans and Non-North Americans" via a solid red edge.
* **Top-Right:** "Nobel Prize Winners" and "Non-Europeans and Non-North Americans" are connected to a light blue node labeled "^" (AND operator) via solid green and yellow edges, respectively.
* **Far-Right:** The output of the AND operator is connected to "names" via a solid blue edge.
* **Textual Query (Top):**
* "Who(X) are the Nobel Prize (N) winners (W) not from Europe (E) or North America (A)?"
* "?X.∃X.names(X, W.∃W.[winners(W, NobelPrize) ∧ ∃T.[¬(citizen(T, E) ∨ citizen(T, A))]])"
### Key Observations
* The diagram represents a logical query to find the names of Nobel Prize winners who are not citizens of either Europe or North America.
* The diagram uses standard logical operators (AND, OR, NOT) to construct the query.
* The flow of the diagram visually represents the steps involved in executing the query.
### Interpretation
The diagram provides a visual representation of a complex logical query. It breaks down the query into smaller, more manageable components, making it easier to understand. The diagram effectively illustrates how the different entities and relationships are combined to arrive at the desired result: the names of Nobel Prize winners who are not from Europe or North America. The use of logical operators and clear labeling of nodes and edges enhances the clarity and interpretability of the diagram.
</details>
(a) Input logical query.
<details>
<summary>extracted/2305.01157v3/images/example_full_query.png Details</summary>

### Visual Description
## Set Theory Problem
### Overview
The image presents a set theory problem involving entities connected to Nobel Prize, Europe, and North America through specific relations. It defines sets E, F, G, and H based on these connections and asks for the entities in the intersection of E and H.
### Components/Axes
* **E**: Set of entities connected to Nobel Prize by relation winner (Black Text)
* **F**: Set of entities connected to Europe by the relation citizen (Black Text)
* **G**: Set of entities connected to North America by the relation citizen (Black Text)
* **H**: Set of entities connected to entities in the negation of the union of F and G (Black Text)
* **Nobel Prize**: Mentioned in Blue Text
* **Europe**: Mentioned in Green Text
* **North America**: Mentioned in Blue Text
* **negation**: Mentioned in Red Text
* **union**: Mentioned in Purple Text
* **intersection**: Mentioned in Blue Text
### Detailed Analysis
The problem defines the following sets:
* **E**: Entities connected to "Nobel Prize" by the relation "winner".
* **F**: Entities connected to "Europe" by the relation "citizen".
* **G**: Entities connected to "North America" by the relation "citizen".
* **H**: Entities connected to the negation of the union of F and G. This means H contains entities connected to things that are *not* citizens of either Europe or North America.
The question asks to find the entities in the intersection of E and H (E ∩ H). This means finding the entities that are both Nobel Prize winners (E) and connected to entities that are not citizens of Europe or North America (H).
### Key Observations
* The problem involves set theory concepts like union, negation, and intersection.
* The sets are defined based on relationships between entities and specific locations or awards.
* The question requires finding the common elements between two derived sets (E and H).
### Interpretation
The problem is a theoretical exercise in set theory and relational logic. It asks us to consider the relationships between different sets of entities based on their connections to specific locations and achievements. The solution would involve identifying entities that are both Nobel Prize winners and connected to entities that are not citizens of either Europe or North America. This could potentially include individuals who have won the Nobel Prize but are citizens of other regions, or entities connected to Nobel Prize winners who are not citizens of Europe or North America.
</details>
(b) Query prompt.
<details>
<summary>extracted/2305.01157v3/images/example_decomp_query.png Details</summary>

### Visual Description
## Question Set: Entity Relations and Set Operations
### Overview
The image presents a series of questions related to entities and their relationships, as well as set operations (union, intersection, and exclusion). Each question is contained within a dashed-line box. The questions explore connections to specific entities (Nobel Prize, Europe, North America) based on defined relations (winner, citizen), and also involve set operations on abstract sets (A1, A2, A3, A4, A5).
### Components/Axes
The image consists of six distinct questions, each formatted as a query. The key components are:
1. **Entity**: The specific entity being queried (e.g., Nobel Prize, Europe, North America, A2, A3, A4, A1, A5).
2. **Relation**: The type of relationship being investigated (e.g., winner, citizen, union, intersection, belong).
3. **Set Operation**: The set operation being performed (union, intersection, exclusion).
### Detailed Analysis
Here's a breakdown of each question:
1. **Question 1:** "What are the entities connected to **Nobel Prize** by relation **winner**?"
* Entity: Nobel Prize (in blue)
* Relation: winner (in bold black)
2. **Question 2:** "What are the entities connected to **Europe** by relation **citizen**?"
* Entity: Europe (in blue)
* Relation: citizen (in bold black)
3. **Question 3:** "What are the entities connected to **North America** by relation **citizen**?"
* Entity: North America (in blue)
* Relation: citizen (in bold black)
4. **Question 4:** "What are the entities in the **union** of **A2** and **A3**?"
* Set Operation: union (in purple)
* Sets: A2 (in orange) and A3 (in orange)
5. **Question 5:** "Which entities **do not belong** to the entity set **A4**?"
* Operation: exclusion (do not belong, in red)
* Set: A4 (in orange)
6. **Question 6:** "What are the entities in the **intersection** of **A1** and **A5**?"
* Set Operation: intersection (in blue)
* Sets: A1 (in orange) and A5 (in orange)
### Key Observations
* The questions are structured to explore relationships between entities and to perform set operations.
* Different colors are used to highlight specific keywords within the questions (blue, purple, red, orange, bold black).
* The questions cover both real-world entities (Nobel Prize, Europe, North America) and abstract sets (A1, A2, A3, A4, A5).
### Interpretation
The image presents a series of queries designed to test knowledge representation and reasoning capabilities. The questions require understanding of entity relationships (e.g., who has won a Nobel Prize, who is a citizen of Europe) and set theory (union, intersection, exclusion). The use of different colors may be intended to emphasize key terms or categories within the questions. The questions could be used to evaluate the performance of a knowledge graph or a question-answering system.
</details>
(c) Decomposed prompt.
<details>
<summary>extracted/2305.01157v3/images/example_answers.png Details</summary>

### Visual Description
## Set Diagram: Set Relationships
### Overview
The image depicts a series of sets (A1 through A6) containing names of individuals. The sets are presented in a vertical arrangement, with each set labeled and its elements listed within curly braces. The final set, A6, is highlighted as the "Final Answer."
### Components/Axes
* **Set Labels:** A1, A2, A3, A4, A5, A6
* **Set Elements:** Names of individuals (e.g., Rabindranath Tagore, Theodore Roosevelt, Wolfgang Pauli, Adolf von Baeyer, Jimmy Carter, Albert Einstein). The "..." indicates that the listed names are not exhaustive.
* **Final Answer Label:** Indicates that set A6 is the solution.
### Detailed Analysis
* **A1:** {Rabindranath Tagore, Theodore Roosevelt, Wolfgang Pauli, ...}
* **A2:** {Wolfgang Pauli, Adolf von Baeyer, ...}
* **A3:** {Theodore Roosevelt, Jimmy Carter, Albert Einstein,...}
* **A4:** {Wolfgang Pauli, Theodore Roosevelt, ...}
* **A5:** {Rabindranath Tagore, ...}
* **A6:** {Rabindranath Tagore, ...}
### Key Observations
* Rabindranath Tagore appears in sets A1, A5, and A6.
* Theodore Roosevelt appears in sets A1, A3, and A4.
* Wolfgang Pauli appears in sets A1, A2, and A4.
* Set A6 is explicitly labeled as the "Final Answer" and contains Rabindranath Tagore.
### Interpretation
The diagram likely represents a step-by-step process of set operations or logical deductions, where each set is derived from the previous ones. The goal is to arrive at the "Final Answer," which in this case is set A6 containing Rabindranath Tagore. The "..." notation suggests that the sets may contain other elements not explicitly listed. The diagram illustrates how different individuals are grouped into sets, and how these sets are related to each other, ultimately leading to a specific solution.
</details>
(d) LLM answers.
Figure 1: Example of LARK’s query chain decomposition and logically-ordered LLM answering for effective performance. LLMs are more adept at answering simple queries, and hence, we decompose the multi-operation complex logical query (a,b) into elementary queries with single operation (c) and then use a sequential LLM-based answering method to output the final answer (d).
In LARK, we utilize the logical queries to search for relevant subgraph contexts over knowledge graphs and perform chain reasoning over these contexts using logically-decomposed LLM prompts. To achieve this, we first abstract out the logical information from both the input query and the KG. Given the invariant nature of logic (logical queries follow the same set of rules and procedures irrespective of the KG context), this enables our method to focus on the logical formulation, avoid model hallucination (the model ignores semantic common-sense knowledge and infers only from the KG entities for answers), and generalize over different knowledge graphs. From this abstract KG, we extract relevant subgraphs using the entities and relations present in the logical query. These subgraphs serve as context prompts for input to LLMs. In the next phase, we need to effectively handle complex reasoning queries. From previous works (Zhou et al., 2023; Khot et al., 2023), we observe that LLMs are significantly less effective on complex prompts than on a sequence of simpler prompts. Thus, to simplify the query, we exploit its logical nature and deterministically decompose the multi-operation query into logically-ordered elementary queries, each containing a single operation (depicted in the transition from Figure 1(b) to 1(c)). Each of these decomposed logical queries is then converted to a prompt and processed through the LLM to generate the final set of answers (shown in Figure 1(d)). The logical queries are handled sequentially, and if query $y$ depends on query $x$ , then $x$ is scheduled before $y$ . Operations are scheduled in a logically-ordered manner to enable batching different logical queries together, and answers are stored in caches for easy access.
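The dependency-ordered answering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `answer` function is a deterministic stand-in for a single LLM call, the toy triplets and the `PEOPLE` universe used for negation are hypothetical data loosely mirroring the names in Figure 1(d), and the names `QUERIES`, `dependencies`, and `run` are introduced here only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy (head, relation, tail) triplets; hypothetical data for illustration only.
TRIPLETS = {
    ("Nobel Prize", "winner", "Rabindranath Tagore"),
    ("Nobel Prize", "winner", "Theodore Roosevelt"),
    ("Nobel Prize", "winner", "Wolfgang Pauli"),
    ("Europe", "citizen", "Wolfgang Pauli"),
    ("Europe", "citizen", "Adolf von Baeyer"),
    ("North America", "citizen", "Theodore Roosevelt"),
    ("North America", "citizen", "Jimmy Carter"),
}
# Assumption: negation is taken relative to all tail entities in the toy data.
PEOPLE = {t for (_, _, t) in TRIPLETS}

# Elementary queries from Figure 1(c): name -> (operation, arguments).
QUERIES = {
    "A1": ("project", ("Nobel Prize", "winner")),
    "A2": ("project", ("Europe", "citizen")),
    "A3": ("project", ("North America", "citizen")),
    "A4": ("union", ("A2", "A3")),
    "A5": ("negate", ("A4",)),
    "A6": ("intersect", ("A1", "A5")),
}


def answer(op, args, cache):
    """Stand-in for one LLM call; a real system would send a prompt here."""
    if op == "project":
        head, rel = args
        return {t for (h, r, t) in TRIPLETS if h == head and r == rel}
    if op == "union":
        return set().union(*(cache[a] for a in args))
    if op == "intersect":
        return set.intersection(*(cache[a] for a in args))
    if op == "negate":
        return PEOPLE - cache[args[0]]
    raise ValueError(f"unknown operation: {op}")


def dependencies(op, args):
    """Projections read the KG directly; set operations read earlier answers."""
    return [] if op == "project" else list(args)


def run(queries):
    """Answer every elementary query after the queries it depends on."""
    graph = {name: dependencies(op, args) for name, (op, args) in queries.items()}
    cache = {}  # answers to earlier queries, reused by later ones
    for name in TopologicalSorter(graph).static_order():
        op, args = queries[name]
        cache[name] = answer(op, args, cache)
    return cache


final_answers = run(QUERIES)["A6"]
```

On the toy data, `final_answers` contains only Rabindranath Tagore, matching the final set in Figure 1(d). Queries with no unmet dependencies (here A1, A2, and A3) could in principle be grouped into one batch, for example with `TopologicalSorter.get_ready()`; that batching step is left out of the sketch.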
The proposed approach effectively integrates logical reasoning over knowledge graphs with the capabilities of LLMs, and to the best of our knowledge, is the first of its kind. Unlike previous approaches that rely on constrained formulations of first-order logic (FOL) queries, our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Our KG search model is inspired by retrieval-augmented techniques (Chen et al., 2022) but exploits the deterministic nature of knowledge graphs to simplify the retrieval of relevant subgraphs. Moreover, compared to other prompting methods (Wei et al., 2022; Zhou et al., 2023; Khot et al., 2023), our chain decomposition technique enhances the reasoning capabilities in knowledge graphs by leveraging the underlying chain of logical operations in complex queries, and by utilizing preceding answers amidst successive queries in a logically-ordered manner. To summarize, the primary contributions of this paper are as follows:
1. We propose Language-guided Abstract Reasoning over Knowledge graphs (LARK), a novel model that utilizes the reasoning abilities of large language models to efficiently answer FOL queries over knowledge graphs.
2. Our model uses entities and relations in queries to find pertinent subgraph contexts within abstract knowledge graphs, and then performs chain reasoning over these contexts using LLM prompts of decomposed logical queries.
3. Our experiments on logical reasoning across standard KG datasets demonstrate that LARK outperforms the previous state-of-the-art approaches by $35\%-84\%$ MRR on 14 FOL query types based on the operations of projection (p), intersection ( $\wedge$ ), union ( $\vee$ ), and negation ( $\neg$ ).
4. We establish the advantages of chain decomposition by showing that LARK performs $20\%-33\%$ better on decomposed logical queries when compared to complex queries on the task of logical reasoning. Additionally, our analysis of LLMs shows the significant contribution of increasing scale and better design of underlying LLMs to the performance of LARK.
2 Related Work
Our work is at the intersection of two topics, namely, logical reasoning over knowledge graphs and reasoning prompt techniques in LLMs.
Logical Reasoning over KGs: Initial approaches in this area (Bordes et al., 2013; Nickel et al., 2011; Das et al., 2017; Hamilton et al., 2018) focused on capturing the semantic information of entities and the relational operations involved in the projection between them. However, further research in the area revealed a need for new geometries to encode the spatial and hierarchical information present in the knowledge graphs. To tackle this issue, models such as Query2Box (Ren et al., 2020), HypE (Choudhary et al., 2021b), PERM (Choudhary et al., 2021a), and BetaE (Ren & Leskovec, 2020) encoded the entities and relations as boxes, hyperboloids, Gaussian distributions, and beta distributions, respectively. Additionally, approaches such as CQD (Arakelyan et al., 2021) have focused on improving the performance of complex reasoning tasks through the answer composition of simple intermediate queries. In another line of research, HamQA (Dong et al., 2023) and QA-GNN (Yasunaga et al., 2021) have developed question-answering techniques that use knowledge graph neighborhoods to enhance the overall performance. We notice that previous approaches in this area have focused on enhancing KG representations for logical reasoning. Contrary to these existing methods, our work provides a systematic framework that leverages the reasoning ability of LLMs and tailors them toward the problem of logical reasoning over knowledge graphs.
Reasoning prompts in LLMs: Recent studies have shown that LLMs can learn various NLP tasks with just context prompts (Brown et al., 2020). Furthermore, LLMs have been successfully applied to multi-step reasoning tasks by providing intermediate reasoning steps, also known as Chain-of-Thought (Wei et al., 2022; Chowdhery et al., 2022), needed to arrive at an answer. Alternatively, certain studies have composed multiple LLMs or LLMs with symbolic functions to perform multi-step reasoning (Jung et al., 2022; Creswell et al., 2023), with a pre-defined decomposition structure. More recent studies such as least-to-most (Zhou et al., 2023), successive (Dua et al., 2022) and decomposed (Khot et al., 2023) prompting strategies divide a complex prompt into sub-prompts and answer them sequentially for effective performance. While this line of work is close to our approach, it does not utilize previous answers to inform successive queries. LARK is unique due to its ability to utilize logical structure in the chain decomposition mechanism, augmentation of retrieved knowledge graph neighborhood, and multi-phase answering structure that incorporates preceding LLM answers amidst successive queries.
3 Methodology
In this section, we will describe the problem setup of logical reasoning over knowledge graphs, and describe the various components of our model.
3.1 Problem Formulation
In this work, we tackle the problem of logical reasoning over knowledge graphs (KGs) $\mathcal{G}:E\times R$ that store entities ( $E$ ) and relations ( $R$ ). Without loss of generality, KGs can also be organized as a set of triplets $\langle e_{1},r,e_{2}\rangle\subseteq\mathcal{G}$ , where each relation $r\in R$ is a Boolean function $r:E\times E\rightarrow\{True,False\}$ that indicates whether the relation $r$ exists between the pair of entities $(e_{1},e_{2})\in E\times E$ . We consider four fundamental first-order logical (FOL) operations: projection (p), intersection ( $\wedge$ ), union ( $\vee$ ), and negation ( $\neg$ ) to query the KG. These operations are defined as follows:
$$
\begin{aligned}
q_{p}[Q_{p}] &\triangleq ?V_{p}:\{v_{1},v_{2},...,v_{k}\}\subseteq E~\exists~a_{1} \\
q_{\wedge}[Q_{\wedge}] &\triangleq ?V_{\wedge}:\{v_{1},v_{2},...,v_{k}\}\subseteq E~\exists~a_{1}\wedge a_{2}\wedge...\wedge a_{i} \\
q_{\vee}[Q_{\vee}] &\triangleq ?V_{\vee}:\{v_{1},v_{2},...,v_{k}\}\subseteq E~\exists~a_{1}\vee a_{2}\vee...\vee a_{i} \\
q_{\neg}[Q_{\neg}] &\triangleq ?V_{\neg}:\{v_{1},v_{2},...,v_{k}\}\subseteq E~\exists~\neg a_{1} \\
\text{where }Q_{p},Q_{\neg} &=(e_{1},r_{1});~Q_{\wedge},Q_{\vee}=\{(e_{1},r_{1}),(e_{2},r_{2}),...,(e_{i},r_{i})\};~\text{and}~a_{i}=r_{i}(e_{i},v_{i})
\end{aligned}
\tag{1}
$$
where $q_{p},q_{\wedge},q_{\vee}$ , and $q_{\neg}$ are projection, intersection, union, and negation queries, respectively; and $V_{p},V_{\wedge},V_{\vee}$ and $V_{\neg}$ are the corresponding results of those queries (Arakelyan et al., 2021; Choudhary et al., 2021a). $a_{i}$ is a Boolean indicator which will be 1 if $e_{i}$ is connected to $v_{i}$ by relation $r_{i}$ , 0 otherwise. The goal of logical reasoning is to formulate the operations such that for a given query $q_{\tau}$ of query type $\tau$ with inputs $Q_{\tau}$ , we are able to efficiently retrieve $V_{\tau}$ from entity set $E$ , e.g., for a projection query $q_{p}[\text{(Nobel Prize, winners)}]$ , we want to retrieve $V_{p}=\{\text{Nobel Prize winners}\}\subseteq E$ .
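As a concrete reading of these definitions, the four operations can be written as plain set computations over a triplet set. The sketch below is illustrative only: the function names, the toy knowledge graph `TOY_KG`, and the choice to take the negation relative to every entity that appears in the graph are assumptions made here, and the sketch says nothing about how LARK itself answers such queries.

```python
# Toy KG as a set of (e, r, v) triplets; hypothetical data for illustration.
TOY_KG = {
    ("Nobel Prize", "winners", "Rabindranath Tagore"),
    ("Nobel Prize", "winners", "Wolfgang Pauli"),
    ("Europe", "citizen", "Wolfgang Pauli"),
    ("Europe", "citizen", "Adolf von Baeyer"),
}


def project(kg, e, r):
    """V_p: every v for which a_1 = r(e, v) holds."""
    return {v for (head, rel, v) in kg if head == e and rel == r}


def intersection(kg, pairs):
    """V_and: every v that satisfies a_1 and a_2 and ... and a_i."""
    answer_sets = [project(kg, e, r) for (e, r) in pairs]
    return set.intersection(*answer_sets) if answer_sets else set()


def union(kg, pairs):
    """V_or: every v that satisfies a_1 or a_2 or ... or a_i."""
    return set().union(*(project(kg, e, r) for (e, r) in pairs))


def negation(kg, e, r):
    """V_not: every entity v for which a_1 = r(e, v) does not hold.

    Assumption: the entity set E is every head or tail seen in kg.
    """
    entities = {x for (head, _, tail) in kg for x in (head, tail)}
    return entities - project(kg, e, r)


winners = project(TOY_KG, "Nobel Prize", "winners")
european_winners = intersection(
    TOY_KG, [("Nobel Prize", "winners"), ("Europe", "citizen")]
)
```

Here `winners` plays the role of $V_{p}$ for the query $q_{p}[\text{(Nobel Prize, winners)}]$ from the text, and `european_winners` is the corresponding $V_{\wedge}$ for a two-pair intersection query.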
In conventional methods for logical reasoning, the query operations were typically expressed through a geometric function. For example, the intersection of queries was represented as an intersection of box representations in Query2Box (Ren et al., 2020). However, in our proposed approach, LARK, we leverage the advanced reasoning capabilities of large language models (LLMs) and prioritize efficient decomposition of logical chains within the query to enhance performance. This novel strategy seeks to overcome the limitations of traditional methods by harnessing the power of LLMs in reasoning over KGs.
3.2 Neighborhood Retrieval and Logical Chain Decomposition
The foundation of LARK’s reasoning capability is built on large language models. Nevertheless, the limited input length of LLMs restricts their ability to process the entirety of a knowledge graph. Furthermore, while the set of entities and relations within a knowledge graph is unique, the reasoning behind logical operations remains universal. Therefore, we specifically tailor the LLM prompts to account for the above distinctive characteristics of logical reasoning over knowledge graphs. To address this need, we adopt a two-step process:
1. Query Abstraction: In order to make the process of logical reasoning over knowledge graphs more generalizable to different datasets, we propose to replace all the entities and relations in the knowledge graph and queries with a unique ID. This approach offers three significant advantages. Firstly, it reduces the number of tokens in the query, leading to improved LLM efficiency. Secondly, it allows us to solely utilize the reasoning ability of the language model, without relying on any external common sense knowledge of the underlying LLM. By avoiding the use of common sense knowledge, our approach mitigates the potential for model hallucination (which may lead to the generation of answers that are not supported by the KG). Finally, it removes any KG-specific information, thereby ensuring that the process remains generalizable to different datasets. While this may intuitively seem to result in a loss of information, our empirical findings, presented in Section 4.4, indicate that the impact on the overall performance is negligible.
2. Neighborhood Retrieval: In order to effectively answer logical queries, it is not necessary for the LLM to have access to the entire knowledge graph. Instead, the relevant neighborhoods containing the answers can be identified. Previous approaches (Guu et al., 2020; Chen et al., 2022) have focused on semantic retrieval for web documents. However, we note that logical queries are deterministic in nature, and thus we perform a $k$ -level depth-first traversal over the entities and relations present in the query, where $k$ is determined by the query type, e.g., for 3-level projection ( $3p$ ) queries, $k=3$ . Let $E^{1}_{\tau}$ and $R^{1}_{\tau}$ denote the set of entities and relations in query $Q_{\tau}$ for a query type $\tau$ , respectively. Then, the $k$ -level neighborhood of query $q_{\tau}$ is defined by $\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])$ as:
$$
\begin{aligned}
\mathcal{N}_{1}(q_{\tau}[Q_{\tau}]) &=\left\{(h,r,t):\left(h\in E^{1}_{\tau}\right),\left(r\in R^{1}_{\tau}\right),\left(t\in E^{1}_{\tau}\right)\right\} \\
E^{k}_{\tau} &=\{h,t:(h,r,t)\in\mathcal{N}_{k-1}(q_{\tau}[Q_{\tau}])\},\quad R^{k}_{\tau}=\{r:(h,r,t)\in\mathcal{N}_{k-1}(q_{\tau}[Q_{\tau}])\} \\
\mathcal{N}_{k}(q_{\tau}[Q_{\tau}]) &=\left\{(h,r,t):\left(h\in E^{k}_{\tau}\right),\left(r\in R^{k}_{\tau}\right),\left(t\in E^{k}_{\tau}\right)\right\}
\end{aligned}
\tag{5}
$$
We have taken steps to make our approach more generalizable and efficient by abstracting the query and limiting input context for LLMs. However, the complexity of a query still remains a concern. The complexity of a query type $\tau$ , denoted by $\mathcal{O}(q_{\tau})$ , is determined by the number of entities and relations it involves, i.e., $\mathcal{O}(q_{\tau})\propto|E_{\tau}|+|R_{\tau}|$ . In other words, the size of the query in terms of its constituent elements is a key factor in determining its computational complexity. This observation is particularly relevant in the context of LLMs, as previous studies have shown that their performance tends to decrease as the complexity of the queries they handle increases (Khot et al., 2023). To address this, we propose a logical query chain decomposition mechanism in LARK which reduces a complex multi-operation query to multiple single-operation queries. Due to the exhaustive set of operations, we apply the following strategy for decomposing the various query types:
- Reduce a $k$ -level projection query to $k$ one-level projection queries, e.g., a $3p$ query with one entity and three relations $e_{1}\xrightarrow{r_{1}}\xrightarrow{r_{2}}\xrightarrow{r_{3}}A$ is decomposed to $e_{1}\xrightarrow{r_{1}}A_{1},A_{1}\xrightarrow{r_{2}}A_{2},A_{2}\xrightarrow{r_{3}}A$ .
- Reduce a $k$ -intersection query to $k$ projection queries and an intersection query, e.g., a $3i$ query with intersection of three projection queries $(e_{1}\xrightarrow{r_{1}})\wedge(e_{2}\xrightarrow{r_{2}})\wedge(e_{3}\xrightarrow{r_{3}})=A$ is decomposed to $e_{1}\xrightarrow{r_{1}}A_{1},e_{2}\xrightarrow{r_{2}}A_{2},e_{3}\xrightarrow{r_{3}}A_{3},A_{1}\wedge A_{2}\wedge A_{3}=A$ . Similarly, reduce a $k$ -union query to $k$ projection queries and a union query.
The complete decomposition of the exhaustive set of query types used in previous work (Ren & Leskovec, 2020) and our empirical studies can be found in Appendix A.
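The two decomposition rules listed above can be sketched as small Python helpers. This is an assumption-laden illustration, not the authors' implementation: the step representation `(name, operation, arguments)` and the function names are hypothetical, intermediate and final answers are simply numbered $A_{1},A_{2},...$ (so the final answer $A$ is the last numbered set), and the prompt wording follows the questions shown in Figure 1(c), with the phrasing for more than two sets chosen here.

```python
def decompose_projection(entity, relations):
    """Turn a k-level projection into k one-level projections."""
    steps = []
    source = entity
    for i, relation in enumerate(relations, start=1):
        name = f"A{i}"
        steps.append((name, "project", (source, relation)))
        source = name  # the next projection starts from this answer set
    return steps


def decompose_set_query(pairs, op):
    """Turn a k-intersection or k-union into k projections plus one set step."""
    steps = [
        (f"A{i}", "project", (entity, relation))
        for i, (entity, relation) in enumerate(pairs, start=1)
    ]
    names = tuple(name for name, _, _ in steps)
    steps.append((f"A{len(pairs) + 1}", op, names))
    return steps


def to_prompt(step):
    """Render one elementary query as a question for the LLM."""
    _, op, args = step
    if op == "project":
        return f"What are the entities connected to {args[0]} by relation {args[1]}?"
    joined = ", ".join(args[:-1]) + " and " + args[-1]
    return f"What are the entities in the {op} of {joined}?"


three_p = decompose_projection("e1", ["r1", "r2", "r3"])
three_i = decompose_set_query(
    [("e1", "r1"), ("e2", "r2"), ("e3", "r3")], "intersection"
)
prompts = [to_prompt(step) for step in three_i]
```

For the $3i$ example, `three_i` contains three projection steps followed by a single intersection step over `A1`, `A2`, and `A3`, and `prompts` holds the four corresponding questions in the order they would be asked.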
<details>
<summary>extracted/2305.01157v3/images/model.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Query Processing
### Overview
The image is a flowchart illustrating a process for querying a knowledge graph. It starts with a logical query, decomposes it into sub-queries, retrieves relevant subgraphs, and uses a prompt template to generate context prompts. These prompts are then used by a Large Language Model (LLM) to generate logically-ordered answers, which are combined to produce a final answer.
### Components/Axes
* **Top-Left**: "Query Type" - A table showing different query types represented as graphs.
* Rows: p, 2i, ip, inp
* Rows: 2p, 3i, pi, pin
* Rows: 3p, 2u, up, pni
* **Top-Center**: "Knowledge Graph" - A purple cylinder representing the knowledge graph.
* **Top-Center**: "Neighborhood Retrieval" - A process of retrieving relevant subgraphs from the knowledge graph.
* **Top-Center**: "Relevant Subgraphs" - A collection of subgraphs.
* **Top-Right**: "Prompt Template" - A diamond shape representing the prompt template.
* **Center-Left**: "Logical Query" - An orange box containing the question "Name the Asian Nobel Prize Winners?". It includes a graph with nodes labeled "Nobel Prize" (blue), "Nobel Prize Winners" (green), "citizen" (blue), and "Asians" (green), connected by edges labeled "winner".
* **Center-Left**: "Query Abstraction" - A process of abstracting the logical query into a graph representation.
* **Center-Left**: "Entities and Relations" - A blue box.
* **Center**: "Logical Chain Decomposition" - A process of decomposing the query into logical chains.
* **Center**: "Decomposed Question Prompts" - A yellow box containing questions derived from the logical chains.
* **Center-Right**: "Context Prompt" - A yellow box containing a set of (h,r,t) triplets.
* **Center-Right**: "LLM" - A green box representing the Large Language Model.
* **Right**: "Logically-ordered Answers" - A green box containing logically-ordered answers.
* **Right**: "Final Answer" - An orange box containing the final answer: "Malala Yousafzai, Rabindranath Tagore, ...".
### Detailed Analysis
1. **Query Type**:
* p: Two nodes connected by an edge.
* 2p: Three nodes, two connected to the central node.
* 3p: Four nodes, three connected to the central node.
* 2i: Two nodes connected to a central node.
* 3i: Three nodes connected to a central node.
* 2u: Two nodes connected to a central node.
* ip: Two nodes connected to a central node.
* pi: Two nodes connected to a central node.
* up: Two nodes connected to a central node.
* inp: Three nodes connected in a chain.
* pin: Three nodes connected in a chain.
* pni: Three nodes connected in a chain.
2. **Logical Query**:
* The query is "Name the Asian Nobel Prize Winners?".
* The query is represented as a graph with nodes "Nobel Prize", "Nobel Prize Winners", "citizen", and "Asians".
* The nodes are connected by edges labeled "winner".
3. **Logical Chain Decomposition**:
* The query is decomposed into logical chains involving entities e1, e2 and relations r1, r2.
* The chains are combined using a logical AND operation (Λ).
4. **Decomposed Question Prompts**:
* "What are the entities connected to e1 by relation r1?"
* "What are the entities connected to e2 by relation r1?"
* "What are the entities in the intersection of P1 and P2?"
5. **Context Prompt**:
* "Given the following (h,r,t) triplets: (e1,r1,t1), (e2,r2,t1), (e2,r2,t2), (e1,r1,t2), (e1,r1,t3), (e2,r2,t4), (e2,r2,t5), (e1,r1,t6), ..."
6. **Logically-ordered Answers**:
* A1 = P1 = {t1, t2, t3, t6, ...}
* A2 = P2 = {t1, t2, t4, t5, ...}
* A3 = P3 = {t1, t2, ...}
7. **Final Answer**:
* "Malala Yousafzai, Rabindranath Tagore, ..."
### Key Observations
* The diagram illustrates a multi-step process for answering complex questions using a knowledge graph and a large language model.
* The process involves decomposing the query into sub-queries, retrieving relevant information from the knowledge graph, and using a prompt template to generate context prompts for the LLM.
* The LLM generates logically-ordered answers, which are combined to produce the final answer.
### Interpretation
The diagram presents a sophisticated approach to question answering that leverages the strengths of both knowledge graphs and large language models. By decomposing complex queries into smaller, more manageable sub-queries, the system can effectively retrieve relevant information from the knowledge graph. The use of a prompt template ensures that the LLM receives the necessary context to generate accurate and logically-sound answers. The final answer is then constructed by combining the individual answers to the sub-queries. This approach is particularly useful for answering questions that require reasoning and inference over structured knowledge.
</details>
Figure 2: An overview of the LARK model. The model takes the logical query and infers the query type from it. The query abstraction function maps the entities and relations to abstract IDs, and the neighborhood retrieval mechanism collects the relevant subgraphs from the overall knowledge graph. The chains of the abstracted complex query are then logically decomposed to simpler single-operation queries. The retrieved neighborhood and decomposed queries are further converted into LLM prompts using a template and then processed in the LLM to get the final set of answers for evaluation.
3.3 Chain Reasoning Prompts
In the previous section, we outlined our approach to limit the neighborhood and decompose complex queries into chains of simple queries. Leveraging these, we can now use the reasoning capability of LLMs to obtain the final set of answers for the query, as shown in Figure 2. To achieve this, we employ a prompt template that converts the neighborhood into a context prompt and the decomposed queries into question prompts. It is worth noting that certain queries in the decomposition depend on the responses of preceding queries, such as intersection relying on the preceding projection queries. Additionally, unlike previous prompting methods such as chain-of-thought (Wei et al., 2022) and decomposition (Khot et al., 2023) prompting, the answers need to be integrated at a certain position in the prompt. To address this issue, we maintain a placeholder in dependent queries and a temporary cache of preceding answers that can replace the placeholders in real-time. This also has the added benefit of maintaining the parallelizability of queries, as we can run batches of decomposed queries in phases instead of sequentially running each decomposed query. The specific prompt templates of the complex and decomposed logical queries for different query types are provided in Appendix B.
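The placeholder-and-cache mechanism described above can be sketched as follows. This is a minimal illustration, not the released implementation: the prompt wording, the `PHASES` structure, the `run_phases` helper, and the `llm_batch` callable are all hypothetical names introduced here. Queries within a phase are independent and are sent as one batch; answers are cached and substituted into the placeholders of later phases.

```python
from typing import Callable, Dict, List

# Hypothetical decomposed prompts for a 2i query. "{P1}" and "{P2}" are
# placeholders that are filled with the answers of earlier phases.
PHASES: List[Dict[str, str]] = [
    {   # Phase 1: independent projection queries, batched together.
        "P1": "Which entities are connected to e1 by relation r1?",
        "P2": "Which entities are connected to e2 by relation r2?",
    },
    {   # Phase 2: depends on the cached answers of phase 1.
        "P3": "Which entities are in the intersection of {P1} and {P2}?",
    },
]


def run_phases(
    phases: List[Dict[str, str]],
    context: str,
    llm_batch: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Run decomposed queries phase by phase, caching answers by name."""
    cache: Dict[str, str] = {}
    for phase in phases:
        names = list(phase)
        # Replace placeholders with cached answers, then prepend the context.
        prompts = [context + "\n" + phase[name].format(**cache) for name in names]
        answers = llm_batch(prompts)  # one batched call per phase
        cache.update(zip(names, answers))
    return cache
```

Here `llm_batch` stands in for any batched inference call that maps a list of prompts to a list of answer strings. The answer of the last phase (`cache["P3"]` in this example) would be taken as the final answer set.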
3.4 Implementation Details
We implemented LARK in PyTorch (Paszke et al., 2019) on eight Nvidia A100 GPUs with 40 GB VRAM. In the case of LLMs, we chose the Llama2 model (Touvron et al., 2023) due to its public availability in the Hugging Face library (Wolf et al., 2020). For efficient inference over the large-scale models, we relied on the mixed-precision version of LLMs and the DeepSpeed library (Rasley et al., 2020) with ZeRO stage 3 optimization. The algorithm of our model is provided in Appendix D, and the implementation code for all our experiments, with exact configuration files and datasets for reproducibility, is publicly available at https://github.com/Akirato/LLM-KG-Reasoning. In our experiments, the highest complexity of a query required a 3-hop neighborhood around the entities and relations. Hence, we set the depth limit to 3 (i.e., $k=3$). Additionally, to make our process compatible with different datasets, we added a limit of $n$ tokens on the input, which depends on the LLM (for Llama2, $n=4096$). In practice, this implies that we stop the depth-first traversal when the context becomes longer than $n$.
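The depth limit $k$ and token limit $n$ can be illustrated with a small sketch of a depth-first neighborhood retrieval. The function name, the undirected treatment of edges, and the whitespace-based token counter are assumptions made for illustration; an actual run would count tokens with the tokenizer of the underlying LLM. This sketch stops just before the budget would be exceeded.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for an LLM tokenizer (assumption for this sketch)."""
    return len(text.split())


def retrieve_neighborhood(
    triplets: Iterable[Triplet],
    seeds: Iterable[str],
    k: int = 3,
    n: int = 4096,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[Triplet]:
    """Collect triplets within k hops of the seeds, stopping at n tokens."""
    adjacency: Dict[str, List[Triplet]] = defaultdict(list)
    for h, r, t in triplets:
        adjacency[h].append((h, r, t))
        if t != h:
            adjacency[t].append((h, r, t))  # follow edges in both directions

    context: List[Triplet] = []
    seen = set()
    used = 0
    best_depth: Dict[str, int] = {}  # shallowest depth at which a node was expanded
    stack = [(seed, 0) for seed in reversed(list(seeds))]
    while stack:
        node, depth = stack.pop()
        # Skip nodes beyond the depth limit or already expanded at a shallower depth.
        if depth >= k or best_depth.get(node, k) <= depth:
            continue
        best_depth[node] = depth
        for triple in adjacency[node]:
            if triple not in seen:
                cost = count_tokens("({}, {}, {})".format(*triple))
                if used + cost > n:
                    return context  # token budget reached: stop the traversal
                seen.add(triple)
                context.append(triple)
                used += cost
            h, _, t = triple
            stack.append((t if h == node else h, depth + 1))
    return context
```

With a chain `a -> b -> c -> d -> e` and seed `a`, the default `k=3` keeps the first three triplets and leaves out the fourth, which lies beyond the 3-hop neighborhood.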
4 Experimental Results
This section describes our experiments that aim to answer the following research questions (RQs):
- RQ1. Does LARK outperform the state-of-the-art baselines on the task of logical reasoning over standard knowledge graph benchmarks?
- RQ2. How does our combination of chain decomposition query and logically-ordered answer mechanism perform in comparison with the standard prompting techniques?
- RQ3. How does the scale and design of LARK’s underlying LLM model affect its performance?
- RQ4. How would our model perform with support for increased token size?
- RQ5. Does query abstraction affect the reasoning performance of our model?
4.1 Datasets and Baselines
We select the following standard benchmark datasets to investigate the performance of our model against state-of-the-art models on the task of logical reasoning over knowledge graphs:
- FB15k (Bollacker et al., 2008) is based on Freebase, a large collaborative knowledge graph project that was created by Google. FB15k contains about 15,000 entities, 1,345 relations, and 592,213 triplets (statements that assert a fact about an entity).
- FB15k-237 (Toutanova et al., 2015) is a subset of FB15k, containing 14,541 entities, 237 relations, and 310,116 triplets. The relations in FB15k-237 are a subset of the relations in FB15k. The dataset was created to address some of the limitations of FB15k, such as the presence of many irrelevant or ambiguous relations, and to provide a more challenging benchmark for knowledge graph completion models.
- NELL995 (Carlson et al., 2010) was created using the Never-Ending Language Learning (NELL) system, which is a machine learning system that automatically extracts knowledge from the web by reading text and inferring new facts. NELL995 contains 9,959 entities, 200 relations, and 114,934 triplets. The relations in NELL995 cover a wide range of domains, including geography, sports, and politics.
Our criterion for selecting the above datasets was their ubiquity in previous works on this research problem. Further details on their token sizes are provided in Appendix E. For the baselines, we chose the following methods:
- GQE (Hamilton et al., 2018) encodes a query as a single vector and represents entities and relations in a low-dimensional space. It uses translation and deep set operators, which are modeled as projection and intersection operators, respectively.
- Query2Box (Q2B) (Ren et al., 2020) uses a box embedding model which is a generalization of the traditional vector embedding model and can capture richer semantics.
- BetaE (Ren & Leskovec, 2020) uses a novel beta distribution to model the uncertainty in the representation of entities and relations. BetaE can capture both the point estimate and the uncertainty of the embeddings, which leads to more accurate predictions in knowledge graph completion tasks.
- HQE (Choudhary et al., 2021b) uses the hyperbolic query embedding mechanism to model the complex queries in knowledge graph completion tasks.
- HypE (Choudhary et al., 2021b) uses the hyperboloid model to represent entities and relations in a knowledge graph that simultaneously captures their semantic, spatial, and hierarchical features.
- CQD (Arakelyan et al., 2021) decomposes complex queries into simpler sub-queries and applies a query-specific attention mechanism to the sub-queries.
4.2 RQ1. Efficacy on Logical Reasoning
To study the efficacy of our model on the task of logical reasoning, we compare it against the previous baselines on the following standard logical query constructs:
1. Multi-hop Projection traverses multiple relations from a head entity in a knowledge graph to answer complex queries by projecting the query onto the target entities. In our experiments, we consider $1p,2p$ , and $3p$ queries that denote 1-relation, 2-relation, and 3-relation hop from the head entity, respectively.
2. Geometric Operations apply the operations of intersection ( $\wedge$ ) and union ( $\vee$ ) to answer the query. Our experiments use $2i$ and $3i$ queries that represent the intersection over 2 and 3 entities, respectively. Also, we study $2u$ queries that perform union over 2 entities.
3. Compound Operations integrate multiple operations such as intersection, union, and projection to handle complex queries over a knowledge graph.
4. Negation Operations negate the query by finding entities that do not satisfy the given logic. In our experiments, we examine $2in,3in,inp,$ and $pin$ queries that negate $2i,3i,ip,$ and $pi$ queries, respectively. We also analyze $pni$ (an additional variant of the $pi$ query), where the negation is over both entities in the intersection. It should be noted that BetaE is the only method in the existing literature that supports negation, and hence, we only compare against it in our experiments.
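To make the query constructs above concrete, the following sketch evaluates a few of them as plain set operations on a toy, fully observed KG. The triplets and the `project` helper are made up for illustration, and the $2in$ query is written as the first branch minus the second, following the usual convention in the literature. This is symbolic lookup only; it is not how LARK or the embedding baselines answer queries over incomplete KGs.

```python
from typing import Set, Tuple

Triplet = Tuple[str, str, str]

# Toy knowledge graph with abstract entity and relation IDs (illustrative only).
KG: Set[Triplet] = {
    ("e1", "r1", "t1"), ("e1", "r1", "t2"), ("e1", "r1", "t3"),
    ("e2", "r2", "t1"), ("e2", "r2", "t2"), ("e2", "r2", "t4"),
    ("t1", "r3", "u1"), ("t3", "r3", "u2"),
}


def project(kg: Set[Triplet], heads: Set[str], relation: str) -> Set[str]:
    """Relation projection: all tails reachable from `heads` via `relation`."""
    return {t for h, r, t in kg if h in heads and r == relation}


p1 = project(KG, {"e1"}, "r1")      # 1p: {t1, t2, t3}
p2 = project(KG, {"e2"}, "r2")      # 1p: {t1, t2, t4}

two_p = project(KG, p1, "r3")       # 2p: projection applied to a projection
two_i = p1 & p2                     # 2i: intersection of two projections
two_u = p1 | p2                     # 2u: union of two projections
two_in = p1 - p2                    # 2in: first branch AND NOT second branch
ip = project(KG, two_i, "r3")       # ip: intersection followed by projection
```

On this toy graph, `two_i` contains `t1` and `t2`, and `two_in` contains only `t3`.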
We present the results of our experimental study, which compares the Mean Reciprocal Rank (MRR) score of the retrieved candidate entities using different query constructions. MRR is calculated as the average of the reciprocal ranks of the candidate entities. More metrics, such as HITS@K for K=1,3,10, are reported in Appendix C. To ensure a fair comparison, we selected the query constructions that were used in most of the previous works in this domain (Ren & Leskovec, 2020). An illustration of these query types is provided in Appendix A for better understanding. Our experiments show that LARK outperforms previous state-of-the-art baselines by $35\%-84\%$ on average across different query types, as reported in Table 1. We observe that the performance improvement is higher for simpler queries, where $1p>2p>3p$ and $2i>3i$. This suggests that LLMs are better at capturing breadth across relations but may not be as effective at capturing depth over multiple relations. Moreover, our evaluation also encompasses testing against challenging negation queries, for which BetaE (Ren & Leskovec, 2020) remains the only existing approach. Even in this complex scenario, our findings, as illustrated in Table 2, indicate that LARK significantly outperforms the baselines by $140\%$. This affirms the superior reasoning capabilities of our model in tackling complex query scenarios. Another point of note is that certain baselines such as CQD are able to outperform LARK on the FB15k dataset for certain query types such as $1p,3i$, and $ip$. The reason for this is that FB15k suffers from data leakage from the training set to the validation and testing sets (Toutanova et al., 2015). This unfairly benefits the training-based baselines over the inference-only LARK model.
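For readers unfamiliar with the metrics, a simplified computation of MRR and HITS@K is sketched below. The function names are hypothetical, and the sketch scores each query by the rank of its first correct candidate; the exact evaluation protocol used for the reported numbers (for example, filtered ranking per answer entity) may differ.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], answers: Set[str]) -> float:
    """1 / rank of the first correct candidate, or 0.0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in answers:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked candidates, gold answers) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, a) for r, a in queries) / len(queries)


def hits_at_k(ranked: Sequence[str], answers: Set[str], k: int) -> float:
    """1.0 if any of the top-k candidates is a correct answer, else 0.0."""
    return 1.0 if any(c in answers for c in ranked[:k]) else 0.0


# Two toy queries: the first correct answer is at rank 2 and rank 1.
toy = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
toy_mrr = mean_reciprocal_rank(toy)  # (1/2 + 1/1) / 2
```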
Table 1: Performance comparison between LARK and the baselines in terms of their efficacy of logical reasoning using MRR scores. The rows present various models and the columns correspond to different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
| Dataset | Model | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FB15k | GQE | 54.6 | 15.3 | 10.8 | 39.7 | 51.4 | 27.6 | 19.1 | 22.1 | 11.6 |
| FB15k | Q2B | 68.0 | 21.0 | 14.2 | 55.1 | 66.5 | 39.4 | 26.1 | 35.1 | 16.7 |
| FB15k | BetaE | 65.1 | 25.7 | 24.7 | 55.8 | 66.5 | 43.9 | 28.1 | 40.1 | 25.2 |
| FB15k | HQE | 54.3 | 33.9 | 23.3 | 38.4 | 50.6 | 12.5 | 24.9 | 35.0 | 25.9 |
| FB15k | HypE | 67.3 | 43.9 | 33.0 | 49.5 | 61.7 | 18.9 | 34.7 | 47.0 | 37.4 |
| FB15k | CQD | 79.4 | 39.6 | 27.0 | 74.0 | 78.2 | 70.0 | 43.3 | 48.4 | 17.5 |
| FB15k | LARK(complex) | 73.6 | 46.5 | 32.0 | 66.9 | 61.8 | 24.8 | 47.2 | 47.7 | 37.5 |
| FB15k | LARK(ours) | 73.6 | 49.3 | 35.1 | 67.8 | 62.6 | 29.3 | 54.5 | 51.9 | 37.7 |
| FB15k-237 | GQE | 35.0 | 7.2 | 5.3 | 23.3 | 34.6 | 16.5 | 10.7 | 8.2 | 5.7 |
| FB15k-237 | Q2B | 40.6 | 9.4 | 6.8 | 29.5 | 42.3 | 21.2 | 12.6 | 11.3 | 7.6 |
| FB15k-237 | BetaE | 39.0 | 10.9 | 10.0 | 28.8 | 42.5 | 22.4 | 12.6 | 12.4 | 9.7 |
| FB15k-237 | HQE | 37.6 | 20.9 | 16.9 | 25.3 | 35.2 | 17.3 | 8.2 | 15.6 | 17.9 |
| FB15k-237 | HypE | 49.0 | 34.3 | 23.7 | 33.9 | 44.0 | 18.6 | 30.5 | 41.0 | 26.0 |
| FB15k-237 | CQD | 44.5 | 11.3 | 8.1 | 32.0 | 42.7 | 25.3 | 15.3 | 13.4 | 4.8 |
| FB15k-237 | LARK(complex) | 70.0 | 34.0 | 21.5 | 43.4 | 42.2 | 18.7 | 38.4 | 49.2 | 25.1 |
| FB15k-237 | LARK(ours) | 70.0 | 36.9 | 24.5 | 44.3 | 43.1 | 23.2 | 45.6 | 56.6 | 25.4 |
| NELL995 | GQE | 32.8 | 11.9 | 9.6 | 27.5 | 35.2 | 18.4 | 14.4 | 8.5 | 8.8 |
| NELL995 | Q2B | 42.2 | 14.0 | 11.2 | 33.3 | 44.5 | 22.4 | 16.8 | 11.3 | 10.3 |
| NELL995 | BetaE | 53.0 | 13.0 | 11.4 | 37.6 | 47.5 | 24.1 | 14.3 | 12.2 | 8.5 |
| NELL995 | HQE | 35.5 | 20.9 | 18.9 | 23.2 | 36.3 | 8.8 | 13.7 | 21.3 | 15.5 |
| NELL995 | HypE | 46.0 | 30.6 | 27.9 | 33.6 | 48.6 | 31.8 | 13.5 | 20.7 | 26.4 |
| NELL995 | CQD | 50.7 | 18.4 | 13.8 | 39.8 | 49.0 | 29.0 | 22.0 | 16.3 | 9.9 |
| NELL995 | LARK(complex) | 83.2 | 39.8 | 27.6 | 49.3 | 48.0 | 18.7 | 19.6 | 8.3 | 36.8 |
| NELL995 | LARK(ours) | 83.2 | 42.3 | 31.0 | 49.9 | 48.7 | 23.1 | 23.0 | 20.1 | 37.2 |
Table 2: Performance comparison between LARK and the baseline for negation query types using MRR scores. The best results for each query type in every dataset are highlighted in bold font. Our model’s performance is significantly higher on most negation queries. However, the performance is limited in 3in and pni queries due to their high number of tokens (shown in Appendix E).
| Dataset | Model | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- |
| FB15k | BetaE | 14.3 | 14.7 | 11.5 | 6.5 | 12.4 |
| FB15k | LARK(complex) | 16.5 | 6.2 | 32.5 | 22.8 | 10.5 |
| FB15k | LARK(ours) | 17.5 | 7.0 | 34.7 | 26.7 | 11.1 |
| FB15k-237 | BetaE | 5.1 | 7.9 | 7.4 | 3.6 | 3.4 |
| FB15k-237 | LARK(complex) | 6.1 | 3.4 | 21.6 | 12.8 | 2.9 |
| FB15k-237 | LARK(ours) | 7.0 | 4.1 | 23.9 | 16.8 | 3.5 |
| NELL995 | BetaE | 5.1 | 7.8 | 10.0 | 3.1 | 3.5 |
| NELL995 | LARK(complex) | 8.9 | 5.3 | 23.0 | 10.4 | 6.3 |
| NELL995 | LARK(ours) | 10.4 | 6.6 | 25.4 | 13.6 | 7.6 |
4.3 RQ2. Advantages of Chain Decomposition
The aim of this experiment is to investigate the advantages of using chain decomposed queries over standard complex queries. We employ the same experimental setup described in Section 4.2. Our results, in Tables 1 and 2, demonstrate that utilizing chain decomposition contributes to a significant improvement of $20\%-33\%$ in our model’s performance. This improvement is a clear indication of the LLMs’ ability to capture a broad range of relations and effectively utilize this capability for enhancing the performance on complex queries. This study highlights the potential of using chain decomposition to overcome the limitations of complex queries and improve the efficiency of logical reasoning tasks. This finding is a significant contribution to the field of natural language processing and has implications for various other applications such as question-answering systems and knowledge graph completion. Overall, our results suggest that chain-decomposed queries could be a promising approach for improving the performance of LLMs on complex logical reasoning tasks.
4.4 RQ3. Analysis of LLM scale
This experiment analyzes the impact of the size of the underlying LLMs and query abstraction on the overall LARK model performance. To examine the effect of LLM size, we compared two variants of the Llama2 model, which have 7 billion and 13 billion parameters. Our evaluation results, presented in Table 3, show that the performance of the LARK model improves by $123\%$ from Llama2-7B to Llama2-13B. This indicates that increasing the number of LLM parameters can enhance the performance of the LARK model.
Table 3: MRR scores of LARK on the FB15k-237 dataset with underlying LLMs of different sizes. The best results for each query type are highlighted in bold font.
| LLM | Size | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2 | 7B | 73.1 | 33.2 | 20.6 | 10.6 | 25.2 | 25.9 | 17.2 | 20.8 | 24.3 | 4.0 | 1.8 | 14.2 | 7.4 | 1.9 |
| Llama2 | 13B | 73.6 | 49.3 | 35.1 | 67.8 | 62.6 | 29.3 | 54.5 | 51.9 | 37.7 | 7.0 | 4.1 | 23.9 | 16.8 | 3.5 |
4.5 RQ4. Study on Increased Token Limit of LLMs
From the dataset details provided in Appendix E, we observe that the token size of different query types shows considerable fluctuation from $58$ to over $100,000$. Unfortunately, the token limit of Llama2, considered as the base in our experiments, is 4096. This limit is insufficient to demonstrate the full potential performance of LARK on our tasks. To address this limitation, we consider the availability of models with higher token limits, such as GPT-3.5 (OpenAI, 2023). However, we acknowledge that these models are expensive to run and thus, we could not conduct a thorough analysis on the entire dataset. Nevertheless, to gain insight into LARK’s potential with increased token size, we randomly sampled 1000 queries per query type from each dataset with token length over 4096 and less than 4096 and compared our model on these queries with GPT-3.5 and Llama2 as the base. The evaluation results, displayed in Table 4, demonstrate that transitioning from Llama2 to GPT-3.5 can lead to a significant performance improvement of 29%-40% for the LARK model. This suggests that increasing the token limit of LLMs has significant potential for further performance enhancement.
Table 4: MRR scores of LARK with Llama2 and GPT LLMs as the underlying base models. The best results for each query type in every dataset are highlighted in bold font.
| Dataset | LLM | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FB15k | Llama2-7B | 23.4 | 21.5 | 22.6 | 3.4 | 3.0 | 26.1 | 18.4 | 14.8 | 3.9 | 9.5 | 4.7 | 21.7 | 26.4 | 5.8 |
| FB15k | Llama2-13B | 23.8 | 22.8 | 24.2 | 3.5 | 3.0 | 23.3 | 30.8 | 30.7 | 3.9 | 12.4 | 6.6 | 28.4 | 51.4 | 7.7 |
| FB15k | GPT-3.5 | 36.1 | 34.6 | 36.8 | 17.0 | 14.4 | 35.4 | 46.7 | 39.3 | 19.5 | 18.8 | 10.0 | 43.1 | 56.7 | 11.6 |
| FB15k-237 | Llama2-7B | 23.1 | 27.4 | 31.5 | 5.0 | 4.1 | 26.6 | 20.9 | 15.3 | 5.6 | 26.6 | 8.8 | 33.7 | 31.0 | 21.1 |
| FB15k-237 | Llama2-13B | 23.5 | 29.2 | 33.8 | 5.0 | 4.1 | 23.7 | 35.0 | 31.7 | 5.6 | 34.7 | 12.3 | 44.0 | 60.4 | 28.0 |
| FB15k-237 | GPT-3.5 | 35.7 | 44.2 | 51.2 | 24.8 | 20.2 | 36.0 | 53.1 | 40.6 | 28.1 | 52.5 | 18.7 | 66.8 | 66.6 | 42.4 |
| NELL995 | Llama2-7B | 28.0 | 24.4 | 27.6 | 3.7 | 3.2 | 24.0 | 8.4 | 14.5 | 5.7 | 14.0 | 7.7 | 23.1 | 21.3 | 13.4 |
| NELL995 | Llama2-13B | 28.4 | 26.0 | 29.5 | 3.7 | 3.2 | 21.5 | 14.1 | 25.4 | 5.7 | 18.3 | 10.8 | 30.1 | 30.2 | 17.7 |
| NELL995 | GPT-3.5 | 43.1 | 39.4 | 44.8 | 18.3 | 15.5 | 32.6 | 21.4 | 38.5 | 28.3 | 27.7 | 16.4 | 45.7 | 45.9 | 26.8 |
4.6 RQ5. Effects of Query Abstraction
<details>
<summary>extracted/2305.01157v3/images/query_abstraction.png Details</summary>

### Visual Description
## Horizontal Bar Chart: MRR Score Comparison on FB15k-237 Dataset
### Overview
The image is a horizontal bar chart comparing the MRR (Mean Reciprocal Rank) scores of two models, "LARK (semantic)" and "LARK (ours)", across five different categories: Negation, Compound Operation, Geometric Operation, Multi-hop Projection, and Simple Projection. The x-axis represents the MRR score on the FB15k-237 dataset, ranging from 0 to 60. The y-axis represents the categories.
### Components/Axes
* **X-axis:** MRR Score on FB15k-237 Dataset, with scale markers at 0, 20, 40, and 60.
* **Y-axis:** Categories: Negation, Compound Operation, Geometric Operation, Multi-hop Projection, and Simple Projection.
* **Legend:** Located at the bottom of the chart.
* Orange: LARK (semantic)
* Blue: LARK (ours)
### Detailed Analysis
Here's a breakdown of the MRR scores for each category and model:
* **Negation:**
* LARK (semantic) (Orange): Approximately 10
* LARK (ours) (Blue): Approximately 8
* **Compound Operation:**
* LARK (semantic) (Orange): Approximately 38
* LARK (ours) (Blue): Approximately 35
* **Geometric Operation:**
* LARK (semantic) (Orange): Approximately 60
* LARK (ours) (Blue): Approximately 58
* **Multi-hop Projection:**
* LARK (semantic) (Orange): Approximately 38
* LARK (ours) (Blue): Approximately 35
* **Simple Projection:**
* LARK (semantic) (Orange): Approximately 68
* LARK (ours) (Blue): Approximately 65
### Key Observations
* For all categories, "LARK (semantic)" (orange) has a slightly higher MRR score than "LARK (ours)" (blue).
* The "Simple Projection" category has the highest MRR scores for both models, significantly higher than the other categories.
* The "Negation" category has the lowest MRR scores for both models.
* The difference in MRR scores between the two models is relatively consistent across all categories, with "LARK (semantic)" performing slightly better.
### Interpretation
The chart suggests that "LARK (semantic)" generally outperforms "LARK (ours)" on the FB15k-237 dataset across the tested categories. The "Simple Projection" task appears to be the easiest for both models, while "Negation" is the most challenging. The consistent difference in performance between the two models suggests a systematic advantage for "LARK (semantic)" in this context. The data highlights the relative difficulty of different knowledge graph reasoning tasks, with projection operations being easier than negation or compound operations.
</details>
Figure 3: Effects of Query Abstraction.
Regarding the analysis of query abstraction, we considered a variant of LARK called ‘LARK (semantic)’, which retains semantic information in KG entities and relations. As shown in Figure 3, we observe that semantic information provides a minor performance enhancement of $0.01\%$ for simple projection queries. However, in more complex queries, it results in a performance degradation of $0.7\%-1.4\%$ . The primary cause of this degradation is that the inclusion of semantic information exceeds the LLMs’ token limit, leading to a loss of neighborhood information. Hence, we assert that query abstraction is not only a valuable technique for mitigating model hallucination and achieving generalization across different KG datasets but can also enhance performance by reducing token size.
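Query abstraction, as contrasted here with the semantic variant, amounts to replacing entity and relation names with short abstract IDs before prompting and mapping the IDs back afterwards. The sketch below is a hypothetical illustration of that idea; the `abstract_triplets` helper and the example triplets are assumptions, not the released code.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]


def abstract_triplets(
    triplets: List[Triplet],
) -> Tuple[List[Triplet], Dict[str, str], Dict[str, str]]:
    """Map entity and relation names to abstract IDs (e1, e2, ..., r1, r2, ...)."""
    entity_ids: Dict[str, str] = {}
    relation_ids: Dict[str, str] = {}

    def ent(name: str) -> str:
        # A new name gets the next free ID; a known name keeps its ID.
        return entity_ids.setdefault(name, "e{}".format(len(entity_ids) + 1))

    def rel(name: str) -> str:
        return relation_ids.setdefault(name, "r{}".format(len(relation_ids) + 1))

    abstracted = [(ent(h), rel(r), ent(t)) for h, r, t in triplets]
    return abstracted, entity_ids, relation_ids


# Toy example loosely based on the query in Figure 2 (illustrative triplets).
semantic = [
    ("Nobel Prize", "winner", "Malala Yousafzai"),
    ("Asia", "citizen", "Malala Yousafzai"),
]
abstracted, entity_ids, relation_ids = abstract_triplets(semantic)

# Inverse map used to translate abstract answers back to entity names.
id_to_entity = {v: k for k, v in entity_ids.items()}
```

Because the abstract IDs are typically shorter than the original surface forms, the same token budget can hold more of the retrieved neighborhood, which is consistent with the explanation given above for the degradation of the semantic variant.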
5 Concluding Discussion
In this paper, we presented LARK, the first approach to integrate logical reasoning over knowledge graphs with the capabilities of LLMs. Our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Through our experiments on logical reasoning across standard KG datasets, we demonstrated that LARK outperforms previous state-of-the-art approaches by a significant margin on 14 different FOL query types. Finally, our work also showed that the performance of LARK improves with increasing scale and better design of the underlying LLMs. We demonstrated that LLMs that can handle larger input token lengths can lead to significant performance improvements. Overall, our approach presents a promising direction for integrating LLMs with logical reasoning over knowledge graphs.
The proposed approach of using LLMs for complex logical reasoning over KGs is expected to pave a new way for improved reasoning over large, noisy, and incomplete real-world KGs. This can potentially have a significant impact on various applications such as natural language understanding, question answering systems, and intelligent information retrieval systems. For example, in healthcare, KGs can be used to represent patient data, medical knowledge, and clinical research, and logical reasoning over these KGs can enable better diagnosis, treatment, and drug discovery. However, there are also ethical considerations that must be taken into account. As with most AI-based technologies, there is a potential risk of inducing bias into the model, which can lead to unfair decisions and actions. Bias can be introduced in the KGs themselves, as they are often created semi-automatically from biased sources, and can be amplified by the logical reasoning process. Moreover, the large amount of data used to train LLMs can also introduce bias, as it may reflect societal prejudices and stereotypes. Therefore, it is essential to carefully monitor and evaluate the KGs and LLMs used in this approach to ensure fairness and avoid discrimination. The performance of this method is also dependent on the quality and completeness of the KGs used, and on the limited token size of current LLMs. However, we also observe that the current trend of increasing LLM token limits will soon resolve some of these limitations.
References
- Arakelyan et al. (2021) Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. Complex query answering with neural link predictors. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Mos9F9kDwkz.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 1247–1250, New York, NY, USA, 2008. Association for Computing Machinery. URL https://doi.org/10.1145/1376616.1376746.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pp. 1306–1313. AAAI Press, 2010.
- Chen et al. (2022) Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. Decoupling knowledge from memorization: Retrieval-augmented prompt learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Q8GnGqT-GTJ.
- Choudhary et al. (2021a) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan Reddy. Probabilistic entity representation model for reasoning over knowledge graphs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 23440–23451. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/c4d2ce3f3ebb5393a77c33c0cd95dc93-Paper.pdf.
- Choudhary et al. (2021b) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K. Reddy. Self-supervised hyperboloid representations from logical queries over knowledge graphs. In Proceedings of the Web Conference 2021, WWW ’21, pp. 1373–1384, New York, NY, USA, 2021b. Association for Computing Machinery. URL https://doi.org/10.1145/3442381.3449974.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3Pf3Wg6o-A4.
- Das et al. (2017) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 132–141, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-1013.
- Dong et al. (2023) Junnan Dong, Qinggang Zhang, Xiao Huang, Keyu Duan, Qiaoyu Tan, and Zhimeng Jiang. Hierarchy-aware multi-hop question answering over knowledge graphs. In Proceedings of the Web Conference 2023, WWW ’23, New York, NY, USA, 2023. Association for Computing Machinery. URL https://doi.org/10.1145/3543507.3583376.
- Dua et al. (2022) Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing complex questions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1251–1265, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.81.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- Hamilton et al. (2018) Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/ef50c335cca9f340bde656363ebd02fd-Paper.pdf.
- Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.82.
- Khot et al. (2023) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_nGgzQjzaRy.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 809–816, Madison, WI, USA, 2011. Omnipress.
- OpenAI (2023) OpenAI. GPT-4 technical report. arXiv, 2023.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. URL https://doi.org/10.1145/3394486.3406703.
- Ren & Leskovec (2020) Hongyu Ren and Jure Leskovec. Beta embeddings for multi-hop logical reasoning in knowledge graphs. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
- Ren et al. (2020) Hongyu Ren, Weihua Hu, and Jure Leskovec. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgr4kSFDS.
- Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 697–706, New York, NY, USA, 2007. Association for Computing Machinery. URL https://doi.org/10.1145/1242572.1242667.
- Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL https://aclanthology.org/D15-1174.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-demos.6.
- Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 535–546, Online, June 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.naacl-main.45.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM.
Appendix
Appendix A Query Decomposition of Different Query Types
Figure 4 provides the query decomposition of different query types considered in our empirical study as well as previous literature in the area.
<details>
<summary>extracted/2305.01157v3/images/query_decomposition.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Reasoning Paths
### Overview
The image presents a series of diagrams illustrating different reasoning paths in knowledge graphs. Each diagram shows how to derive an answer (A) from question entities (e) using relations (r), intersections, unions, and negations. The diagrams are labeled with codes like "1p", "2p", "3i", etc., likely representing different types of reasoning patterns.
### Components/Axes
* **Nodes:**
* Orange circles labeled "e1", "e2", "e3": Represent Question Entities.
* Orange squares labeled "A", "A1", "A2", "A3", "A4": Represent Answer Entities.
* Green circles labeled "r1", "r2", "r3": Represent Relations.
* Light Blue circles with "^": Represent Intersection over the entity sets.
* Pink circles with "V": Represent Union over the entity sets.
* Pink circles with "¬r2", "¬r3": Represent Negation of a relation.
* **Edges:** Arrows indicate the direction of the relation or operation.
* **Layout:** Each reasoning path is shown in two stages, separated by a downward-pointing arrow. The first stage shows the initial entities and relations, and the second stage shows the derived entities and relations.
* **Legend (Bottom-Right):**
* Green circle with "r": Projection with Relation r
* Orange circle with "e": Question Entities
* Orange square with "A": Answer Entities
* Light Blue circle with "^": Intersection over the entity sets
* Pink circle with "V": Union over the entity sets
* Pink circle with "¬": Negation of a relation
### Detailed Analysis
**1p:**
* e1 -> r1 -> A
* After the arrow: e1 -> r1 -> A
**2p:**
* e1 -> r1 -> r2 -> A
* After the arrow: e1 -> r1 -> A1, A1 -> r2 -> A
**3p:**
* e1 -> r1 -> r2 -> r3 -> A
* After the arrow: e1 -> r1 -> A1, A1 -> r2 -> A2, A2 -> r3 -> A
**2i:**
* e1 -> r1 -> ^ -> A
* e2 -> r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, A1 ^ A2 -> A
**3i:**
* e1 -> r1 -> ^ -> A
* e2 -> r2 ->
* e3 -> r3 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, e3 -> r3 -> A3, A1 ^ A2 ^ A3 -> A4
**ip:**
* e1 -> r1 -> ^ -> r3 -> A
* e2 -> r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, A1 ^ A2 -> r3 -> A
**pi:**
* e1 -> r1 -> r2 -> ^ -> A
* e2 -> r3 ->
* After the arrow: e1 -> r1 -> A1, A1 -> r2 -> A2, e2 -> r3 -> A3, A2 ^ A3 -> A4
**2u:**
* e1 -> r1 -> V -> A
* e2 -> r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, A1 V A2 -> A
**up:**
* e1 -> r1 -> V -> r3 -> A
* e2 -> r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, A1 V A2 -> A3, A3 -> r3 -> A
**2in:**
* e1 -> r1 -> ¬r2 -> ^ -> A
* e2 -> r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> ¬r2 -> A2, A1 ^ A2 -> A
**3in:**
* e1 -> r1 -> ^ -> A
* e2 -> r2 ->
* e3 -> ¬r3 ->
* After the arrow: e1 -> r1 -> A1, e2 -> r2 -> A2, e3 -> ¬r3 -> A3, A1 ^ A2 ^ A3 -> A4
**inp:**
* e1 -> r1 -> ^ -> r3 -> A
* e2 -> ¬r2 ->
* After the arrow: e1 -> r1 -> A1, e2 -> ¬r2 -> A2, A1 ^ A2 -> r3 -> A
**pin:**
* e1 -> r1 -> r2 -> ¬r3 -> ^ -> A
* e2 -> r3 ->
* After the arrow: e1 -> r1 -> A1, A1 -> r2 -> A2, e2 -> r3 -> A3, A2 ^ A3 -> A4
**pni:**
* e1 -> r1 -> r2 -> ^ -> A
* e2 -> ¬r3 ->
* After the arrow: e1 -> r1 -> A1, A1 -> r2 -> A2, e2 -> ¬r3 -> A3, A2 ^ A3 -> A4
### Key Observations
* The diagrams illustrate various ways to combine entities and relations to infer new entities.
* The use of intersection, union, and negation allows for complex reasoning patterns.
* The "p" in the diagram labels likely stands for "projection" (matching the legend entry "Projection with Relation r"), "i" for "intersection", "u" for "union", and "n" for "negation".
### Interpretation
The diagrams provide a visual representation of different reasoning strategies in knowledge graphs. They demonstrate how to derive answers to complex queries by combining information from multiple sources and applying logical operations. These patterns are fundamental to knowledge graph reasoning and can be used to develop algorithms for question answering, information retrieval, and other applications. The diagrams highlight the importance of relations, intersections, unions, and negations in knowledge representation and reasoning.
</details>
Figure 4: Query Decomposition of different query types considered in our experiments.
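The decompositions in Figure 4 can be read as set operations over the triplets of a KG. The following is a minimal sketch of that reading, not the authors' implementation: the helper names (`project`, `answer_2p`, `answer_2i`, `answer_up`, `answer_2in`), the toy triplets, and the choice to follow edges from head to tail are all assumptions made for illustration. Negation is modeled the way the prompt templates phrase it, as "any relation other than r".

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def project(kg: Iterable[Triplet], sources: Set[str], relation: str,
            negated: bool = False) -> Set[str]:
    """Tails reached from any source entity via `relation`.

    With negated=True, edges labeled with any relation other than
    `relation` are followed instead (assumed reading of negation).
    """
    return {
        t for h, r, t in kg
        if h in sources and ((r != relation) if negated else (r == relation))
    }


def answer_2p(kg, e1, r1, r2):
    # 2p: e1 -r1-> A1, then A1 -r2-> A
    a1 = project(kg, {e1}, r1)
    return project(kg, a1, r2)


def answer_2i(kg, e1, r1, e2, r2):
    # 2i: intersection of two single projections
    return project(kg, {e1}, r1) & project(kg, {e2}, r2)


def answer_up(kg, e1, r1, e2, r2, r3):
    # up: union of two projections, followed by one more projection
    a3 = project(kg, {e1}, r1) | project(kg, {e2}, r2)
    return project(kg, a3, r3)


def answer_2in(kg, e1, r1, e2, r2):
    # 2in: intersection where the second branch is negated
    return project(kg, {e1}, r1) & project(kg, {e2}, r2, negated=True)


# Toy knowledge graph (hypothetical data).
TOY_KG = [
    ("a", "r1", "b"), ("a", "r1", "c"),
    ("b", "r2", "d"), ("c", "r2", "e"),
    ("x", "r2", "c"), ("x", "r3", "b"),
    ("c", "r3", "f"),
]
```

Each multi-operation query type reduces to a chain of these elementary steps, and every intermediate set (A1, A2, ...) is the input of the next step.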
Appendix B Prompt Templates of Different Query Types
The prompt templates for full complex logical queries with multiple operations and for decomposed elementary logical queries with a single operation are provided in Tables 5 and 6, respectively.
Table 5: Full Prompt Templates of Different Query Types.
| Query type | Logical expression | Prompt template |
| --- | --- | --- |
| Context | $\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r; $(h_{1},r_{1},t_{1}),(h_{2},r_{2},t_{2}),(h_{3},r_{3},t_{3}),(h_{4},r_{4},t_{4}),(h_{5},r_{5},t_{5}),(h_{6},r_{6},t_{6}),(h_{7},r_{7},t_{7}),(h_{8},r_{8},t_{8})$ |
| 1p | $∃ X.r_{1}(X,e_{1})$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| 2p | $∃ X.r_{1}(X,∃ Y.r_{2}(Y,e_{1}))$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ . Then, what are the entities connected to E by relation $r_{2}$ ? |
| 3p | $∃ X.r_{1}(X,∃ Y.r_{2}(Y,∃ Z.r_{3}(Z,e_{1})))$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and the set of entities F is connected to entities in E by relation $r_{2}$ . Then, what are the entities connected to F by relation $r_{3}$ ? |
| 2i | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and the set of entities F is connected to entity $e_{2}$ by relation $r_{2}$ . Then, what are the entities in the intersection of set E and F, i.e., entities present in both E and F? |
| 3i | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , the set of entities F is connected to entity $e_{2}$ by relation $r_{2}$ and the set of entities G is connected to entity $e_{3}$ by relation $r_{3}$ . Then, what are the entities in the intersection of set E, F and G, i.e., entities present in all E, F and G? |
| ip | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\wedge r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$ , and G is the set of entities in the intersection of E and F. Then, what are the entities connected to entities in set G by relation $r_{3}$ ? |
| pi | $∃ X.[r_{1}(X,∃ Y.r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , F is the set of entities connected to entities in E by relation $r_{2}$ , and G is the set of entities connected to entity $e_{2}$ by relation $r_{3}$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
| 2u | $∃ X.[r_{1}(X,e_{1})\vee r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$ . Then, what are the entities in the union of set E and F, i.e., entities present in either E or F? |
| up | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\vee r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$ . G is the set of entities in the union of E and F. Then, what are the entities connected to entities in G by relation $r_{3}$ ? |
| 2in | $∃ X.[r_{1}(X,e_{1})\wedge\neg r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{2}$ . Then, what are the entities in the intersection of set E and F, i.e., entities present in both E and F? |
| 3in | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge\neg r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$ , and G is the set of entities connected to entity $e_{3}$ by any relation other than relation $r_{3}$ . Then, what are the entities in the intersection of set E, F and G, i.e., entities present in all E, F and G? |
| inp | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\wedge\neg r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , and F is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{2}$ . Then, what are the entities that are connected to the entities in the intersection of set E and F by relation $r_{3}$ ? |
| pin | $∃ X.[r_{1}(X,∃ Y.\neg r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , F is the set of entities connected to entities in E by relation $r_{2}$ , and G is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{3}$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
| pni | $∃ X.[r_{1}(X,∃ Y.\neg r_{2}(Y,e_{2}))\wedge\neg r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ , F is the set of entities connected to entities in E by any relation other than $r_{2}$ , and G is the set of entities connected to entity $e_{2}$ by relation $r_{3}$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
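
As an illustration of how a full prompt from Table 5 might be assembled, the sketch below concatenates the context row with a query-type template. The function names (`context_prompt`, `full_prompt`), the `FULL_TEMPLATES` dictionary, and the slot names `e1`, `r1`, `r2` are hypothetical and not taken from the paper's code. Only the 1p and 2p templates are included, and their wording is copied from the table.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)

# Two of the Table 5 templates, with slots for entity and relation IDs.
FULL_TEMPLATES: Dict[str, str] = {
    "1p": "Which entities are connected to {e1} by relation {r1}?",
    "2p": (
        "Let us assume that the set of entities E is connected to entity {e1} "
        "by relation {r1}. Then, what are the entities connected to E "
        "by relation {r2}?"
    ),
}


def context_prompt(triplets: List[Triplet]) -> str:
    """Render the neighborhood triplets as the context sentence."""
    body = ", ".join(f"({h},{r},{t})" for h, r, t in triplets)
    return (
        "Given the following (h,r,t) triplets where entity h is related "
        f"to entity t by relation r; {body}"
    )


def full_prompt(query_type: str, triplets: List[Triplet], **slots: str) -> str:
    """Context followed by the filled-in template for one query type."""
    question = FULL_TEMPLATES[query_type].format(**slots)
    return context_prompt(triplets) + "\n" + question


if __name__ == "__main__":
    neighborhood = [("e1", "r1", "e5"), ("e5", "r2", "e7")]
    print(full_prompt("2p", neighborhood, e1="e1", r1="r1", r2="r2"))
```

Keeping the context and the question as separate pieces means the same neighborhood can be reused for every sub-question of a decomposed query.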
Table 6: Decomposed Prompt Templates of Different Query Types.
| Query type | Logical expression | Decomposed prompt templates |
| --- | --- | --- |
| Context | $\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r; $(h_{1},r_{1},t_{1}),(h_{2},r_{2},t_{2}),(h_{3},r_{3},t_{3}),(h_{4},r_{4},t_{4}),(h_{5},r_{5},t_{5}),(h_{6},r_{6},t_{6}),(h_{7},r_{7},t_{7}),(h_{8},r_{8},t_{8})$ |
| 1p | $∃ X.r_{1}(X,e_{1})$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| 2p | $∃ X.r_{1}(X,∃ Y.r_{2}(Y,e_{1}))$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to any entity in [PP1] by relation $r_{2}$ ? |
| 3p | $∃ X.r_{1}(X,∃ Y.r_{2}(Y,∃ Z.r_{3}(Z,e_{1})))$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to any entity in [PP1] by relation $r_{2}$ ? |
| | | Which entities are connected to any entity in [PP2] by relation $r_{3}$ ? |
| 2i | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| 3i | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | Which entities are connected to $e_{3}$ by relation $r_{3}$ ? |
| | | What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]? |
| ip | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\wedge r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| | | What are the entities connected to any entity in [PP3] by relation $r_{3}$ ? |
| pi | $∃ X.[r_{1}(X,∃ Y.r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to [PP1] by relation $r_{2}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{3}$ ? |
| | | What are the entities in the intersection of entity sets [PP2] and [PP3]? |
| 2u | $∃ X.[r_{1}(X,e_{1})\vee r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | What are the entities in the union of entity sets [PP1] and [PP2]? |
| up | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\vee r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | What are the entities in the union of entity sets [PP1] and [PP2]? |
| | | Which entities are connected to any entity in [PP3] by relation $r_{3}$ ? |
| 2in | $∃ X.[r_{1}(X,e_{1})\wedge\neg r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by any relation other than $r_{2}$ ? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| 3in | $∃ X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge\neg r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{2}$ ? |
| | | Which entities are connected to $e_{3}$ by any relation other than $r_{3}$ ? |
| | | What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]? |
| inp | $∃ X.r_{3}(X,∃ Y.[r_{1}(Y,e_{1})\wedge\neg r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to $e_{2}$ by any relation other than $r_{2}$ ? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| | | What are the entities connected to any entity in [PP3] by relation $r_{3}$ ? |
| pin | $∃ X.[r_{1}(X,∃ Y.\neg r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to entity set in [PP1] by relation $r_{2}$ ? |
| | | Which entities are connected to $e_{2}$ by any relation other than $r_{3}$ ? |
| | | What are the entities in the intersection of entity sets [PP2] and [PP3]? |
| pni | $∃ X.[r_{1}(X,∃ Y.\neg r_{2}(Y,e_{2}))\wedge\neg r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$ ? |
| | | Which entities are connected to any entity in [PP1] by any relation other than $r_{2}$ ? |
| | | Which entities are connected to $e_{2}$ by relation $r_{3}$ ? |
| | | What are the entities in the intersection of entity sets [PP2] and [PP3]? |
Appendix C Analysis of Logical Reasoning Performance using HITS Metric
Tables 7 and 8 present the HITS@K results of the baseline and our model for K=1, 3, and 10. HITS@K indicates the accuracy of predicting correct candidates in the top-K results.
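To make the metric concrete, the following sketch computes HITS@K under one common reading: a query counts as a hit if at least one gold answer appears among its top-K ranked predictions. This reading, the function name `hits_at_k`, and the example data are assumptions; the exact evaluation protocol used for the tables is not spelled out here.

```python
from typing import Sequence, Set


def hits_at_k(ranked_predictions: Sequence[Sequence[str]],
              gold_answers: Sequence[Set[str]],
              k: int) -> float:
    """Fraction of queries with at least one gold answer in the top-k predictions."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("need one gold answer set per query")
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)


if __name__ == "__main__":
    # Two hypothetical queries: the first has its gold answer at rank 2,
    # the second never retrieves its gold answer.
    predictions = [["a", "b", "c"], ["x", "y", "z"]]
    gold = [{"b"}, {"q"}]
    for k in (1, 3, 10):
        print(k, hits_at_k(predictions, gold, k))
```

Multiplying the returned fraction by 100 gives percentages on the same scale as the table entries.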
Table 7: Performance comparison between LARK and the baseline on logical reasoning, measured by HITS@K scores for K=1,3,10. The rows correspond to the models and columns denote the different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
| Dataset | Variant | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HITS@1 | | | | | | | | | | |
| FB15k | Llama2-7B | 74.6 | 26 | 18.5 | 59.9 | 47.7 | 2.4 | 5.7 | 5.8 | 5 |
| | complex | 77.5 | 37.9 | 26.3 | 67.4 | 54.6 | 8.2 | 20.7 | 20.7 | 17.6 |
| | step | 77.5 | 41.8 | 28.1 | 70.2 | 57.3 | 10.3 | 24.3 | 22.8 | 17.8 |
| FB15k-237 | Llama2-7B | 77.2 | 28.5 | 17.7 | 10.9 | 22.6 | 10.8 | 8.7 | 10.5 | 13.2 |
| | complex | 78.5 | 30.8 | 19.3 | 41.1 | 38.1 | 9.6 | 18.7 | 24.2 | 14.0 |
| | step | 78.5 | 34.3 | 21.3 | 43.2 | 40.2 | 11.7 | 22.2 | 27.9 | 14.2 |
| NELL995 | Llama2-7B | 86.4 | 28.3 | 19.6 | 10.2 | 24 | 8.6 | 3.5 | 1.5 | 15.9 |
| | complex | 88.0 | 30.9 | 21.7 | 44.1 | 41.6 | 7.4 | 8.2 | 3.3 | 17 |
| | step | 88.0 | 34.3 | 24.0 | 46.1 | 43.8 | 9.5 | 9.8 | 8.9 | 17.3 |
| HITS@3 | | | | | | | | | | |
| FB15k | Llama2-7B | 74 | 53.4 | 34.6 | 18.2 | 36.4 | 44.7 | 39.4 | 35.7 | 77.1 |
| | complex | 77.7 | 57.6 | 37.9 | 68.5 | 61.3 | 39.6 | 84.8 | 82.9 | 81.7 |
| | step | 77.7 | 57.4 | 40.1 | 69.4 | 62.5 | 48.4 | 91.2 | 92.7 | 82.6 |
| FB15k-237 | Llama2-7B | 75.9 | 42.6 | 25.7 | 12.6 | 25.9 | 43.6 | 35.1 | 42.9 | 53.8 |
| | complex | 78.3 | 45.9 | 28.1 | 47.2 | 43.7 | 38.7 | 75.6 | 89.4 | 57 |
| | step | 78.3 | 45.9 | 29.8 | 48.2 | 44.6 | 47.3 | 80.0 | 93.6 | 57.6 |
| NELL995 | Llama2-7B | 85.6 | 42.9 | 28.7 | 11.8 | 27.6 | 34.6 | 14.1 | 5.7 | 65 |
| | complex | 87.8 | 46.8 | 31.6 | 50.7 | 47.9 | 29.8 | 32.9 | 13.2 | 69.4 |
| | step | 87.8 | 45.7 | 33.5 | 51.3 | 48.7 | 38.1 | 39.6 | 35.8 | 70.3 |
| HITS@10 | | | | | | | | | | |
| FB15k | Llama2-7B | 73.6 | 53.9 | 35.7 | 18.1 | 36.3 | 44.6 | 39.5 | 35.7 | 77.1 |
| | complex | 77.7 | 58.2 | 39.1 | 68.2 | 61.4 | 39.5 | 85 | 82.9 | 81.7 |
| | step | 77.7 | 57.4 | 46.0 | 69.4 | 62.5 | 48.2 | 91.2 | 84.7 | 82.6 |
| FB15k-237 | Llama2-7B | 75.2 | 43 | 26.5 | 12.6 | 25.9 | 43.6 | 35.1 | 42.9 | 53.8 |
| | complex | 78.3 | 46.4 | 29 | 47.3 | 43.8 | 38.7 | 75.6 | 89.4 | 57 |
| | step | 78.3 | 45.9 | 34.1 | 48.2 | 44.6 | 47.3 | 80.0 | 93.6 | 57.6 |
| NELL995 | Llama2-7B | 84.9 | 43.4 | 29.2 | 11.8 | 27.6 | 34.6 | 14.1 | 5.7 | 65 |
| | complex | 87.8 | 47.4 | 32.2 | 50.8 | 48 | 29.8 | 32.9 | 13.2 | 69.4 |
| | step | 87.8 | 45.7 | 38.3 | 51.3 | 48.7 | 38.1 | 39.6 | 35.8 | 70.3 |
Table 8: Performance comparison between LARK and the baseline for negation query types using HITS@K=1,3,10 scores. The best results for each query type in every dataset are given in bold font.
| Dataset | Variant | 2in (HITS@1) | 3in (HITS@1) | inp (HITS@1) | pin (HITS@1) | pni (HITS@1) | 2in (HITS@3) | 3in (HITS@3) | inp (HITS@3) | pin (HITS@3) | pni (HITS@3) | 2in (HITS@10) | 3in (HITS@10) | inp (HITS@10) | pin (HITS@10) | pni (HITS@10) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FB15k | Llama2-7B | 1.8 | 0.7 | 4.0 | 2.1 | 0.9 | 18.6 | 5.7 | 40.8 | 18.8 | 8.6 | 18.6 | 5.7 | 40.8 | 18.8 | 8.6 |
| | complex | 6.7 | 2.4 | 14.2 | 7.8 | 3.3 | 26.6 | 9.5 | 59.2 | 30.3 | 12.3 | 26.6 | 9.5 | 59.3 | 30.3 | 12.4 |
| | step | 7.4 | 2.7 | 14.9 | 9.1 | 3.4 | 31.0 | 12.1 | 64.8 | 38.7 | 14.4 | 31.0 | 12.1 | 64.8 | 38.7 | 14.4 |
| FB15k-237 | Llama2-7B | 1.9 | 0.8 | 6.8 | 2.8 | 0.7 | 7.5 | 3.5 | 27.3 | 11.6 | 2.7 | 7.5 | 3.5 | 27.3 | 11.6 | 2.7 |
| | complex | 2.7 | 1.4 | 9.8 | 4.6 | 1 | 10.8 | 5.8 | 39.6 | 18.7 | 3.9 | 10.8 | 5.8 | 39.6 | 18.7 | 3.9 |
| | step | 3.2 | 1.7 | 10.6 | 5.8 | 1.1 | 12.6 | 7.4 | 43.3 | 23.9 | 4.6 | 12.6 | 7.4 | 43.3 | 23.9 | 4.6 |
| NELL995 | Llama2-7B | 2.8 | 1.4 | 7.2 | 2.2 | 1.5 | 11.2 | 6 | 29.1 | 9.2 | 6.2 | 11.2 | 6 | 29.1 | 9.2 | 6.2 |
| | complex | 3.9 | 2.3 | 10.2 | 3.7 | 2.2 | 16.1 | 9.4 | 41.8 | 15.1 | 9 | 16.1 | 9.4 | 41.8 | 15.1 | 9 |
| | step | 4.6 | 2.8 | 11.1 | 4.7 | 2.7 | 18.5 | 12.0 | 46.0 | 19.3 | 10.9 | 18.5 | 12.0 | 46.0 | 19.3 | 10.9 |
Appendix D Algorithm
The LARK procedure is summarized in Algorithm 1.
Input: Logical query $q_{\tau}$ , Knowledge Graph $\mathcal{G}:E× R$ ;
Output: Answer entities $V_{\tau}$ ;
// Query Abstraction: Map entity and relations to IDs
$q_{\tau}=Abstract(q_{\tau});$
$\mathcal{G}=Abstract(\mathcal{G});$
// Neighborhood Retrieval
$\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])=\left\{(h,r,t)\right\}$ , using Eq. (7)
// Query Decomposition
$q^{d}_{\tau}=Decomp(q_{\tau});$
// Initialize Answer Cache
$ans=\{\};$
for $i∈ 1:length\left(q^{d}_{\tau}\right)$ do
  // Replace Answer Cache in Question
  $q^{d}_{\tau}[i]=replace(q^{d}_{\tau}[i],ans[i-1]);$
  $ans[i]=LLM\left(q^{d}_{\tau}[i]\right);$
end for
return $ans[length\left(q^{d}_{\tau}\right)]$
Algorithm 1 LARK Algorithm
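
The chaining loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code. The abstraction, neighborhood retrieval, and decomposition steps are assumed to have run already, so the function receives the context string and the list of decomposed sub-questions. The LLM is an arbitrary callable that returns a list of entity IDs. Cached answers are substituted for the `[PPk]` placeholders used in Table 6, which generalizes the `ans[i-1]` replacement to any earlier step. The names `fill_placeholders` and `lark_answer` are hypothetical.

```python
import re
from typing import Callable, Dict, List


def fill_placeholders(question: str, cache: Dict[int, List[str]]) -> str:
    """Replace each [PPk] with the cached answer list of sub-question k."""
    def substitute(match: "re.Match[str]") -> str:
        step = int(match.group(1))
        return "[" + ", ".join(cache[step]) + "]"
    return re.sub(r"\[PP(\d+)\]", substitute, question)


def lark_answer(context: str,
                sub_questions: List[str],
                llm: Callable[[str], List[str]]) -> List[str]:
    """Answer the sub-questions in order and return the last answer."""
    if not sub_questions:
        raise ValueError("at least one sub-question is required")
    cache: Dict[int, List[str]] = {}  # answer cache, keyed by 1-based step index
    for i, question in enumerate(sub_questions, start=1):
        # Insert earlier answers into the question, then query the LLM.
        prompt = context + "\n" + fill_placeholders(question, cache)
        cache[i] = llm(prompt)
    return cache[len(sub_questions)]
```

Passing the LLM as a callable keeps the loop independent of any particular model or API, so it can be exercised with a stub that returns canned answers.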
Table 9: Details of the token distribution for various query types in different datasets. The columns present the mean, median, minimum (Min), and maximum (Max) number of tokens in the queries of each query type. Column ‘Cov’ presents the percentage of queries (coverage) that contain fewer than 4096 tokens, which is the token limit of the Llama2 model.
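| Query type | Mean | Median | Min | Max | Cov | Mean | Median | Min | Max | Cov | Mean | Median | Min | Max | Cov |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1p | 70.2 | 61 | 58 | 10338 | 100 | 82.1 | 61 | 58 | 30326 | 99.9 | 81.7 | 61 | 58 | 30250 | 99.9 |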
| 2p | 331.2 | 106 | 86 | 27549 | 97.1 | 1420.9 | 140 | 83 | 130044 | 89.7 | 893.4 | 136 | 83 | 108950 | 90.9 |
| 3p | 785.2 | 165 | 103 | 80665 | 91 | 3579.8 | 329 | 103 | 208616 | 75.7 | 3052.6 | 389 | 100 | 164545 | 73.7 |
| 2i | 1136.7 | 276 | 119 | 20039 | 86.3 | 4482.8 | 636 | 119 | 60655 | 67.7 | 4469.3 | 680 | 119 | 54916 | 67.3 |
| 3i | 2575.4 | 860 | 145 | 29148 | 68.4 | 8760.2 | 2294 | 145 | 85326 | 48.3 | 8979.4 | 2856 | 145 | 76834 | 44.8 |
| ip | 1923.8 | 1235 | 135 | 21048 | 67.4 | 4035.8 | 2017 | 131 | 32795 | 50.5 | 4838 | 2676 | 131 | 33271 | 43.6 |
| pi | 1036.8 | 455 | 140 | 10937 | 85.8 | 1255.6 | 343 | 141 | 45769 | 83.4 | 1535.3 | 435 | 135 | 21125 | 79.9 |
| 2u | 1325.4 | 790 | 121 | 14703 | 80.8 | 2109.5 | 868 | 123 | 60655 | 68.9 | 2294.9 | 1138 | 125 | 23637 | 65.7 |
| up | 115.3 | 112 | 110 | 958 | 100 | 113.7 | 112 | 110 | 981 | 100 | 113.2 | 112 | 110 | 427 | 100 |
| 2in | 1169.1 | 548 | 123 | 18016 | 84.9 | 5264.7 | 1116 | 128 | 60281 | 61.8 | 3496 | 774 | 124 | 58032 | 71.6 |
| 3in | 4070.3 | 2230 | 159 | 28679 | 46.6 | 13695.8 | 8344 | 175 | 88561 | 25.9 | 12575.9 | 7061 | 164 | 88250 | 28.1 |
| inp | 629 | 112 | 110 | 73457 | 91.8 | 1949.4 | 394 | 110 | 115169 | 78.4 | 696.7 | 112 | 110 | 89660 | 93.8 |
| pin | 400.7 | 154 | 129 | 6802 | 95.8 | 1106.5 | 242 | 129 | 44010 | 87.2 | 418.1 | 131 | 129 | 24062 | 96.7 |
| pni | 345.9 | 129 | 127 | 7938 | 96.6 | 547.1 | 129 | 127 | 18057 | 95.1 | 289.3 | 129 | 127 | 17489 | 97.9 |
Appendix E Query Token Distribution in Datasets
The token lengths of the queries are summarized in Table 9, and their complete distributions are plotted in Figure 5. The distribution of token lengths is positively skewed for most query types, meaning that only a small number of samples have high token lengths. Thus, small improvements in the LLMs’ token limit can potentially lead to better coverage on most of the reasoning queries in standard KG datasets.
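The statistics reported in Table 9 can be reproduced from a list of per-query token counts as sketched below. The function name `token_stats` and the example counts are hypothetical. The default limit of 4096 and the definition of coverage (queries with fewer tokens than the limit) follow the table caption. A mean well above the median is a simple indicator of the positive skew noted above.

```python
from statistics import mean, median
from typing import Dict, Sequence


def token_stats(token_counts: Sequence[int], limit: int = 4096) -> Dict[str, float]:
    """Mean, median, min, max, and coverage (%) of queries below the token limit."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    covered = sum(1 for n in token_counts if n < limit)
    return {
        "mean": mean(token_counts),
        "median": median(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "cov": 100.0 * covered / len(token_counts),
    }


if __name__ == "__main__":
    # Hypothetical counts: most queries are short, one is very long.
    counts = [58, 61, 61, 100, 10338]
    stats = token_stats(counts)
    print(stats)
    print("positively skewed:", stats["mean"] > stats["median"])
```

Re-running the function with a larger `limit` shows how much coverage a longer context window would recover for a given query type.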
<details>
<summary>extracted/2305.01157v3/images/prob_dist.png Details</summary>

### Visual Description
## Chart Type: Probability Density Plots of Token Counts for Different Query Types
### Overview
The image presents a grid of 12 probability density plots. Each plot visualizes the distribution of the number of tokens for a specific query type across three datasets: NELL, FB15k, and FB15k-237. The x-axis represents the number of tokens, and the y-axis represents the probability density.
### Components/Axes
* **Title:** Each plot has a title indicating the "Query Type" (e.g., "Query Type=1p", "Query Type=2p", etc.).
* **X-axis:** Labeled "Number of Tokens". The scale varies across plots, ranging from 0 to different maximum values (e.g., 600, 1200, 3500, 25000, 10000).
* **Y-axis:** Labeled "Probability Density". The scale varies across plots, ranging from 0 to different maximum values (e.g., 0.012, 0.0035, 0.0010, 0.0004).
* **Legend (Key):** Located in the top-right corner of each plot. It identifies the three datasets:
* NELL (Blue)
* FB15k (Orange)
* FB15k-237 (Green)
### Detailed Analysis
**Plot 1: Query Type=1p**
* X-axis: 0 to 600 tokens
* Y-axis: 0 to 0.012 probability density
* NELL (Blue): Shows a sharp peak around 100 tokens, then rapidly decreases.
* FB15k (Orange): Low probability density, peaking around 200 tokens.
* FB15k-237 (Green): Low probability density, peaking around 200 tokens, similar to FB15k.
**Plot 2: Query Type=2p**
* X-axis: 0 to 1200 tokens
* Y-axis: 0 to 0.0035 probability density
* NELL (Blue): Peak around 200 tokens, then decreases.
* FB15k (Orange): Peak around 600 tokens.
* FB15k-237 (Green): Peak around 400 tokens.
**Plot 3: Query Type=3p**
* X-axis: 0 to 3500 tokens
* Y-axis: 0 to 0.0010 probability density
* NELL (Blue): Peak around 500 tokens, then decreases.
* FB15k (Orange): Peak around 1000 tokens.
* FB15k-237 (Green): Peak around 800 tokens.
**Plot 4: Query Type=2i**
* X-axis: 0 to 6000 tokens
* Y-axis: 0 to 0.0010 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 1000 tokens.
* FB15k-237 (Green): Peak around 500 tokens.
**Plot 5: Query Type=3i**
* X-axis: 0 to 25000 tokens
* Y-axis: 0 to 0.0004 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 5000 tokens.
* FB15k-237 (Green): Peak around 2500 tokens.
**Plot 6: Query Type=ip**
* X-axis: 0 to 25000 tokens
* Y-axis: 0 to 0.0004 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 5000 tokens.
* FB15k-237 (Green): Peak around 2500 tokens.
**Plot 7: Query Type=pi**
* X-axis: 0 to 4000 tokens
* Y-axis: 0 to 0.0008 probability density
* NELL (Blue): Peak around 200 tokens, then decreases.
* FB15k (Orange): Peak around 500 tokens.
* FB15k-237 (Green): Peak around 400 tokens.
**Plot 8: Query Type=2u**
* X-axis: 0 to 10000 tokens
* Y-axis: 0 to 0.0006 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 1000 tokens.
* FB15k-237 (Green): Peak around 500 tokens.
**Plot 9: Query Type=up**
* X-axis: 0 to 1000 tokens
* Y-axis: 0 to 0.200 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak near 0 tokens, then decreases.
* FB15k-237 (Green): Peak near 0 tokens, then decreases.
**Plot 10: Query Type=2in**
* X-axis: 0 to 7000 tokens
* Y-axis: 0 to 0.0008 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 1000 tokens.
* FB15k-237 (Green): Peak around 500 tokens.
**Plot 11: Query Type=3in**
* X-axis: 0 to 70000 tokens
* Y-axis: 0 to 0.00020 probability density
* NELL (Blue): Peak near 0 tokens, then decreases.
* FB15k (Orange): Peak around 10000 tokens.
* FB15k-237 (Green): Peak around 5000 tokens.
**Plot 12: Query Type=inp**
* X-axis: 0 to 1000 tokens
* Y-axis: 0 to 0.0012 probability density
* NELL (Blue): Peak around 200 tokens, then decreases.
* FB15k (Orange): Peak around 200 tokens.
* FB15k-237 (Green): Peak around 100 tokens.
### Key Observations
* The distribution of token counts varies significantly depending on the query type.
* NELL tends to have lower token counts compared to FB15k and FB15k-237.
* FB15k and FB15k-237 often have similar distributions, but FB15k tends to have slightly higher token counts.
* Some query types (e.g., "up") have very low token counts across all datasets.
* The x-axis scales vary widely, indicating that some query types involve much larger numbers of tokens than others.
### Interpretation
The plots illustrate the probability density of the number of tokens for different query types across three knowledge graph datasets. The data suggests that the complexity and structure of queries, as defined by their type, significantly influence the number of tokens required to represent them. The differences between the datasets (NELL, FB15k, FB15k-237) indicate variations in how these knowledge graphs are structured and queried. NELL generally uses fewer tokens, possibly indicating a simpler or more concise query structure compared to FB15k and FB15k-237. The specific query types (1p, 2p, 3p, etc.) likely correspond to different types of relationships or patterns being queried within the knowledge graphs. The plots can be used to understand the characteristics of different query types and how they relate to the structure of the underlying knowledge graphs. The "up" query type is a notable outlier, suggesting a very simple query structure with minimal token usage.
</details>
Figure 5: Probability distribution of the number of tokens in each query type. The figure contains 14 graphs for the 14 different query types. The x-axis and y-axis present the number of tokens in the query and their probability density, respectively.