# Complex Logical Reasoning over Knowledge Graphs using Large Language Models
**Authors**:
- Nurendra Choudhary (Department of Computer Science)
- Chandan K. Reddy (Department of Computer Science)
## Abstract
Reasoning over knowledge graphs (KGs) is a challenging task that requires a deep understanding of the complex relationships between entities and the underlying logic of their relations. Current approaches rely on learning geometries to embed entities in vector space for logical query operations, but they suffer from subpar performance on complex queries and dataset-specific representations. In this paper, we propose a novel decoupled approach, Language-guided Abstract Reasoning over Knowledge graphs (LARK), that formulates complex KG reasoning as a combination of contextual KG search and logical query reasoning, to leverage the strengths of graph extraction algorithms and large language models (LLMs), respectively. Our experiments demonstrate that the proposed approach outperforms state-of-the-art KG reasoning methods on standard benchmark datasets across several logical query constructs, with significant performance gains for queries of higher complexity. Furthermore, we show that the performance of our approach improves proportionally to the increase in size of the underlying LLM, enabling the integration of the latest advancements in LLMs for logical reasoning over KGs. Our work presents a new direction for addressing the challenges of complex KG reasoning and paves the way for future research in this area.
## 1 Introduction
Knowledge graphs (KGs) encode knowledge in a flexible triplet schema where two entity nodes are connected by relational edges. However, several real-world KGs, such as Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and NELL (Carlson et al., 2010), are often large-scale, noisy, and incomplete. Thus, reasoning over such KGs is a fundamental and challenging problem in AI research. The overarching goal of logical reasoning is to develop answering mechanisms for first-order logic (FOL) queries over KGs using the operators of existential quantification ( $∃$ ), conjunction ( $\wedge$ ), disjunction ( $\vee$ ), and negation ( $¬$ ). Current research on this topic primarily focuses on the creation of diverse latent space geometries, such as vectors (Hamilton et al., 2018), boxes (Ren et al., 2020), hyperboloids (Choudhary et al., 2021b), and probabilistic distributions (Ren & Leskovec, 2020), in order to effectively capture the semantic position and logical coverage of knowledge graph entities. Despite their success, these approaches are limited by the following: (i) Complex queries: they rely on constrained formulations of FOL queries that lose information on complex queries requiring chain reasoning (Choudhary et al., 2021a) and involving multiple relationships between entities in the KG; (ii) Generalizability: optimization for a particular KG may not generalize to other KGs, which limits the applicability of these approaches in real-world scenarios where KGs vary widely in structure and content; and (iii) Scalability: intensive training times limit the scalability of these approaches to larger KGs and the incorporation of new data into existing KGs. To address these limitations, we aim to leverage the reasoning abilities of large language models (LLMs) in a novel framework, shown in Figure 1, called Language-guided Abstract Reasoning over Knowledge graphs (LARK).
<details>
<summary>extracted/2305.01157v3/images/example_logical_query.png Details</summary>

### Visual Description
## Logical Diagram: Query for Non-European and Non-North American Nobel Prize Winners
### Overview
The image is a conceptual diagram illustrating the logical structure of a query. The query asks: "Who(X) are the Nobel Prize (N) winners (W) not from Europe (E) or North America (A)?" It visually maps the relationships between entities (regions, people) and logical operations (conjunction, disjunction, negation) required to answer this question. The diagram is not a data chart but a flowchart for a knowledge base query.
### Components/Axes
The diagram is composed of labeled nodes (circles) connected by directed arrows representing relationships. The elements are spatially arranged to show a logical flow from left to right.
**1. Header (Top):**
* **Title Text:** "Who(X) are the Nobel Prize (N) winners (W) not from Europe (E) or North America (A)?"
* **Formal Logic Expression:** `?X.∃X.names(X, W.∃W.[winners(W, NobelPrize) ∧ ∃T.[¬(citizen(T, E) ∨ citizen(T, A))]])`
* *Note: This is formal logical syntax, not a natural language.*
**2. Diagram Nodes & Labels (Left to Right, Top to Bottom):**
* **Blue Circle (Top-Left):** Label: "Nobel Prize"
* **Blue Circle (Middle-Left):** Label: "Europe"
* **Blue Circle (Bottom-Left):** Label: "North America"
* **Green Circle (Top-Right):** Label: "Nobel Prize Winners"
* **Green Circle (Middle, connected to Europe):** Label: "Europeans"
* **Green Circle (Bottom, connected to North America):** Label: "North Americans"
* **Purple Circle (Center):** Contains the logical symbol "∨" (OR).
* **Pink Circle (Center-Right):** Contains the logical symbol "¬" (NOT).
* **Yellow Circle (Right):** Label: "Non-Europeans and Non-North Americans"
* **Light Blue Circle (Far Right):** Contains the logical symbol "∧" (AND). An arrow from this node points right with the label "names".
**3. Relationship Arrows & Labels:**
* Arrow from "Nobel Prize" to "Nobel Prize Winners": Label: "winner"
* Arrow from "Europe" to "Europeans": Label: "citizen"
* Arrow from "North America" to "North Americans": Label: "citizen"
* Dashed green arrows from "Europeans" and "North Americans" point to the purple "∨" (OR) node.
* A dashed purple arrow points from the "∨" node to the pink "¬" (NOT) node.
* A solid red arrow points from the "¬" node to the yellow "Non-Europeans and Non-North Americans" node.
* A solid green arrow points from "Nobel Prize Winners" to the light blue "∧" (AND) node.
* A solid yellow arrow points from "Non-Europeans and Non-North Americans" to the same light blue "∧" (AND) node.
### Detailed Analysis
The diagram visually deconstructs the logical query into its constituent parts and operations:
1. **Entity Definition:** It defines base entities (Nobel Prize, Europe, North America) and derived groups (Nobel Prize Winners, Europeans, North Americans) via "winner" and "citizen" relationships.
2. **Logical Operations:**
* **Disjunction (∨):** The set of "Europeans" and "North Americans" are combined using the logical OR operation.
* **Negation (¬):** The result of the OR operation is negated, creating the set of individuals who are **not** (European OR North American). This is labeled "Non-Europeans and Non-North Americans".
* **Conjunction (∧):** The set of "Nobel Prize Winners" is intersected (AND) with the set of "Non-Europeans and Non-North Americans". The result of this conjunction is the answer set.
3. **Query Output:** The final arrow labeled "names" from the conjunction node indicates that the query seeks the names (X) of the individuals in this final intersected set.
### Key Observations
* **Color Coding:** Colors are used semantically: Blue for source regions, Green for derived population groups, Purple/Pink/Yellow for logical operations and results, Light Blue for the final conjunction.
* **Spatial Flow:** The layout guides the viewer from left (inputs) through center (processing/logic) to right (output), mirroring the logical evaluation order.
* **Formal vs. Visual:** The diagram provides a visual, intuitive counterpart to the dense formal logic expression at the top, making the query's structure easier to grasp.
### Interpretation
This diagram is a pedagogical or design tool for knowledge representation and query planning. It demonstrates how a complex natural language question can be systematically broken down into a sequence of set-theoretic and logical operations over a knowledge base.
* **What it demonstrates:** It shows the stepwise construction of a target set (X) by first defining prerequisite sets (Winners, Europeans, North Americans), applying logical filters (NOT (Europeans OR North Americans)), and finally intersecting the filtered set with the Winners set.
* **Relationships:** The arrows explicitly define the predicates (`winner`, `citizen`) that connect entities in the underlying knowledge graph. The logical nodes (∨, ¬, ∧) show how these sets are combined.
* **Notable Structure:** The use of the intermediate yellow node "Non-Europeans and Non-North Americans" is a key visual aid. It explicitly represents the result of the negation before the final conjunction, clarifying a step that is compressed in the formal logic string.
* **Purpose:** The diagram's purpose is to bridge the gap between human-readable questions and machine-executable logical queries, ensuring the query's semantics are correctly captured and visualized before implementation. It acts as a blueprint for querying a structured database or knowledge graph.
</details>
(a) Input logical query.
<details>
<summary>extracted/2305.01157v3/images/example_full_query.png Details</summary>

### Visual Description
## Diagram: Set Theory Problem Statement
### Overview
The image displays a rectangular box with a dashed black border containing a multi-line text problem statement. The text defines four sets (E, F, G, H) using relations between entities and then poses a question about the intersection of two of these sets. The text uses color and font styling to distinguish different components of the logical statement.
### Components/Axes
The image contains only text, structured as a single paragraph. There are no traditional chart axes, legends, or data points. The components are the textual elements themselves, which are styled as follows:
- **Set Labels (E, F, G, H):** Displayed in a bold, green font.
- **Entity Names (Nobel Prize, Europe, North America):** Displayed in a bold, blue, italic font.
- **Relation Names (winner, citizen):** Displayed in a bold, black, italic font.
- **Logical Operators (negation, union, intersection):** Displayed in bold font with distinct colors: "negation" in red, "union" in purple, and "intersection" in blue.
- **General Text:** The remaining connecting words and punctuation are in a standard black font.
### Content Details
The text is transcribed verbatim below. The language is English.
**Transcription:**
"Let **E** be the set of entities connected to *Nobel Prize* by relation *winner*, **F** is the set of entities connected to *Europe* by the relation *citizen*, and **G** is the set of entities connected to *North America* by the relation *citizen*. Let **H** be the set of entities connected to entities in the **negation** of **union** of **F** and **G**, then what are the entities in the **intersection** of **E** and **H**?"
**Logical Structure Breakdown:**
1. **Set E:** Entities with a "winner" relation to the entity "Nobel Prize". (Interpretation: Nobel Prize laureates).
2. **Set F:** Entities with a "citizen" relation to the entity "Europe". (Interpretation: Citizens of European countries).
3. **Set G:** Entities with a "citizen" relation to the entity "North America". (Interpretation: Citizens of North American countries).
4. **Set H:** Entities connected to entities that are *not* in the union of F and G.
* **Union of F and G (F ∪ G):** The set of entities that are citizens of Europe *or* North America (or both).
* **Negation of (F ∪ G):** The complement set. Entities that are *not* citizens of Europe *and* *not* citizens of North America.
* **Set H:** Entities connected to any entity in this complement set. The specific connecting relation is not defined in the text.
5. **Final Question:** Identify the entities in the **intersection of E and H (E ∩ H)**. These are entities that are both in set E (Nobel Prize winners) *and* in set H (connected to non-European, non-North American citizens).
### Key Observations
* The problem is a formal logical or set-theoretic puzzle, likely from the domain of knowledge graphs, semantic networks, or logic.
* The use of color and font styling is systematic and serves to visually parse the complex statement into its constituent parts: sets (green), entities (blue italics), relations (black italics), and logical operators (colored bold).
* The definition of set H is ambiguous because the nature of the "connection" is not specified. It could be any relation (e.g., "collaborator", "relative", "employer").
* The problem is abstract. It does not provide a specific database or list of entities; it asks for a description of the resulting set based on the given definitions.
### Interpretation
This image presents a logical query on a hypothetical knowledge base. The data suggests a scenario where one might want to find a specific subset of Nobel laureates.
* **What the data demonstrates:** It defines a method to filter entities (potentially people) based on a combination of their achievements (winning a Nobel Prize) and their geopolitical associations (via citizenship of others).
* **How elements relate:** The sets are built hierarchically. E, F, and G are directly defined. H is derived from the complement of the union of F and G. The final answer (E ∩ H) is the intersection of a directly defined set (E) and a derived set (H).
* **Notable implications:** The query `E ∩ H` would identify Nobel Prize winners who are connected in some way to individuals who are *not* citizens of Europe or North America. Without knowing the specific "connection" relation for H, the precise meaning is open-ended. It could be used to find laureates with international collaborations, familial ties, or institutional affiliations outside the Western world. The problem tests the ability to parse and combine set operations (union, complement, intersection) applied to relation-based definitions.
</details>
(b) Query prompt.
<details>
<summary>extracted/2305.01157v3/images/example_decomp_query.png Details</summary>

### Visual Description
## Textual Query List: Entity-Relationship and Set Operations
### Overview
The image displays a vertical list of six distinct questions, each enclosed within a rectangular box defined by a dashed black border. The questions are presented in English and appear to be structured queries, likely for a knowledge graph, database, or logical reasoning system. Key terms within each question are highlighted using specific colors and text styles to denote their semantic role.
### Components/Axes
The image is composed solely of text elements arranged in a stacked, list-like format. There are no traditional chart axes, legends, or data points. The primary structural and semantic components are:
* **Container Boxes:** Six separate boxes with dashed black outlines, stacked vertically.
* **Text Content:** Each box contains one complete question in English.
* **Stylistic Highlighting:** Specific words or phrases within the questions are formatted with distinct colors and font styles (bold, italics) to categorize them.
### Detailed Analysis
The following is a precise transcription of the text within each box, from top to bottom. The color and style of highlighted text are noted in brackets.
**Box 1 (Top):**
* **Text:** "What are the entities connected to **Nobel Prize** by relation *winner*?"
* **Highlighting:** "Nobel Prize" is in blue. "winner" is in bold italics.
**Box 2:**
* **Text:** "What are the entities connected to **Europe** by relation *citizen*?"
* **Highlighting:** "Europe" is in blue. "citizen" is in bold italics.
**Box 3:**
* **Text:** "What are the entities connected to **North America** by relation *citizen*?"
* **Highlighting:** "North America" is in blue. "citizen" is in bold italics.
**Box 4:**
* **Text:** "What are the entities in the **union** of A2 and A3?"
* **Highlighting:** "union" is in purple. "A2" and "A3" are in orange.
**Box 5:**
* **Text:** "Which entities **do not belong** to the entity set A4?"
* **Highlighting:** "do not belong" is in red. "A4" is in orange.
**Box 6 (Bottom):**
* **Text:** "What are the entities in the **intersection** of A1 and A5?"
* **Highlighting:** "intersection" is in blue. "A1" and "A5" are in orange.
### Key Observations
1. **Query Pattern:** The first three questions follow an identical syntactic pattern: "What are the entities connected to [Entity] by relation [Relation]?" This suggests a template for querying direct relationships in a knowledge graph.
2. **Color-Coded Semantics:** The highlighting scheme appears consistent and meaningful:
* **Blue:** Used for named entities (Nobel Prize, Europe, North America) and the set operation "intersection".
* **Bold Italics:** Used exclusively for relationship types (winner, citizen).
* **Purple:** Used for the set operation "union".
* **Orange:** Used for identifiers of entity sets (A1, A2, A3, A4, A5).
* **Red:** Used for the negation concept "do not belong".
3. **Set Theory Operations:** The last three questions shift from relational queries to operations on abstract entity sets (A1-A5), involving union, negation (complement), and intersection.
4. **Spatial Layout:** The questions are presented in a clear, top-to-bottom sequence without overlap. The dashed borders create a clean separation between distinct queries.
### Interpretation
This image is not a data visualization but a **structured query template or test set**. It demonstrates two fundamental types of information retrieval operations:
1. **Relational Lookup (Boxes 1-3):** These questions are designed to retrieve entities based on a specific, named relationship to a given anchor entity. This is a core operation in knowledge graphs and semantic networks. The use of "citizen" for geographical entities (Europe, North America) and "winner" for an award (Nobel Prize) illustrates how relations define connections between different types of nodes.
2. **Set-Theoretic Operations (Boxes 4-6):** These questions operate on pre-defined collections of entities (sets A1-A5). They test the ability to compute:
* **Union:** Combining all elements from two sets.
* **Negation/Complement:** Identifying elements outside a specified set.
* **Intersection:** Finding common elements between two sets.
The color-coding is a critical visual aid that reinforces the grammatical and logical roles of the terms, making the query structure immediately apparent. The progression from concrete relational queries to abstract set operations suggests this could be part of an educational exercise, a benchmark for a reasoning system, or a specification for a query language interface. The absence of specific answers or data indicates the image's purpose is to define the *questions themselves*, not to present results.
</details>
(c) Decomposed prompt.
<details>
<summary>extracted/2305.01157v3/images/example_answers.png Details</summary>

### Visual Description
## Diagram: Hierarchical Name Sets with Final Answer
### Overview
The image displays a vertically structured diagram composed of seven rectangular boxes with dashed borders, each containing a labeled set of names. The boxes are arranged in a single column. The sixth box from the top is visually distinct, highlighted with an orange background and labeled "Final Answer." The diagram appears to represent a process of filtering, selection, or aggregation of names from initial sets (A1-A5) leading to a final result (A6).
### Components/Axes
The diagram has no traditional chart axes. Its components are:
* **Labeled Boxes:** Six boxes labeled A1 through A6.
* **"Final Answer" Box:** A highlighted box positioned between A5 and A6.
* **Content:** Each box contains a set notation listing proper names, followed by an ellipsis ("..."), indicating the sets are incomplete or truncated in this view.
* **Spatial Layout:** The boxes are stacked vertically. A1 is at the top, followed sequentially by A2, A3, A4, and A5. The "Final Answer" box is centered below A5. A6 is at the bottom, below the "Final Answer" box.
### Detailed Analysis
**Transcription of Textual Content:**
* **Box A1 (Top):**
* Label: `A1`
* Content: `{Rabindranath Tagore, Theodore Roosevelt, Wolfgang Pauli, ... }`
* **Box A2:**
* Label: `A2`
* Content: `{Wolfgang Pauli, Adolf von Baeyer, ...}`
* **Box A3:**
* Label: `A3`
* Content: `{Theodore Roosevelt, Jimmy Carter, Albert Einstein,...}`
* **Box A4:**
* Label: `A4`
* Content: `{Wolfgang Pauli, Theodore Roosevelt, ...}`
* **Box A5:**
* Label: `A5`
* Content: `{Rabindranath Tagore, ...}`
* **"Final Answer" Box:**
* Label: `Final Answer` (centered, bold text on an orange background)
* Content: (This box contains only the label, acting as a header for the box below it.)
* **Box A6 (Bottom):**
* Label: `A6`
* Content: `{Rabindranath Tagore, ...}`
### Key Observations
1. **Name Recurrence:** Several names appear across multiple sets:
* **Wolfgang Pauli:** Appears in A1, A2, and A4.
* **Theodore Roosevelt:** Appears in A1, A3, and A4.
* **Rabindranath Tagore:** Appears in A1, A5, and is the sole listed name in A6.
2. **Set Reduction:** The sets appear to become less populated or more specific as they progress downward, culminating in A6 which contains only one named individual.
3. **Visual Hierarchy:** The "Final Answer" box creates a clear visual break, suggesting that A6 is the output or conclusion derived from the preceding sets (A1-A5).
4. **Ellipses:** The consistent use of "..." in every set (A1-A5 and A6) explicitly indicates that the lists shown are not exhaustive; there are additional members in each set that are not displayed in this diagram.
### Interpretation
This diagram likely illustrates a logical or computational process, such as a **set operation, a filtering algorithm, or a decision tree**. The flow is top-down: initial candidate sets (A1-A5) are provided, and through some unstated process (perhaps intersection, union, or a rule-based filter), a final, narrowed-down result is produced in A6.
The key investigative insight is the **convergence on Rabindranath Tagore**. While other notable figures like Wolfgang Pauli and Theodore Roosevelt appear frequently in the intermediate sets, only Tagore persists into the final answer set (A6). This suggests the underlying process is designed to identify an individual who meets specific criteria present across the initial sets—for example, perhaps a person who is a Nobel laureate in Literature (Tagore, 1913) and also appears in other curated lists. The ellipses are crucial; they imply the full data contains more names, and the diagram is a simplified view of a more complex selection mechanism. The "Final Answer" is not necessarily the only member of the ultimate set, but the one highlighted by this specific process.
</details>
(d) LLM answers.
Figure 1: Example of LARK’s query chain decomposition and logically-ordered LLM answering for effective performance. LLMs are more adept at answering simple queries, and hence, we decompose the multi-operation complex logical query (a,b) into elementary queries with single operation (c) and then use a sequential LLM-based answering method to output the final answer (d).
In LARK, we utilize the logical queries to search for relevant subgraph contexts over knowledge graphs and perform chain reasoning over these contexts using logically-decomposed LLM prompts. To achieve this, we first abstract out the logical information from both the input query and the KG. Given the invariant nature of logic (logical queries follow the same set of rules and procedures irrespective of the KG context), this enables our method to focus on the logical formulation, avoid model hallucination (the model ignores semantic common-sense knowledge and infers only from the KG entities for answers), and generalize over different knowledge graphs. From this abstract KG, we extract relevant subgraphs using the entities and relations present in the logical query. These subgraphs serve as context prompts for input to LLMs. In the next phase, we need to effectively handle complex reasoning queries. From previous works (Zhou et al., 2023; Khot et al., 2023), we realize that LLMs are significantly less effective on complex prompts than on a sequence of simpler prompts. Thus, to simplify the query, we exploit its logical nature and deterministically decompose the multi-operation query into logically-ordered elementary queries, each containing a single operation (depicted in the transition from Figure 1(b) to 1(c)). Each of these decomposed logical queries is then converted to a prompt and processed through the LLM to generate the final set of answers (shown in Figure 1(d)). The logical queries are handled sequentially, and if query $y$ depends on query $x$ , then $x$ is scheduled before $y$ . Operations are scheduled in a logically-ordered manner to enable batching different logical queries together, and answers are stored in caches for easy access.
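The scheduling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy KG, the step identifiers (A1–A6, mirroring Figure 1(c)), and the `toy_answer` function standing in for an elementary-query LLM call are all assumptions made for this example.

```python
# Illustrative sketch of logically-ordered answering of decomposed queries,
# with preceding answers cached and reused by successive queries.
ENTITIES = {"Tagore", "Pauli", "Roosevelt", "Carter", "Baeyer"}
KG = {  # (anchor entity, relation) -> entities reachable in one hop
    ("NobelPrize", "winner"): {"Tagore", "Pauli", "Roosevelt"},
    ("Europe", "citizen"): {"Pauli", "Baeyer"},
    ("NorthAmerica", "citizen"): {"Roosevelt", "Carter"},
}

def toy_answer(op, operands):
    """Stand-in for one single-operation LLM call."""
    if op == "project":      # entities connected to an anchor by a relation
        return KG[operands]
    if op == "union":
        return operands[0] | operands[1]
    if op == "negate":
        return ENTITIES - operands[0]
    return operands[0] & operands[1]   # "intersect"

# Elementary queries in logical order: every step's dependencies
# (the step IDs it references) appear earlier in the list.
steps = [
    ("A1", "project", ("NobelPrize", "winner")),
    ("A2", "project", ("Europe", "citizen")),
    ("A3", "project", ("NorthAmerica", "citizen")),
    ("A4", "union", ("A2", "A3")),
    ("A5", "negate", ("A4",)),
    ("A6", "intersect", ("A1", "A5")),
]

answers = {}  # cache of preceding answers, consulted by later queries
for step_id, op, args in steps:
    operands = args if op == "project" else tuple(answers[a] for a in args)
    answers[step_id] = toy_answer(op, operands)

print(answers["A6"])  # final answer set, here {'Tagore'}
```

Because the steps are already topologically ordered by dependency, a single pass suffices; in practice, independent steps (A1–A3 here) could also be batched together, as the text notes.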
The proposed approach effectively integrates logical reasoning over knowledge graphs with the capabilities of LLMs, and to the best of our knowledge, is the first of its kind. Unlike previous approaches that rely on constrained formulations of first-order logic (FOL) queries, our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Our KG search model is inspired by retrieval-augmented techniques (Chen et al., 2022) but realizes the deterministic nature of knowledge graphs to simplify the retrieval of relevant subgraphs. Moreover, compared to other prompting methods (Wei et al., 2022; Zhou et al., 2023; Khot et al., 2023), our chain decomposition technique enhances the reasoning capabilities in knowledge graphs by leveraging the underlying chain of logical operations in complex queries, and by utilizing preceding answers amidst successive queries in a logically-ordered manner. To summarize, the primary contributions of this paper are as follows:
1. We propose, Language-guided Abstract Reasoning over Knowledge graphs (LARK), a novel model that utilizes the reasoning abilities of large language models to efficiently answer FOL queries over knowledge graphs.
1. Our model uses entities and relations in queries to find pertinent subgraph contexts within abstract knowledge graphs, and then, performs chain reasoning over these contexts using LLM prompts of decomposed logical queries.
1. Our experiments on logical reasoning across standard KG datasets demonstrate that LARK outperforms the previous state-of-the-art approaches by 35% MRR on 14 FOL query types based on the operations of projection (p), intersection ( $\wedge$ ), union ( $\vee$ ), and negation ( $¬$ ).
1. We establish the advantages of chain decomposition by showing that LARK performs 20% better on decomposed logical queries when compared to complex queries on the task of logical reasoning. Additionally, our analysis of LLMs shows the significant contribution of increasing scale and better design of underlying LLMs to the performance of LARK.
## 2 Related Work
Our work is at the intersection of two topics, namely, logical reasoning over knowledge graphs and reasoning prompt techniques in LLMs.
Logical Reasoning over KGs: Initial approaches in this area (Bordes et al., 2013; Nickel et al., 2011; Das et al., 2017; Hamilton et al., 2018) focused on capturing the semantic information of entities and the relational operations involved in the projection between them. However, further research in the area revealed a need for new geometries to encode the spatial and hierarchical information present in the knowledge graphs. To tackle this issue, models such as Query2Box (Ren et al., 2020), HypE (Choudhary et al., 2021b), PERM (Choudhary et al., 2021a), and BetaE (Ren & Leskovec, 2020) encoded the entities and relations as boxes, hyperboloids, Gaussian distributions, and beta distributions, respectively. Additionally, approaches such as CQD (Arakelyan et al., 2021) have focused on improving the performance of complex reasoning tasks through the answer composition of simple intermediate queries. In another line of research, HamQA (Dong et al., 2023) and QA-GNN (Yasunaga et al., 2021) have developed question-answering techniques that use knowledge graph neighborhoods to enhance the overall performance. We notice that previous approaches in this area have focused on enhancing KG representations for logical reasoning. Contrary to these existing methods, our work provides a systematic framework that leverages the reasoning ability of LLMs and tailors them toward the problem of logical reasoning over knowledge graphs.
Reasoning prompts in LLMs: Recent studies have shown that LLMs can learn various NLP tasks with just context prompts (Brown et al., 2020). Furthermore, LLMs have been successfully applied to multi-step reasoning tasks by providing intermediate reasoning steps, also known as Chain-of-Thought (Wei et al., 2022; Chowdhery et al., 2022), needed to arrive at an answer. Alternatively, certain studies have composed multiple LLMs or LLMs with symbolic functions to perform multi-step reasoning (Jung et al., 2022; Creswell et al., 2023), with a pre-defined decomposition structure. More recent studies such as least-to-most (Zhou et al., 2023), successive (Dua et al., 2022) and decomposed (Khot et al., 2023) prompting strategies divide a complex prompt into sub-prompts and answer them sequentially for effective performance. While this line of work is close to our approach, they do not utilize previous answers to inform successive queries. LARK is unique due to its ability to utilize logical structure in the chain decomposition mechanism, augmentation of retrieved knowledge graph neighborhood, and multi-phase answering structure that incorporates preceding LLM answers amidst successive queries.
## 3 Methodology
In this section, we describe the problem setup of logical reasoning over knowledge graphs and the various components of our model.
### 3.1 Problem Formulation
In this work, we tackle the problem of logical reasoning over knowledge graphs (KGs) $G:E× R$ that store entities ( $E$ ) and relations ( $R$ ). Without loss of generality, KGs can also be organized as a set of triplets $⟨ e_1,r,e_2⟩⊆G$ , where each relation $r∈ R$ is a Boolean function $r:E× E→\{True,False\}$ that indicates whether the relation $r$ exists between the pair of entities $(e_1,e_2)∈ E× E$ . We consider four fundamental first-order logic (FOL) operations: projection (p), intersection ( $\wedge$ ), union ( $\vee$ ), and negation ( $¬$ ) to query the KG. These operations are defined as follows:
$$
\begin{aligned}
q_p[Q_p] &\triangleq\ ?V_p:\{v_1,v_2,\ldots,v_k\}\subseteq E~\exists~a_1\\
q_\wedge[Q_\wedge] &\triangleq\ ?V_\wedge:\{v_1,v_2,\ldots,v_k\}\subseteq E~\exists~a_1\wedge a_2\wedge\ldots\wedge a_i\\
q_\vee[Q_\vee] &\triangleq\ ?V_\vee:\{v_1,v_2,\ldots,v_k\}\subseteq E~\exists~a_1\vee a_2\vee\ldots\vee a_i\\
q_\neg[Q_\neg] &\triangleq\ ?V_\neg:\{v_1,v_2,\ldots,v_k\}\subseteq E~\exists~\neg a_1\\
\text{where } Q_p,Q_\neg &=(e_1,r_1);\quad Q_\wedge,Q_\vee=\{(e_1,r_1),(e_2,r_2),\ldots,(e_i,r_i)\};\quad\text{and } a_i=r_i(e_i,v_i)
\end{aligned}\tag{1}
$$
where $q_p,q_\wedge,q_\vee$ , and $q_¬$ are projection, intersection, union, and negation queries, respectively; and $V_p,V_\wedge,V_\vee$ , and $V_¬$ are the corresponding results of those queries (Arakelyan et al., 2021; Choudhary et al., 2021a). $a_i$ is a Boolean indicator that is True if $e_i$ is connected to $v_i$ by relation $r_i$ , and False otherwise. The goal of logical reasoning is to formulate the operations such that for a given query $q_τ$ of query type $τ$ with inputs $Q_τ$ , we are able to efficiently retrieve $V_τ$ from the entity set $E$ , e.g., for a projection query $q_p[(\text{Nobel Prize}, \text{winner})]$ , we want to retrieve $V_p=\{\text{Nobel Prize winners}\}⊆ E$ .
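The four operations of Eq. (1) can be made concrete with a short sketch over a triplet store. The tiny triplet set below is an illustrative assumption, not a dataset from the paper; the functions follow the definitions directly, with each relation treated as a Boolean function over entity pairs.

```python
# Sketch of the four FOL operations from Eq. (1) over a KG stored as
# triplets <e1, r, e2> (the toy triplets are assumptions for illustration).
TRIPLETS = {
    ("NobelPrize", "winner", "Tagore"),
    ("NobelPrize", "winner", "Pauli"),
    ("Europe", "citizen", "Pauli"),
    ("NorthAmerica", "citizen", "Roosevelt"),
}
E = {x for (e1, _, e2) in TRIPLETS for x in (e1, e2)}  # entity set

def r(rel, e1, e2):
    """Relation as a Boolean function r: E x E -> {True, False}."""
    return (e1, rel, e2) in TRIPLETS

def q_p(e, rel):
    """Projection: all v in E with r(e, v) = True."""
    return {v for v in E if r(rel, e, v)}

def q_and(pairs):
    """Intersection: entities satisfying every (entity, relation) in Q."""
    return set.intersection(*(q_p(e, rel) for e, rel in pairs))

def q_or(pairs):
    """Union: entities satisfying at least one (entity, relation) in Q."""
    return set.union(*(q_p(e, rel) for e, rel in pairs))

def q_not(e, rel):
    """Negation: entities NOT connected to e by rel."""
    return E - q_p(e, rel)
```

For instance, `q_and([("NobelPrize", "winner"), ("Europe", "citizen")])` returns `{'Pauli'}` on this toy graph: the only entity that is both a Nobel Prize winner and a European citizen.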
In conventional methods for logical reasoning, query operations are typically expressed through a geometric function; for example, the intersection of queries is represented as an intersection of box representations in Query2Box (Ren et al., 2020). In our proposed approach, LARK, we instead leverage the advanced reasoning capabilities of large language models (LLMs) and prioritize efficient decomposition of logical chains within the query to enhance performance. This strategy seeks to overcome the limitations of traditional methods by harnessing the power of LLMs for reasoning over KGs.
### 3.2 Neighborhood Retrieval and Logical Chain Decomposition
The foundation of LARK’s reasoning capability is built on large language models. Nevertheless, the limited input length of LLMs restricts their ability to process the entirety of a knowledge graph. Furthermore, while the set of entities and relations within a knowledge graph is unique, the reasoning behind logical operations remains universal. Therefore, we specifically tailor the LLM prompts to account for the above distinctive characteristics of logical reasoning over knowledge graphs. To address this need, we adopt a two-step process:
1. Query Abstraction: In order to make the process of logical reasoning over knowledge graphs more generalizable to different datasets, we propose to replace all the entities and relations in the knowledge graph and queries with unique IDs. This approach offers three significant advantages. Firstly, it reduces the number of tokens in the query, leading to improved LLM efficiency. Secondly, it allows us to solely utilize the reasoning ability of the language model, without relying on any external common sense knowledge of the underlying LLM. By avoiding the use of common sense knowledge, our approach mitigates the potential for model hallucination (which may lead to the generation of answers that are not supported by the KG). Finally, it removes any KG-specific information, thereby ensuring that the process remains generalizable to different datasets. While this may intuitively seem to result in a loss of information, our empirical findings, presented in Section 4.4, indicate that the impact on the overall performance is negligible.
1. Neighborhood Retrieval: To effectively answer logical queries, the LLM does not need access to the entire knowledge graph; instead, the relevant neighborhoods containing the answers can be identified. Previous approaches (Guu et al., 2020; Chen et al., 2022) have focused on semantic retrieval for web documents. However, we note that logical queries are deterministic in nature, and thus we perform a $k$ -level depth-first traversal over the entities and relations present in the query, where $k$ is determined by the query type (e.g., for 3-level projection ( $3p$ ) queries, $k=3$ ). Let $E^1_τ$ and $R^1_τ$ denote the set of entities and relations in query $Q_τ$ for a query type $τ$ , respectively. Then, the $k$ -level neighborhood of query $q_τ$ , denoted $N_k(q_τ[Q_τ])$ , is defined as:
$$
\begin{aligned}
N_1(q_\tau[Q_\tau]) &= \left\{(h, r, t) : \left(h \in E^1_\tau\right), \left(r \in R^1_\tau\right), \left(t \in E^1_\tau\right)\right\}\\
E^k_\tau &= \left\{h, t : (h, r, t) \in N_{k-1}(q_\tau[Q_\tau])\right\}, \quad R^k_\tau = \left\{r : (h, r, t) \in N_{k-1}(q_\tau[Q_\tau])\right\}\\
N_k(q_\tau[Q_\tau]) &= \left\{(h, r, t) : \left(h \in E^k_\tau\right), \left(r \in R^k_\tau\right), \left(t \in E^k_\tau\right)\right\}
\end{aligned} \tag{5}
$$
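The abstraction and retrieval steps can be sketched as follows. Note that the triplet-membership test below uses a permissive (disjunctive) reading of Eq. (5), including a triplet whenever its head or tail lies in the current entity frontier; this reading, like every name in the sketch, is an assumption rather than the paper's exact implementation:

```python
def abstract_kg(triplets):
    """Query abstraction: replace entity and relation names with opaque IDs
    so the LLM cannot fall back on common-sense knowledge."""
    ents, rels = {}, {}

    def eid(e):
        return ents.setdefault(e, f"e{len(ents)}")

    def rid(r):
        return rels.setdefault(r, f"r{len(rels)}")

    abstracted = {(eid(h), rid(r), eid(t)) for (h, r, t) in triplets}
    return abstracted, ents, rels

def neighborhood(triplets, query_entities, k):
    """k-level neighborhood retrieval in the spirit of Eq. (5): expand the
    entity frontier k times, collecting triplets that touch it."""
    frontier = set(query_entities)
    collected = set()
    for _ in range(k):
        collected = {(h, r, t) for (h, r, t) in triplets
                     if h in frontier or t in frontier}
        frontier |= {h for (h, _, _) in collected}
        frontier |= {t for (_, _, t) in collected}
    return collected
```

On a chain `a -r1-> b -r2-> c -r3-> d`, calling `neighborhood(..., {"a"}, k)` returns one, two, or three triplets for `k` = 1, 2, 3, matching the depth-limited traversal described above.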
We have taken steps to make our approach more generalizable and efficient by abstracting the query and limiting input context for LLMs. However, the complexity of a query still remains a concern. The complexity of a query type $τ$ , denoted by $O(q_τ)$ , is determined by the number of entities and relations it involves, i.e., $O(q_τ)∝|E_τ|+|R_τ|$ . In other words, the size of the query in terms of its constituent elements is a key factor in determining its computational complexity. This observation is particularly relevant in the context of LLMs, as previous studies have shown that their performance tends to decrease as the complexity of the queries they handle increases (Khot et al., 2023). To address this, we propose a logical query chain decomposition mechanism in LARK which reduces a complex multi-operation query to multiple single-operation queries. Due to the exhaustive set of operations, we apply the following strategy for decomposing the various query types:
- Reduce a $k$ -level projection query to $k$ one-level projection queries, e.g., a $3p$ query with one entity and three relations $e_1\xrightarrow{r_1}\xrightarrow{r_2}\xrightarrow{r_3}A$ is decomposed to $e_1\xrightarrow{r_1}A_1,A_1\xrightarrow{r_2}A_2,A_2\xrightarrow{ r_3}A$ .
- Reduce a $k$ -intersection query to $k$ projection queries and an intersection query, e.g., a $3i$ query with intersection of three projection queries $(e_1\xrightarrow{r_1})\wedge(e_2\xrightarrow{r_2})\wedge(e_3 \xrightarrow{r_3})=A$ is decomposed to $e_1\xrightarrow{r_1}A_1,e_2\xrightarrow{r_2}A_2,e_3\xrightarrow{ r_3}A_3,A_1\wedge A_2\wedge A_3=A$ . Similarly, reduce a $k$ -union query to $k$ projection queries and a union query.
The complete decomposition of the exhaustive set of query types used in previous work (Ren & Leskovec, 2020) and our empirical studies can be found in Appendix A.
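Assuming a query is given as an anchor entity plus a relation list (projection) or a list of (entity, relation) pairs (intersection/union), the two decomposition rules above can be sketched as follows; the helper names and the `A1, A2, ...` placeholder convention are illustrative:

```python
def decompose_projection(entity, relations):
    """k-level projection -> k one-level projections:
    e1 -r1-> A1, A1 -r2-> A2, ..., A_{k-1} -rk-> A_k."""
    steps, src = [], entity
    for i, r in enumerate(relations, 1):
        steps.append((src, r, f"A{i}"))  # "which entities connect to src via r?"
        src = f"A{i}"                    # next hop starts from the answer placeholder
    return steps

def decompose_intersection(pairs):
    """k-intersection -> k one-level projections plus a single
    intersection over the placeholder answers A1..Ak."""
    steps = [(e, r, f"A{i}") for i, (e, r) in enumerate(pairs, 1)]
    placeholders = [f"A{i}" for i in range(1, len(pairs) + 1)]
    return steps, placeholders
```

For a $3p$ query, `decompose_projection("e1", ["r1", "r2", "r3"])` yields the three single-operation queries `(e1, r1, A1)`, `(A1, r2, A2)`, `(A2, r3, A3)` described above; a $k$-union works like `decompose_intersection` with the final operation swapped for a union.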
<details>
<summary>extracted/2305.01157v3/images/model.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Query Processing System
### Overview
This image is a technical flowchart illustrating a system that processes natural language logical queries by decomposing them into sub-queries, retrieving relevant information from a knowledge graph, and using a Large Language Model (LLM) to generate a final answer. The diagram shows a multi-stage pipeline from query input to answer output.
### Components/Axes
The diagram is organized into several interconnected blocks, flowing generally from left to right.
**1. Query Type (Top-Left, Blue Box):**
* **Label:** "Query Type"
* **Content:** A grid of 12 small graph pattern icons, each labeled with a code:
* Row 1: `p`, `2p`, `3p`
* Row 2: `2i`, `3i`, `2u`
* Row 3: `ip`, `pi`, `up`
* Row 4: `inp`, `pin`, `pni`
* **Function:** Represents different types of logical query structures (e.g., projection, intersection, union).
**2. Logical Query (Bottom-Left, Orange Box):**
* **Label:** "Logical Query"
* **Content:** An example query and its graph representation.
* **Text:** "Name the Asian Nobel Prize Winners?"
* **Graph:** Two parallel chains converging with an intersection symbol (∧).
* Chain 1: Node "Nobel Prize" --[winner]--> Node "Nobel Prize Winners"
* Chain 2: Node "Asian" --[citizen]--> Node "Asians"
* **Label:** "2i" (indicating a 2-intersection query type).
**3. Knowledge Graph (Top-Center, Purple Cylinder):**
* **Label:** "Knowledge Graph"
* **Function:** The data source containing entities and relations.
**4. Query Processing Pipeline (Center):**
* **Query Type Box:** Receives input from the "Query Type" block.
* **Entities and Relations Box:** Receives input from the "Logical Query" block.
* **Neighborhood Retrieval:** An arrow labeled "Neighborhood Retrieval" points from the above two boxes to the "Relevant Subgraphs" block.
* **Relevant Subgraphs (Icon):** Depicts multiple small graph fragments.
* **Prompt Template (Pink Diamond):** Connects to the "Relevant Subgraphs" and feeds into the prompt construction stage.
**5. Prompt Construction & LLM Processing (Right Side):**
* **Context Prompt (Yellow Box):**
* **Text:** "Given the following (h,r,t) triplets: (e₁,r₁,t₁), (e₂,r₂,t₁), (e₂,r₂,t₂), (e₁,r₁,t₂), (e₁,r₁,t₃), (e₂,r₂,t₄), (e₂,r₂,t₅), (e₁,r₁,t₆), ..."
* **Decomposed Question Prompts (Yellow Box):** Contains three dashed-line sub-prompts:
1. "What are the entities connected to **e₁** by relation **r₁**?"
2. "What are the entities connected to **e₂** by relation **r₁**?"
3. "What are the entities in the **intersection** of **P1** and **P2**?"
* **LLM (Green Vertical Box):** The central processing unit. Arrows from the "Context Prompt" and "Decomposed Question Prompts" feed into it. An orange arrow labeled "Add context to question" also points into the LLM.
* **Logically-ordered Answers (Green Box):** Output from the LLM.
* `A1 = P1 = {t₁, t₂, t₃, t₆, ...}`
* `A2 = P2 = {t₁, t₂, t₄, t₅, ...}`
* `A3 = P3 = {t₁, t₂, ...}` (This is the intersection result).
**6. Final Output (Far Right, Orange Box):**
* **Label:** "Final Answer"
* **Content:** "Malala Yousafzai, Rabindranath Tagore, ..."
* **Process:** An arrow labeled "Replace entity ids" points from the "Logically-ordered Answers" (specifically A3) to this final box.
**7. Logical Chain Decomposition (Center-Bottom, Orange Box):**
* **Label:** "Logical Chain Decomposition"
* **Content:** Shows the decomposition of the example "2i" query.
* **Input:** A graph with two entities (e₁, e₂) connected via relations (r₁, r₂) to two predicates (P1, P2), which are then intersected (∧).
* **Output:** Three yellow boxes showing the decomposed sub-queries:
1. e₁ --[r₁]--> P1
2. e₂ --[r₂]--> P2
3. P1 ∧ P2 (Intersection of P1 and P2)
### Detailed Analysis
The diagram details a specific workflow for the example query "Name the Asian Nobel Prize Winners?":
1. **Query Abstraction:** The natural language query is abstracted into a logical graph structure of type "2i" (intersection of two sets).
2. **Decomposition:** The "2i" query is decomposed into two fundamental sub-queries (find Nobel Prize winners; find Asians) and a final intersection operation.
3. **Retrieval:** The system uses the query type and entities/relations to perform "Neighborhood Retrieval" from the Knowledge Graph, obtaining "Relevant Subgraphs."
4. **Prompt Engineering:** A "Prompt Template" is used with the retrieved subgraphs to construct:
* A **Context Prompt** providing example (head, relation, tail) triplets.
* **Decomposed Question Prompts** that ask the LLM to solve each sub-query and the intersection.
5. **LLM Execution:** The LLM processes these prompts. It first answers the sub-queries, producing sets P1 (Nobel Prize Winners) and P2 (Asians). It then computes their intersection (P3).
6. **Answer Generation:** The final set of entity IDs (P3) is mapped back to human-readable names ("Malala Yousafzai, Rabindranath Tagore, ...") to produce the "Final Answer."
### Key Observations
* **Color Coding:** Consistent color use aids understanding: Blue for query types, Orange for logical queries/answers, Purple for the knowledge base, Yellow for prompts, Green for LLM processing.
* **Spatial Flow:** The process flows clearly from left (input) to right (output), with a major decomposition step shown in the lower center.
* **Mathematical Notation:** Uses standard logical symbols (∧ for intersection) and set notation ({t₁, t₂, ...}).
* **Example-Driven:** The entire pipeline is illustrated using a single, concrete example, making the abstract process tangible.
### Interpretation
This diagram demonstrates a **neuro-symbolic** approach to question answering. It combines:
1. **Symbolic AI:** Representing knowledge as a graph, using logical query types (p, 2i, etc.), and performing explicit decomposition and set operations (intersection).
2. **Neural AI (LLM):** Using the LLM's natural language understanding to interpret the decomposed prompts and its reasoning ability to compute the answers from the provided context.
The system's core innovation appears to be the **"Logical Chain Decomposition"** step. Instead of asking the LLM the complex question directly, it breaks the problem into simpler, verifiable sub-tasks (find set A, find set B, find A∩B). This makes the process more transparent, potentially more accurate, and allows the use of retrieved graph data as precise context for the LLM. The "Replace entity ids" step highlights a common challenge in such systems: bridging the gap between internal symbolic representations (IDs like t₁, t₂) and human-readable output. The example answer suggests the knowledge graph contains entities for individuals like Malala Yousafzai and Rabindranath Tagore, linked via relations like "winner" and "citizen."
</details>
Figure 2: An overview of the LARK model. The model takes the logical query and infers the query type from it. The query abstraction function maps the entities and relations to abstract IDs, and the neighborhood retrieval mechanism collects the relevant subgraphs from the overall knowledge graph. The chains of the abstracted complex query are then logically decomposed to simpler single-operation queries. The retrieved neighborhood and decomposed queries are further converted into LLM prompts using a template and then processed in the LLM to get the final set of answers for evaluation.
### 3.3 Chain Reasoning Prompts
In the previous section, we outlined our approach to limit the neighborhood and decompose complex queries into chains of simple queries. Leveraging these, we can now use the reasoning capability of LLMs to obtain the final set of answers for the query, as shown in Figure 2. To achieve this, we employ a prompt template that converts the neighborhood into a context prompt and the decomposed queries into question prompts. It is worth noting that certain queries in the decomposition depend on the responses of preceding queries, such as intersection relying on the preceding projection queries. Additionally, unlike previous prompting methods such as chain-of-thought (Wei et al., 2022) and decomposition (Khot et al., 2023) prompting, the answers need to be integrated at a certain position in the prompt. To address this issue, we maintain a placeholder in dependent queries and a temporary cache of preceding answers that can replace the placeholders in real-time. This also has the added benefit of maintaining the parallelizability of queries, as we can run batches of decomposed queries in phases instead of sequentially running each decomposed query. The specific prompt templates of the complex and decomposed logical queries for different query types are provided in Appendix B.
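The placeholder-and-cache mechanism described above can be sketched as follows; the `{A1}` template syntax and the function names are illustrative assumptions, not the paper's exact prompt format:

```python
import re

def fill_placeholders(prompt, answer_cache):
    """Replace placeholders like {A1} in a dependent question prompt
    with answers cached from earlier phases."""
    def sub(match):
        return ", ".join(sorted(answer_cache[match.group(1)]))
    return re.sub(r"\{(A\d+)\}", sub, prompt)

def run_phases(phases, llm, cache=None):
    """Run decomposed queries phase by phase: prompts within a phase are
    independent (and hence batchable); later phases may reference the
    answers of earlier ones through placeholders."""
    cache = cache or {}
    for phase in phases:
        for name, prompt in phase:  # in practice, submitted as one batch
            cache[name] = llm(fill_placeholders(prompt, cache))
    return cache
```

Here the first phase would hold all independent projection prompts, and a second phase the intersection/union prompts whose `{A_i}` slots are filled from the cache, preserving parallelizability within each phase.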
### 3.4 Implementation Details
We implemented LARK in PyTorch (Paszke et al., 2019) on eight Nvidia A100 GPUs with 40 GB VRAM. For the LLM, we chose the Llama2 model (Touvron et al., 2023) due to its public availability in the Huggingface library (Wolf et al., 2020). For efficient inference over the large-scale models, we relied on the mixed-precision version of LLMs and the Deepspeed library (Rasley et al., 2020) with Zero stage 3 optimization. The algorithm of our model is provided in Appendix D, and the implementation code for all our experiments, with exact configuration files and datasets for reproducibility, is publicly available at https://github.com/Akirato/LLM-KG-Reasoning. In our experiments, the highest complexity of a query required a 3-hop neighborhood around the entities and relations; hence, we set the depth limit to 3 (i.e., $k=3$ ). Additionally, to make our process fully compatible with different datasets, we impose a limit of $n$ tokens on the input, where $n$ depends on the LLM (for Llama2, $n$ =4096). In practice, this implies that we stop the depth-first traversal when the context becomes longer than $n$ tokens.
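The token-limit cutoff can be sketched as below; `tokenizer_len` stands in for any token-counting callable, and the triplet serialization format is an assumption:

```python
def build_context(triplets_in_visit_order, tokenizer_len, n=4096):
    """Accumulate neighborhood triplets into the context prompt in traversal
    order, stopping before the serialized context exceeds the LLM's token
    limit n (4096 for Llama2)."""
    context, used = [], 0
    for (h, r, t) in triplets_in_visit_order:
        piece = f"({h},{r},{t})"
        cost = tokenizer_len(piece)
        if used + cost > n:  # next triplet would overflow the budget: stop
            break
        context.append(piece)
        used += cost
    return ", ".join(context)
```

Any tokenizer works as `tokenizer_len`, e.g. `lambda s: len(hf_tokenizer(s)["input_ids"])` for a HuggingFace tokenizer.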
## 4 Experimental Results
This section describes our experiments that aim to answer the following research questions (RQs):
- Does LARK outperform the state-of-the-art baselines on the task of logical reasoning over standard knowledge graph benchmarks?
- How does our combination of chain decomposition query and logically-ordered answer mechanism perform in comparison with the standard prompting techniques?
- How does the scale and design of LARK’s underlying LLM model affect its performance?
- How would our model perform with support for increased token size?
- Does query abstraction affect the reasoning performance of our model?
### 4.1 Datasets and Baselines
We select the following standard benchmark datasets to investigate the performance of our model against state-of-the-art models on the task of logical reasoning over knowledge graphs:
- FB15k (Bollacker et al., 2008) is based on Freebase, a large collaborative knowledge graph project that was created by Google. FB15k contains about 15,000 entities, 1,345 relations, and 592,213 triplets (statements that assert a fact about an entity).
- FB15k-237 (Toutanova et al., 2015) is a subset of FB15k, containing 14,541 entities, 237 relations, and 310,116 triplets. The relations in FB15k-237 are a subset of those in FB15k; the dataset was created to address some of the limitations of FB15k, such as the presence of many irrelevant or ambiguous relations, and to provide a more challenging benchmark for knowledge graph completion models.
- NELL995 (Carlson et al., 2010) was created using the Never-Ending Language Learning (NELL) system, which is a machine learning system that automatically extracts knowledge from the web by reading text and inferring new facts. NELL995 contains 9,959 entities, 200 relations, and 114,934 triplets. The relations in NELL995 cover a wide range of domains, including geography, sports, and politics.
Our criterion for selecting the above datasets was their ubiquity in previous works on this research problem. Further details on their token sizes are provided in Appendix E. For the baselines, we chose the following methods:
- GQE (Hamilton et al., 2018) encodes a query as a single vector and represents entities and relations in a low-dimensional space. It uses translation and deep set operators, which are modeled as projection and intersection operators, respectively.
- Query2Box (Q2B) (Ren et al., 2020) uses a box embedding model which is a generalization of the traditional vector embedding model and can capture richer semantics.
- BetaE (Ren & Leskovec, 2020) uses a novel beta distribution to model the uncertainty in the representation of entities and relations. BetaE can capture both the point estimate and the uncertainty of the embeddings, which leads to more accurate predictions in knowledge graph completion tasks.
- HQE (Choudhary et al., 2021b) uses the hyperbolic query embedding mechanism to model the complex queries in knowledge graph completion tasks.
- HypE (Choudhary et al., 2021b) uses the hyperboloid model to represent entities and relations in a knowledge graph that simultaneously captures their semantic, spatial, and hierarchical features.
- CQD (Arakelyan et al., 2021) decomposes complex queries into simpler sub-queries and applies a query-specific attention mechanism to the sub-queries.
### 4.2 RQ1. Efficacy on Logical Reasoning
To study the efficacy of our model on the task of logical reasoning, we compare it against the previous baselines on the following standard logical query constructs:
1. Multi-hop Projection traverses multiple relations from a head entity in a knowledge graph to answer complex queries by projecting the query onto the target entities. In our experiments, we consider $1p,2p$ , and $3p$ queries that denote 1-relation, 2-relation, and 3-relation hop from the head entity, respectively.
1. Geometric Operations apply the operations of intersection ( $\wedge$ ) and union ( $\vee$ ) to answer the query. Our experiments use $2i$ and $3i$ queries that represent the intersection over 2 and 3 entities, respectively. Also, we study $2u$ queries that perform union over 2 entities.
1. Compound Operations integrate multiple operations such as intersection, union, and projection to handle complex queries over a knowledge graph.
1. Negation Operations negate the query by finding entities that do not satisfy the given logic. In our experiments, we examine $2in,3in,inp,$ and $pin$ queries that negate $2i,3i,ip,$ and $pi$ queries, respectively. We also analyze $pni$ (an additional variant of the $pi$ query), where the negation is over both entities in the intersection. It should be noted that BetaE is the only method in the existing literature that supports negation, and hence, we only compare against it in our experiments.
We present the results of our experimental study, which compares the Mean Reciprocal Rank (MRR) score of the retrieved candidate entities using different query constructions. MRR is calculated as the average of the reciprocal ranks of the candidate entities (additional metrics, such as HITS@K for K=1, 3, 10, are reported in Appendix C). To ensure a fair comparison, we selected the query constructions used in most of the previous works in this domain (Ren & Leskovec, 2020). An illustration of these query types is provided in Appendix A for better understanding. Our experiments show that LARK outperforms previous state-of-the-art baselines by 35% on average across different query types, as reported in Table 1. We observe that the performance improvement is higher for simpler queries, where $1p>2p>3p$ and $2i>3i$ . This suggests that LLMs are better at capturing breadth across relations but may not be as effective at capturing depth over multiple relations. Moreover, our evaluation also encompasses testing against challenging negation queries, for which BetaE (Ren & Leskovec, 2020) remains the only existing approach. Even in this complex scenario, our findings, as illustrated in Table 2, indicate that LARK significantly outperforms the baseline by 140%. This affirms the superior reasoning capabilities of our model in tackling complex query scenarios. Another point of note is that certain baselines such as CQD are able to outperform LARK on the FB15k dataset for certain query types such as $1p$ , $3i$ , and $ip$ . The reason for this is that FB15k suffers from data leakage from the training to the validation and testing sets (Toutanova et al., 2015), which unfairly benefits the training-based baselines over the inference-only LARK model.
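As a reference point, the MRR metric can be computed as follows. The convention of ranking each gold answer separately is a common one for this benchmark family; treating it as the paper's exact protocol is an assumption:

```python
def ranks_of_answers(ranked_candidates, gold):
    """Given candidates sorted best-first, return the 1-indexed rank of
    each gold answer that appears in the list."""
    return [i for i, c in enumerate(ranked_candidates, 1) if c in gold]

def mrr(ranks):
    """Mean Reciprocal Rank: the average of 1/rank over all
    (query, answer) pairs."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, if the two gold answers of a query sit at ranks 2 and 4 of the candidate list, they contribute reciprocal ranks 0.5 and 0.25.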
Table 1: Performance comparison between LARK and the baselines in terms of their efficacy of logical reasoning using MRR scores. The rows present various models, and the columns correspond to different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
| Dataset | Model | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FB15k | GQE | 54.6 | 15.3 | 10.8 | 39.7 | 51.4 | 27.6 | 19.1 | 22.1 | 11.6 |
| | Q2B | 68.0 | 21.0 | 14.2 | 55.1 | 66.5 | 39.4 | 26.1 | 35.1 | 16.7 |
| | BetaE | 65.1 | 25.7 | 24.7 | 55.8 | 66.5 | 43.9 | 28.1 | 40.1 | 25.2 |
| | HQE | 54.3 | 33.9 | 23.3 | 38.4 | 50.6 | 12.5 | 24.9 | 35.0 | 25.9 |
| | HypE | 67.3 | 43.9 | 33.0 | 49.5 | 61.7 | 18.9 | 34.7 | 47.0 | 37.4 |
| | CQD | **79.4** | 39.6 | 27.0 | **74.0** | **78.2** | **70.0** | 43.3 | 48.4 | 17.5 |
| | LARK(complex) | 73.6 | 46.5 | 32.0 | 66.9 | 61.8 | 24.8 | 47.2 | 47.7 | 37.5 |
| | LARK(ours) | 73.6 | **49.3** | **35.1** | 67.8 | 62.6 | 29.3 | **54.5** | **51.9** | **37.7** |
| FB15k-237 | GQE | 35.0 | 7.2 | 5.3 | 23.3 | 34.6 | 16.5 | 10.7 | 8.2 | 5.7 |
| | Q2B | 40.6 | 9.4 | 6.8 | 29.5 | 42.3 | 21.2 | 12.6 | 11.3 | 7.6 |
| | BetaE | 39.0 | 10.9 | 10.0 | 28.8 | 42.5 | 22.4 | 12.6 | 12.4 | 9.7 |
| | HQE | 37.6 | 20.9 | 16.9 | 25.3 | 35.2 | 17.3 | 8.2 | 15.6 | 17.9 |
| | HypE | 49.0 | 34.3 | 23.7 | 33.9 | **44.0** | 18.6 | 30.5 | 41.0 | **26.0** |
| | CQD | 44.5 | 11.3 | 8.1 | 32.0 | 42.7 | **25.3** | 15.3 | 13.4 | 4.8 |
| | LARK(complex) | **70.0** | 34.0 | 21.5 | 43.4 | 42.2 | 18.7 | 38.4 | 49.2 | 25.1 |
| | LARK(ours) | **70.0** | **36.9** | **24.5** | **44.3** | 43.1 | 23.2 | **45.6** | **56.6** | 25.4 |
| NELL995 | GQE | 32.8 | 11.9 | 9.6 | 27.5 | 35.2 | 18.4 | 14.4 | 8.5 | 8.8 |
| | Q2B | 42.2 | 14.0 | 11.2 | 33.3 | 44.5 | 22.4 | 16.8 | 11.3 | 10.3 |
| | BetaE | 53.0 | 13.0 | 11.4 | 37.6 | 47.5 | 24.1 | 14.3 | 12.2 | 8.5 |
| | HQE | 35.5 | 20.9 | 18.9 | 23.2 | 36.3 | 8.8 | 13.7 | **21.3** | 15.5 |
| | HypE | 46.0 | 30.6 | 27.9 | 33.6 | 48.6 | **31.8** | 13.5 | 20.7 | 26.4 |
| | CQD | 50.7 | 18.4 | 13.8 | 39.8 | **49.0** | 29.0 | 22.0 | 16.3 | 9.9 |
| | LARK(complex) | **83.2** | 39.8 | 27.6 | 49.3 | 48.0 | 18.7 | 19.6 | 8.3 | 36.8 |
| | LARK(ours) | **83.2** | **42.3** | **31.0** | **49.9** | 48.7 | 23.1 | **23.0** | 20.1 | **37.2** |
Table 2: Performance comparison between LARK and the baseline for negation query types using MRR scores. The best results for each query type in every dataset are highlighted in bold font. Our model’s performance is significantly higher on most negation queries. However, the performance is limited on 3in and pni queries due to their high number of tokens (shown in Appendix E).
| Dataset | Model | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- |
| FB15k | BetaE | 14.3 | **14.7** | 11.5 | 6.5 | **12.4** |
| | LARK(complex) | 16.5 | 6.2 | 32.5 | 22.8 | 10.5 |
| | LARK(ours) | **17.5** | 7.0 | **34.7** | **26.7** | 11.1 |
| FB15k-237 | BetaE | 5.1 | **7.9** | 7.4 | 3.6 | 3.4 |
| | LARK(complex) | 6.1 | 3.4 | 21.6 | 12.8 | 2.9 |
| | LARK(ours) | **7.0** | 4.1 | **23.9** | **16.8** | **3.5** |
| NELL995 | BetaE | 5.1 | **7.8** | 10.0 | 3.1 | 3.5 |
| | LARK(complex) | 8.9 | 5.3 | 23.0 | 10.4 | 6.3 |
| | LARK(ours) | **10.4** | 6.6 | **25.4** | **13.6** | **7.6** |
### 4.3 RQ2. Advantages of Chain Decomposition
The aim of this experiment is to investigate the advantages of using chain-decomposed queries over standard complex queries. We employ the same experimental setup described in Section 4.2. Our results, in Tables 1 and 2, demonstrate that utilizing chain decomposition contributes to a significant improvement of 20% in our model’s performance. This improvement is a clear indication of the LLMs’ ability to capture a broad range of relations and to effectively utilize this capability on complex queries. This study highlights the potential of chain decomposition to overcome the limitations of complex queries and improve the efficiency of logical reasoning tasks, with implications for applications such as question-answering systems and knowledge graph completion. Overall, our results suggest that chain-decomposed queries are a promising approach for improving the performance of LLMs on complex logical reasoning tasks.
### 4.4 RQ3. Analysis of LLM scale
This experiment analyzes the impact of the size of the underlying LLM on the overall LARK model performance. To examine the effect of LLM size, we compared two variants of the Llama2 model with 7 billion and 13 billion parameters. Our evaluation results, presented in Table 3, show that the performance of the LARK model improves by 123% from Llama2-7B to Llama2-13B. This indicates that increasing the number of LLM parameters can enhance the performance of the LARK model.
Table 3: MRR scores of LARK on the FB15k-237 dataset with underlying LLMs of different sizes. The best results for each query type are highlighted in bold font.
| LLM | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B | 73.1 | 33.2 | 20.6 | 10.6 | 25.2 | 25.9 | 17.2 | 20.8 | 24.3 | 4.0 | 1.8 | 14.2 | 7.4 | 1.9 |
| Llama2-13B | **73.6** | **49.3** | **35.1** | **67.8** | **62.6** | **29.3** | **54.5** | **51.9** | **37.7** | **7.0** | **4.1** | **23.9** | **16.8** | **3.5** |
### 4.5 RQ4. Study on Increased Token Limit of LLMs
From the dataset details provided in Appendix E, we observe that the token size of different query types fluctuates considerably, from $58$ to over $100,000$ . Unfortunately, the token limit of Llama2, used as the base in our experiments, is 4096, which is insufficient to demonstrate the full potential of LARK on our tasks. To address this limitation, we consider models with higher token limits, such as GPT-3.5 (OpenAI, 2023). However, these models are expensive to run, and thus we could not conduct a thorough analysis on the entire dataset. Nevertheless, to gain insight into LARK’s potential with increased token size, we randomly sampled 1000 queries per query type from each dataset, covering token lengths both over and under 4096, and compared our model on these queries with GPT-3.5 and Llama2 as the base. The evaluation results, displayed in Table 4, demonstrate that transitioning from Llama2 to GPT-3.5 leads to a significant performance improvement of 29%-40% for the LARK model, which suggests that increasing the token limit of LLMs holds significant potential for further performance enhancement.
Table 4: MRR scores of LARK with Llama2 and GPT LLMs as the underlying base models. The best results for each query type in every dataset are highlighted in bold font.
**FB15k**

| LLM | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B | 23.4 | 21.5 | 22.6 | 3.4 | 3.0 | 26.1 | 18.4 | 14.8 | 3.9 | 9.5 | 4.7 | 21.7 | 26.4 | 5.8 |
| Llama2-13B | 23.8 | 22.8 | 24.2 | 3.5 | 3.0 | 23.3 | 30.8 | 30.7 | 3.9 | 12.4 | 6.6 | 28.4 | 51.4 | 7.7 |
| GPT-3.5 | **36.1** | **34.6** | **36.8** | **17.0** | **14.4** | **35.4** | **46.7** | **39.3** | **19.5** | **18.8** | **10.0** | **43.1** | **56.7** | **11.6** |

**FB15k-237**

| LLM | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B | 23.1 | 27.4 | 31.5 | 5.0 | 4.1 | 26.6 | 20.9 | 15.3 | 5.6 | 26.6 | 8.8 | 33.7 | 31.0 | 21.1 |
| Llama2-13B | 23.5 | 29.2 | 33.8 | 5.0 | 4.1 | 23.7 | 35.0 | 31.7 | 5.6 | 34.7 | 12.3 | 44.0 | 60.4 | 28.0 |
| GPT-3.5 | **35.7** | **44.2** | **51.2** | **24.8** | **20.2** | **36.0** | **53.1** | **40.6** | **28.1** | **52.5** | **18.7** | **66.8** | **66.6** | **42.4** |

**NELL995**

| LLM | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B | 28.0 | 24.4 | 27.6 | 3.7 | 3.2 | 24.0 | 8.4 | 14.5 | 5.7 | 14.0 | 7.7 | 23.1 | 21.3 | 13.4 |
| Llama2-13B | 28.4 | 26.0 | 29.5 | 3.7 | 3.2 | 21.5 | 14.1 | 25.4 | 5.7 | 18.3 | 10.8 | 30.1 | 30.2 | 17.7 |
| GPT-3.5 | **43.1** | **39.4** | **44.8** | **18.3** | **15.5** | **32.6** | **21.4** | **38.5** | **28.3** | **27.7** | **16.4** | **45.7** | **45.9** | **26.8** |
### 4.6 RQ5. Effects of Query Abstraction
<details>
<summary>extracted/2305.01157v3/images/query_abstraction.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Performance Comparison on FB15k-237 Dataset
### Overview
This image displays a horizontal bar chart comparing the performance of two models or methods, labeled "LARK (semantic)" and "LARK (ours)", across five distinct reasoning or operation categories. The performance metric is the Mean Reciprocal Rank (MRR) score on the FB15k-237 dataset, a common benchmark for knowledge graph completion tasks.
### Components/Axes
* **Chart Type:** Horizontal grouped bar chart.
* **Y-Axis (Vertical):** Lists five categorical variables representing different types of operations or reasoning tasks. From top to bottom:
1. Negation
2. Compound Operation
3. Geometric Operation
4. Multi-hop Projection
5. Simple Projection
* **X-Axis (Horizontal):** Represents the quantitative metric, labeled **"MRR Score on FB15k-237 Dataset"**. The axis has major tick marks at 0, 20, 40, and 60.
* **Legend:** Positioned at the bottom center of the chart. It defines the two data series:
* **Orange Square:** LARK (semantic)
* **Blue Square:** LARK (ours)
* **Data Series:** For each category on the Y-axis, there are two bars placed side-by-side (grouped). The top bar in each group is orange, and the bottom bar is blue.
### Detailed Analysis
The chart presents the following approximate MRR scores for each category. Values are estimated based on bar length relative to the x-axis scale.
1. **Negation:**
* **Trend:** Both bars are the shortest on the chart, indicating the lowest performance for this task.
* **LARK (semantic) [Orange]:** ~10
* **LARK (ours) [Blue]:** ~10 (appears equal to or very slightly longer than the orange bar).
2. **Compound Operation:**
* **Trend:** Moderate performance, with both bars extending to a similar length.
* **LARK (semantic) [Orange]:** ~35
* **LARK (ours) [Blue]:** ~36 (appears marginally longer than the orange bar).
3. **Geometric Operation:**
* **Trend:** High performance for both methods, representing the second-highest scores on the chart.
* **LARK (semantic) [Orange]:** ~55
* **LARK (ours) [Blue]:** ~56 (appears slightly longer than the orange bar).
4. **Multi-hop Projection:**
* **Trend:** Moderate performance, very similar in magnitude to "Compound Operation."
* **LARK (semantic) [Orange]:** ~35
* **LARK (ours) [Blue]:** ~35 (bars appear nearly identical in length).
5. **Simple Projection:**
* **Trend:** The highest performance category for both methods, with the longest bars on the chart.
* **LARK (semantic) [Orange]:** ~65 (extends slightly past the 60 mark).
* **LARK (ours) [Blue]:** ~64 (appears very slightly shorter than the orange bar).
### Key Observations
* **Performance Hierarchy:** There is a clear hierarchy of task difficulty as measured by MRR score: Simple Projection > Geometric Operation > Compound Operation ≈ Multi-hop Projection > Negation.
* **Model Comparison:** The performance of "LARK (ours)" and "LARK (semantic)" is extremely close across all five categories; the differences are within roughly one point, with slight leads alternating between the two methods.
* **Notable Anomaly:** The "Simple Projection" category is the only one where "LARK (semantic)" (orange) appears to have a slight but visible performance advantage over "LARK (ours)" (blue). In all other categories, "LARK (ours)" is either equal or marginally better.
* **Consistency:** Both methods show a consistent pattern of strengths and weaknesses across the different operation types.
### Interpretation
The chart compares the abstracted model, "LARK (ours)", against the ablation "LARK (semantic)", which retains semantic names for KG entities and relations, on the FB15k-237 benchmark. The scores are nearly identical across all operation types: retaining semantics yields a slight edge only on Simple Projection, while the abstracted variant matches or marginally exceeds it elsewhere. The consistently low scores for Negation mark it as the most challenging reasoning type for both variants.
</details>
Figure 3: Effects of Query Abstraction.
To analyze the effect of query abstraction, we considered a variant of LARK called ‘LARK (semantic)’, which retains the semantic information in KG entities and relations. As shown in Figure 3, semantic information provides a minor performance enhancement of $0.01\%$ on simple projection queries. However, on more complex queries, it results in a performance degradation of $0.7\%$. The primary cause of this degradation is that the inclusion of semantic information exceeds the LLMs’ token limit, leading to a loss of neighborhood information. Hence, we assert that query abstraction is not only a valuable technique for mitigating model hallucination and achieving generalization across KG datasets, but can also enhance performance by reducing token size.
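The abstraction step can be sketched as follows. This is an illustrative example (the function and variable names are our own, not the paper's implementation): semantic entity and relation names are replaced with compact numeric identifiers before prompting, which shortens the prompt and prevents the LLM from leaning on memorized surface forms; the inverse map translates the LLM's answers back to real entities.

```python
def abstract_query(triplets, query_entities, query_relations):
    """Map semantic entity/relation names to compact numeric IDs.

    Returns abstracted triplets and query, plus the inverse entity map
    needed to translate the LLM's ID-based answers back to entity names.
    """
    ent_ids, rel_ids = {}, {}

    def eid(e):
        return ent_ids.setdefault(e, len(ent_ids))

    def rid(r):
        return rel_ids.setdefault(r, len(rel_ids))

    abs_triplets = [(eid(h), rid(r), eid(t)) for h, r, t in triplets]
    abs_query = ([eid(e) for e in query_entities],
                 [rid(r) for r in query_relations])
    inv_ent = {v: k for k, v in ent_ids.items()}
    return abs_triplets, abs_query, inv_ent


# Hypothetical toy neighborhood for a 2p query.
triplets = [("Barack Obama", "born_in", "Honolulu"),
            ("Honolulu", "located_in", "Hawaii")]
abs_t, abs_q, inv = abstract_query(triplets, ["Barack Obama"], ["born_in"])
# The prompt now carries only short IDs, e.g. abs_t == [(0, 0, 1), (1, 1, 2)]
```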
## 5 Concluding Discussion
In this paper, we presented LARK, the first approach to integrate logical reasoning over knowledge graphs with the capabilities of LLMs. Our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Through our experiments on logical reasoning across standard KG datasets, we demonstrated that LARK outperforms previous state-of-the-art approaches by a significant margin on 14 different FOL query types. Finally, our work also showed that the performance of LARK improves with increasing scale and better design of the underlying LLMs. We demonstrated that LLMs that can handle larger input token lengths can lead to significant performance improvements. Overall, our approach presents a promising direction for integrating LLMs with logical reasoning over knowledge graphs.
The proposed approach of using LLMs for complex logical reasoning over KGs is expected to pave the way for improved reasoning over large, noisy, and incomplete real-world KGs, with potential impact on applications such as natural language understanding, question answering, and intelligent information retrieval. For example, in healthcare, KGs can represent patient data, medical knowledge, and clinical research, and logical reasoning over these KGs can enable better diagnosis, treatment, and drug discovery. However, some ethical considerations must also be taken into account. As with most AI-based technologies, there is a risk of inducing bias into the model, which can lead to unfair decisions and actions. Bias can be introduced in the KGs themselves, as they are often created semi-automatically from biased sources, and can be amplified by the logical reasoning process. Moreover, the large corpora used to train LLMs can also introduce bias, as they may reflect societal prejudices and stereotypes. Therefore, it is essential to carefully monitor and evaluate the KGs and LLMs used in this approach to ensure fairness and avoid discrimination. The performance of this method also depends on the quality and completeness of the KGs used, as well as on the limited token size of current LLMs; however, the current trend of increasing LLM token limits should soon alleviate some of these limitations.
## References
- Arakelyan et al. (2021) Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. Complex query answering with neural link predictors. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Mos9F9kDwkz.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 1247–1250, New York, NY, USA, 2008. Association for Computing Machinery. URL https://doi.org/10.1145/1376616.1376746.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pp. 1306–1313. AAAI Press, 2010.
- Chen et al. (2022) Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. Decoupling knowledge from memorization: Retrieval-augmented prompt learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Q8GnGqT-GTJ.
- Choudhary et al. (2021a) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan Reddy. Probabilistic entity representation model for reasoning over knowledge graphs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 23440–23451. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/c4d2ce3f3ebb5393a77c33c0cd95dc93-Paper.pdf.
- Choudhary et al. (2021b) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K. Reddy. Self-supervised hyperboloid representations from logical queries over knowledge graphs. In Proceedings of the Web Conference 2021, WWW ’21, pp. 1373–1384, New York, NY, USA, 2021b. Association for Computing Machinery. URL https://doi.org/10.1145/3442381.3449974.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3Pf3Wg6o-A4.
- Das et al. (2017) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 132–141, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-1013.
- Dong et al. (2023) Junnan Dong, Qinggang Zhang, Xiao Huang, Keyu Duan, Qiaoyu Tan, and Zhimeng Jiang. Hierarchy-aware multi-hop question answering over knowledge graphs. In Proceedings of the Web Conference 2023, WWW ’23, New York, NY, USA, 2023. Association for Computing Machinery. URL https://doi.org/10.1145/3543507.3583376.
- Dua et al. (2022) Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing complex questions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1251–1265, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.81.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- Hamilton et al. (2018) Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/ef50c335cca9f340bde656363ebd02fd-Paper.pdf.
- Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.82.
- Khot et al. (2023) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_nGgzQjzaRy.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 809–816, Madison, WI, USA, 2011. Omnipress.
- OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. URL https://doi.org/10.1145/3394486.3406703.
- Ren & Leskovec (2020) Hongyu Ren and Jure Leskovec. Beta embeddings for multi-hop logical reasoning in knowledge graphs. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
- Ren et al. (2020) Hongyu Ren, Weihua Hu, and Jure Leskovec. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgr4kSFDS.
- Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 697–706, New York, NY, USA, 2007. Association for Computing Machinery. URL https://doi.org/10.1145/1242572.1242667.
- Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL https://aclanthology.org/D15-1174.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-demos.6.
- Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 535–546, Online, June 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.naacl-main.45.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM.
## Appendix
## Appendix A Query Decomposition of Different Query Types
Figure 4 provides the query decomposition of different query types considered in our empirical study as well as previous literature in the area.
<details>
<summary>extracted/2305.01157v3/images/query_decomposition.png Details</summary>

### Visual Description
## Diagram Set: Logical Query Decomposition Patterns
### Overview
The image displays a collection of 14 distinct diagrams arranged in a grid pattern (4 rows, with the last row containing 2 diagrams and a legend). Each diagram is labeled with a short alphanumeric code (e.g., "1p", "2i", "pin") and illustrates a transformation from an initial logical query structure (top) to a decomposed or executed form (bottom), indicated by a large downward-pointing arrow. The diagrams use a consistent visual language of colored shapes and symbols to represent entities, relations, and logical operations, likely in the context of knowledge graph querying or semantic reasoning.
### Components/Axes
There are no traditional chart axes. The components are defined by a legend located in the bottom-right corner of the image.
**Legend (Bottom-Right):**
* **Green Circle (r):** "Projection with Relation r"
* **Orange Circle (e):** "Question Entities"
* **Orange Square (A):** "Answer Entities"
* **Blue Circle (Λ):** "Intersection over the entity sets"
* **Purple Circle (V):** "Union over the entity sets"
* **Pink Circle (¬):** "Negation of a relation"
**Diagram Labels (Top of each diagram):**
The 14 diagrams are labeled as follows, reading left-to-right, top-to-bottom:
1. **1p**
2. **2p**
3. **3p**
4. **2i**
5. **3i**
6. **ip**
7. **pi**
8. **2u**
9. **up**
10. **2in**
11. **3in**
12. **inp**
13. **pin**
14. **pni**
### Detailed Analysis
Each diagram follows a pattern: an initial query graph at the top transforms into a set of simpler, executable sub-queries at the bottom. The transformation is indicated by a large, white, downward-pointing arrow.
**Diagram-by-Diagram Breakdown:**
1. **1p (Single Projection):**
* **Initial:** `e1 -> r1 -> A`
* **Transformed:** `e1 -> r1 -> A1` (A single sub-query).
2. **2p (Two-hop Projection):**
* **Initial:** `e1 -> r1 -> r2 -> A`
* **Transformed:** Two sub-queries: `e1 -> r1 -> A1` and `A1 -> r2 -> A`.
3. **3p (Three-hop Projection):**
* **Initial:** `e1 -> r1 -> r2 -> r3 -> A`
* **Transformed:** Three sub-queries: `e1 -> r1 -> A1`, `A1 -> r2 -> A2`, `A2 -> r3 -> A`.
4. **2i (Two-entity Intersection):**
* **Initial:** Two paths (`e1 -> r1` and `e2 -> r2`) converge via an Intersection (Λ) to `A`.
* **Transformed:** Three sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, and `A1, A2 -> Λ -> A`.
5. **3i (Three-entity Intersection):**
* **Initial:** Three paths (`e1 -> r1`, `e2 -> r2`, `e3 -> r3`) converge via an Intersection (Λ) to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, `e3 -> r3 -> A3`, and `A1, A2, A3 -> Λ -> A4`.
6. **ip (Intersection then Projection):**
* **Initial:** Two paths (`e1 -> r1`, `e2 -> r2`) intersect (Λ), then project via `r3` to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, `A1, A2 -> Λ -> A3`, and `A3 -> r3 -> A`.
7. **pi (Projection then Intersection):**
* **Initial:** Two paths: one is `e1 -> r1 -> r2`, the other is `e2 -> r3`. They intersect (Λ) to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `A1 -> r2 -> A2`, `e2 -> r3 -> A3`, and `A2, A3 -> Λ -> A4`.
8. **2u (Two-entity Union):**
* **Initial:** Two paths (`e1 -> r1`, `e2 -> r2`) converge via a Union (V) to `A`.
* **Transformed:** Three sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, and `A1, A2 -> V -> A`.
9. **up (Union then Projection):**
* **Initial:** Two paths (`e1 -> r1`, `e2 -> r2`) unite (V), then project via `r3` to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, `A1, A2 -> V -> A3`, and `A3 -> r3 -> A`.
10. **2in (Two-entity Intersection with Negation):**
* **Initial:** Two paths: `e1 -> r1` and `e2 -> r2 -> ¬` (negation). They intersect (Λ) to `A`.
* **Transformed:** Three sub-queries: `e1 -> r1 -> A1`, `e2 -> ¬r2 -> A2`, and `A1, A2 -> Λ -> A`.
11. **3in (Three-entity Intersection with Negation):**
* **Initial:** Three paths: `e1 -> r1`, `e2 -> r2`, `e3 -> r3 -> ¬`. They intersect (Λ) to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `e2 -> r2 -> A2`, `e3 -> ¬r3 -> A3`, and `A1, A2, A3 -> Λ -> A4`.
12. **inp (Intersection with Negation, then Projection):**
* **Initial:** Two paths (`e1 -> r1`, `e2 -> r2 -> ¬`) intersect (Λ), then project via `r3`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `e2 -> ¬r2 -> A2`, `A1, A2 -> Λ -> A3`, and `A3 -> r3 -> A`.
13. **pin (Projection, then Intersection with Negation):**
* **Initial:** Two paths: one is `e1 -> r1 -> r2 -> ¬`, the other is `e2 -> r3`. They intersect (Λ) to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `A1 -> ¬r2 -> A2`, `e2 -> r3 -> A3`, and `A2, A3 -> Λ -> A4`.
14. **pni (Projection, Negation, then Intersection):**
* **Initial:** Two paths: one is `e1 -> r1 -> r2`, the other is `e2 -> r3 -> ¬`. They intersect (Λ) to `A`.
* **Transformed:** Four sub-queries: `e1 -> r1 -> A1`, `A1 -> r2 -> A2`, `e2 -> ¬r3 -> A3`, and `A2, A3 -> Λ -> A4`.
### Key Observations
* **Naming Convention:** The labels appear to be abbreviations describing the query structure: `p` for projection/hop, `i` for intersection, `u` for union, `n` for negation. The number indicates the count (e.g., `2i` = two-entity intersection).
* **Consistent Transformation:** Every diagram shows a complex query being decomposed into a sequence of simpler, likely executable, sub-queries. The final sub-query always produces the ultimate answer `A`.
* **Logical Operators:** The diagrams explicitly model core logical operations (Intersection Λ, Union V, Negation ¬) as nodes within the query graph.
* **Color Coding:** Strict adherence to the legend's color scheme is maintained throughout all diagrams for visual consistency.
### Interpretation
The figure catalogs the query shapes used in this work and shows how each complex query, combining multi-hop paths, set operations (intersection, union), and negation, is systematically decomposed into a sequence of elementary, executable sub-queries. The final sub-query in each decomposition produces the answer set `A`; negation is the most intricate case, since it expresses exclusion rather than a positive projection.
</details>
Figure 4: Query Decomposition of different query types considered in our experiments.
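To make the decomposition concrete, the following sketch (illustrative only, with a hypothetical toy KG; not the paper's implementation) executes a few of the query shapes above by composing the elementary set operations of projection, intersection, and union over an adjacency index.

```python
from collections import defaultdict

# Toy KG: index from (head, relation) -> set of tails.
edges = [(0, "r1", 1), (0, "r1", 2), (1, "r2", 3), (2, "r2", 3), (4, "r3", 2)]
index = defaultdict(set)
for h, r, t in edges:
    index[(h, r)].add(t)

def project(entities, relation):
    """Elementary projection: all tails reachable from `entities` via `relation`."""
    return set().union(*(index[(e, relation)] for e in entities)) if entities else set()

# 2p: e0 --r1--> X --r2--> A, as two chained projections.
a1 = project({0}, "r1")          # intermediate answer set A1
ans_2p = project(a1, "r2")

# 2i: (e0 --r1--> X) AND (e4 --r3--> X), as two projections plus an intersection.
ans_2i = project({0}, "r1") & project({4}, "r3")

# 2u: same two branches combined with a union instead.
ans_2u = project({0}, "r1") | project({4}, "r3")
```

Negation ("any relation other than r") is handled analogously by taking the complement of a projection within the candidate entity set.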
## Appendix B Prompt Templates of Different Query Types
The prompt templates for full complex logical queries with multiple operations and decomposed elementary logical queries with single operation are provided in Tables 5 and 6, respectively.
Table 5: Full Prompt Templates of Different Query Types.
| Context | $N_k(q_τ[Q_τ])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r: $(h_1,r_1,t_1),(h_2,r_2,t_2),(h_3,r_3,t_3),(h_4,r_4,t_4),(h_5,r_5,t_5),(h_6,r_6,t_6),(h_7,r_7,t_7),(h_8,r_8,t_8)$ |
| --- | --- | --- |
| 1p | $∃ X.r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| 2p | $∃ X.r_1(X,∃ Y.r_2(Y,e_1)$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ . Then, what are the entities connected to E by relation $r_2$ ? |
| 3p | $∃ X.r_1(X,∃ Y.r_2(Y,∃ Z.r_3(Z,e_1)$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ and the set of entities F is connected to entities in E by relation $r_2$ . Then, what are the entities connected to F by relation $r_3$ ? |
| 2i | $∃ X.[r_1(X,e_1)\wedge r_2(X,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ and the set of entities F is connected to entity $e_2$ by relation $r_2$ . Then, what are the entities in the intersection of sets E and F, i.e., entities present in both E and F? |
| 3i | $∃ X.[r_1(X,e_1)\wedge r_2(X,e_2)\wedge r_3(X,e_3)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , the set of entities F is connected to entity $e_2$ by relation $r_2$ and the set of entities G is connected to entity $e_3$ by relation $r_3$ . Then, what are the entities in the intersection of set E, F and G, i.e., entities present in all E, F and G? |
| ip | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)\wedge r_2(Y,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , F is the set of entities connected to entity $e_2$ by relation $r_2$ , and G is the set of entities in the intersection of E and F. Then, what are the entities connected to entities in set G by relation $r_3$ ? |
| pi | $∃ X.[r_1(X,∃ Y.r_2(Y,e_2))\wedge r_3(X,e_3)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , F is the set of entities connected to entities in E by relation $r_2$ , and G is the set of entities connected to entity $e_2$ by relation $r_3$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
| 2u | $∃ X.[r_1(X,e_1)\vee r_2(X,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ and F is the set of entities connected to entity $e_2$ by relation $r_2$ . Then, what are the entities in the union of sets E and F, i.e., entities present in either E or F? |
| up | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)\vee r_2(Y,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ and F is the set of entities connected to entity $e_2$ by relation $r_2$ . G is the set of entities in the union of E and F. Then, what are the entities connected to entities in G by relation $r_3$ ? |
| 2in | $∃ X.[r_1(X,e_1)\wedge¬ r_2(X,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ and F is the set of entities connected to entity $e_2$ by any relation other than relation $r_2$ . Then, what are the entities in the intersection of sets E and F, i.e., entities present in both E and F? |
| 3in | $∃ X.[r_1(X,e_1)\wedge r_2(X,e_2)\wedge¬ r_3(X,e_3)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , F is the set of entities connected to entity $e_2$ by relation $r_2$ , and G is the set of entities connected to entity $e_3$ by any relation other than relation $r_3$ . Then, what are the entities in the intersection of sets E, F, and G, i.e., entities present in all of E, F, and G? |
| inp | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)\wedge¬ r_2(Y,e_2)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , and F is the set of entities connected to entity $e_2$ by any relation other than relation $r_2$ . Then, what are the entities that are connected to the entities in the intersection of set E and F by relation $r_3$ ? |
| pin | $∃ X.[r_1(X,∃ Y.¬ r_2(Y,e_2))\wedge r_3(X,e_3)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , F is the set of entities connected to entities in E by relation $r_2$ , and G is the set of entities connected to entity $e_2$ by any relation other than relation $r_3$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
| pni | $∃ X.[r_1(X,∃ Y.¬ r_2(Y,e_2))\wedge¬ r_3(X,e_3)]$ | Let us assume that the set of entities E is connected to entity $e_1$ by relation $r_1$ , F is the set of entities connected to entities in E by any relation other than $r_2$ , and G is the set of entities connected to entity $e_2$ by relation $r_3$ . Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G? |
Table 6: Decomposed Prompt Templates of Different Query Types.
| Context | $N_k(q_τ[Q_τ])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r: $(h_1,r_1,t_1),(h_2,r_2,t_2),(h_3,r_3,t_3),(h_4,r_4,t_4),(h_5,r_5,t_5),(h_6,r_6,t_6),(h_7,r_7,t_7),(h_8,r_8,t_8)$ |
| --- | --- | --- |
| 1p | $∃ X.r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| 2p | $∃ X.r_1(X,∃ Y.$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $r_2(Y,e_1)$ | Which entities are connected to any entity in [PP1] by relation $r_2$ ? | |
| 3p | $∃ X.r_1(X,∃ Y$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $.r_2(Y,∃ Z.$ | Which entities are connected to any entity in [PP1] by relation $r_2$ ? | |
| $r_3(Z,e_1)$ | Which entities are connected to any entity in [PP2] by relation $r_3$ ? | |
| 2i | $∃ X.[r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\wedge r_2(X,e_2)]$ | Which entities are connected to $e_2$ by relation $r_2$ ? | |
| What are the entities in the intersection of entity sets [PP1] and [PP2]? | | |
| 3i | $∃ X.[r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\wedge r_2(X,e_2)$ | Which entities are connected to $e_2$ by relation $r_2$ ? | |
| $\wedge r_3(X,e_3)]$ | Which entities are connected to $e_3$ by relation $r_3$ ? | |
| What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]? | | |
| ip | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\wedge r_2(Y,e_2)]$ | Which entities are connected to $e_2$ by relation $r_2$ ? | |
| What are the entities in the intersection of entity sets [PP1] and [PP2]? | | |
| What are the entities connected to any entity in [PP3] by relation $r_3$ ? | | |
| pi | $∃ X.[r_1(X,∃ Y.r_2(Y,e_2))$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\wedge r_3(X,e_3)]$ | Which entities are connected to [PP1] by relation $r_2$ ? | |
| Which entities are connected to $e_2$ by relation $r_3$ ? | | |
| What are the entities in the intersection of entity sets [PP2] and [PP3]? | | |
| 2u | $∃ X.[r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\vee r_2(X,e_2)]$ | Which entities are connected to $e_2$ by relation $r_2$ ? | |
| What are the entities in the union of entity sets [PP1] and [PP2]? | | |
| up | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$ ? |
| $\vee r_2(Y,e_2)]$ | Which entities are connected to $e_2$ by relation $r_2$ ? | |
| What are the entities in the union of entity sets [PP1] and [PP2]? | | |
| Which entities are connected to any entity in [PP3] by relation $r_3$ ? | | |
| 2in | $∃ X.[r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$? |
| | $\wedge¬ r_2(X,e_2)]$ | Which entities are connected to $e_2$ by any relation other than $r_2$? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| 3in | $∃ X.[r_1(X,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$? |
| | $\wedge r_2(X,e_2)$ | Which entities are connected to $e_2$ by relation $r_2$? |
| | $\wedge¬ r_3(X,e_3)]$ | Which entities are connected to $e_3$ by any relation other than $r_3$? |
| | | What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]? |
| inp | $∃ X.r_3(X,∃ Y.[r_1(Y,e_1)$ | Which entities are connected to $e_1$ by relation $r_1$? |
| | $\wedge¬ r_2(Y,e_2)])$ | Which entities are connected to $e_2$ by any relation other than $r_2$? |
| | | What are the entities in the intersection of entity sets [PP1] and [PP2]? |
| | | What are the entities connected to any entity in [PP3] by relation $r_3$? |
| pin | $∃ X.[r_2(X,∃ Y.r_1(Y,e_1))$ | Which entities are connected to $e_1$ by relation $r_1$? |
| | $\wedge¬ r_3(X,e_2)]$ | Which entities are connected to any entity in [PP1] by relation $r_2$? |
| | | Which entities are connected to $e_2$ by any relation other than $r_3$? |
| | | What are the entities in the intersection of entity sets [PP2] and [PP3]? |
| pni | $∃ X.[¬ r_2(X,∃ Y.r_1(Y,e_1))$ | Which entities are connected to $e_1$ by relation $r_1$? |
| | $\wedge r_3(X,e_2)]$ | Which entities are connected to any entity in [PP1] by any relation other than $r_2$? |
| | | Which entities are connected to $e_2$ by relation $r_3$? |
| | | What are the entities in the intersection of entity sets [PP2] and [PP3]? |
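The decompositions in the table above are mechanical given the query structure. As a minimal illustration, the following sketch expands a 2i query, $∃ X.[r_1(X,e_1) \wedge r_2(X,e_2)]$, into its chain of sub-questions. The helper name `decompose_2i` and the entity/relation identifiers are ours for illustration; LARK's actual prompts are built over the abstracted KG identifiers.

```python
# Hypothetical sketch: expand a 2i query, ∃X.[r1(X,e1) ∧ r2(X,e2)],
# into the chained sub-questions listed in the table above.
# [PP1]/[PP2] are placeholders for answers of earlier sub-questions.

def decompose_2i(e1, r1, e2, r2):
    return [
        f"Which entities are connected to {e1} by relation {r1}?",  # -> [PP1]
        f"Which entities are connected to {e2} by relation {r2}?",  # -> [PP2]
        "What are the entities in the intersection of entity sets "
        "[PP1] and [PP2]?",
    ]

questions = decompose_2i("e42", "r7", "e13", "r9")
```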
## Appendix C Analysis of Logical Reasoning Performance using HITS Metric
Tables 7 and 8 present the HITS@K results (K = 1, 3, 10) of the baseline and our model. HITS@K measures the fraction of queries for which a correct candidate appears among the top-K predictions.
Table 7: Performance comparison between LARK and the baseline, evaluating logical reasoning efficacy with HITS@K (K = 1, 3, 10) scores. The rows correspond to the models, and the columns denote the different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
| Dataset | Variant | 1p | 2p | 3p | 2i | 3i | ip | pi | 2u | up |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HITS@1 | | | | | | | | | | |
| FB15k | Llama2-7B | 74.6 | 26 | 18.5 | 59.9 | 47.7 | 2.4 | 5.7 | 5.8 | 5 |
| | complex | 77.5 | 37.9 | 26.3 | 67.4 | 54.6 | 8.2 | 20.7 | 20.7 | 17.6 |
| | step | 77.5 | 41.8 | 28.1 | 70.2 | 57.3 | 10.3 | 24.3 | 22.8 | 17.8 |
| FB15k-237 | Llama2-7B | 77.2 | 28.5 | 17.7 | 10.9 | 22.6 | 10.8 | 8.7 | 10.5 | 13.2 |
| | complex | 78.5 | 30.8 | 19.3 | 41.1 | 38.1 | 9.6 | 18.7 | 24.2 | 14.0 |
| | step | 78.5 | 34.3 | 21.3 | 43.2 | 40.2 | 11.7 | 22.2 | 27.9 | 14.2 |
| NELL995 | Llama2-7B | 86.4 | 28.3 | 19.6 | 10.2 | 24 | 8.6 | 3.5 | 1.5 | 15.9 |
| | complex | 88.0 | 30.9 | 21.7 | 44.1 | 41.6 | 7.4 | 8.2 | 3.3 | 17 |
| | step | 88.0 | 34.3 | 24.0 | 46.1 | 43.8 | 9.5 | 9.8 | 8.9 | 17.3 |
| HITS@3 | | | | | | | | | | |
| FB15k | Llama2-7B | 74 | 53.4 | 34.6 | 18.2 | 36.4 | 44.7 | 39.4 | 35.7 | 77.1 |
| | complex | 77.7 | 57.6 | 37.9 | 68.5 | 61.3 | 39.6 | 84.8 | 82.9 | 81.7 |
| | step | 77.7 | 57.4 | 40.1 | 69.4 | 62.5 | 48.4 | 91.2 | 92.7 | 82.6 |
| FB15k-237 | Llama2-7B | 75.9 | 42.6 | 25.7 | 12.6 | 25.9 | 43.6 | 35.1 | 42.9 | 53.8 |
| | complex | 78.3 | 45.9 | 28.1 | 47.2 | 43.7 | 38.7 | 75.6 | 89.4 | 57 |
| | step | 78.3 | 45.9 | 29.8 | 48.2 | 44.6 | 47.3 | 80.0 | 93.6 | 57.6 |
| NELL995 | Llama2-7B | 85.6 | 42.9 | 28.7 | 11.8 | 27.6 | 34.6 | 14.1 | 5.7 | 65 |
| | complex | 87.8 | 46.8 | 31.6 | 50.7 | 47.9 | 29.8 | 32.9 | 13.2 | 69.4 |
| | step | 87.8 | 45.7 | 33.5 | 51.3 | 48.7 | 38.1 | 39.6 | 35.8 | 70.3 |
| HITS@10 | | | | | | | | | | |
| FB15k | Llama2-7B | 73.6 | 53.9 | 35.7 | 18.1 | 36.3 | 44.6 | 39.5 | 35.7 | 77.1 |
| | complex | 77.7 | 58.2 | 39.1 | 68.2 | 61.4 | 39.5 | 85 | 82.9 | 81.7 |
| | step | 77.7 | 57.4 | 46.0 | 69.4 | 62.5 | 48.2 | 91.2 | 84.7 | 82.6 |
| FB15k-237 | Llama2-7B | 75.2 | 43 | 26.5 | 12.6 | 25.9 | 43.6 | 35.1 | 42.9 | 53.8 |
| | complex | 78.3 | 46.4 | 29 | 47.3 | 43.8 | 38.7 | 75.6 | 89.4 | 57 |
| | step | 78.3 | 45.9 | 34.1 | 48.2 | 44.6 | 47.3 | 80.0 | 93.6 | 57.6 |
| NELL995 | Llama2-7B | 84.9 | 43.4 | 29.2 | 11.8 | 27.6 | 34.6 | 14.1 | 5.7 | 65 |
| | complex | 87.8 | 47.4 | 32.2 | 50.8 | 48 | 29.8 | 32.9 | 13.2 | 69.4 |
| | step | 87.8 | 45.7 | 38.3 | 51.3 | 48.7 | 38.1 | 39.6 | 35.8 | 70.3 |
Table 8: Performance comparison between LARK and the baseline for negation query types using HITS@K=1,3,10 scores. The best results for each query type in every dataset are given in bold font.
| Dataset | Variant | 2in | 3in | inp | pin | pni |
| --- | --- | --- | --- | --- | --- | --- |
| HITS@1 | | | | | | |
| FB15k | Llama2-7B | 1.8 | 0.7 | 4.0 | 2.1 | 0.9 |
| | complex | 6.7 | 2.4 | 14.2 | 7.8 | 3.3 |
| | step | 7.4 | 2.7 | 14.9 | 9.1 | 3.4 |
| FB15k-237 | Llama2-7B | 1.9 | 0.8 | 6.8 | 2.8 | 0.7 |
| | complex | 2.7 | 1.4 | 9.8 | 4.6 | 1 |
| | step | 3.2 | 1.7 | 10.6 | 5.8 | 1.1 |
| NELL995 | Llama2-7B | 2.8 | 1.4 | 7.2 | 2.2 | 1.5 |
| | complex | 3.9 | 2.3 | 10.2 | 3.7 | 2.2 |
| | step | 4.6 | 2.8 | 11.1 | 4.7 | 2.7 |
| HITS@3 | | | | | | |
| FB15k | Llama2-7B | 18.6 | 5.7 | 40.8 | 18.8 | 8.6 |
| | complex | 26.6 | 9.5 | 59.2 | 30.3 | 12.3 |
| | step | 31.0 | 12.1 | 64.8 | 38.7 | 14.4 |
| FB15k-237 | Llama2-7B | 7.5 | 3.5 | 27.3 | 11.6 | 2.7 |
| | complex | 10.8 | 5.8 | 39.6 | 18.7 | 3.9 |
| | step | 12.6 | 7.4 | 43.3 | 23.9 | 4.6 |
| NELL995 | Llama2-7B | 11.2 | 6 | 29.1 | 9.2 | 6.2 |
| | complex | 16.1 | 9.4 | 41.8 | 15.1 | 9 |
| | step | 18.5 | 12.0 | 46.0 | 19.3 | 10.9 |
| HITS@10 | | | | | | |
| FB15k | Llama2-7B | 18.6 | 5.7 | 40.8 | 18.8 | 8.6 |
| | complex | 26.6 | 9.5 | 59.3 | 30.3 | 12.4 |
| | step | 31.0 | 12.1 | 64.8 | 38.7 | 14.4 |
| FB15k-237 | Llama2-7B | 7.5 | 3.5 | 27.3 | 11.6 | 2.7 |
| | complex | 10.8 | 5.8 | 39.6 | 18.7 | 3.9 |
| | step | 12.6 | 7.4 | 43.3 | 23.9 | 4.6 |
| NELL995 | Llama2-7B | 11.2 | 6 | 29.1 | 9.2 | 6.2 |
| | complex | 16.1 | 9.4 | 41.8 | 15.1 | 9 |
| | step | 18.5 | 12.0 | 46.0 | 19.3 | 10.9 |
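For reference, the HITS@K metric reported in Tables 7 and 8 can be computed along the following lines. This is a minimal sketch with hypothetical argument names; the exact ranking and filtering protocol of the evaluation harness is not reproduced here.

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries with at least one correct answer in the top-K.

    ranked_predictions: one ranked candidate list per query.
    gold_answers: one set of correct entities per query.
    """
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)
```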
## Appendix D Algorithm
The procedure followed by LARK is outlined in Algorithm 1.
Input: Logical query $q_τ$, Knowledge Graph $G: E × R$;
Output: Answer entities $V_τ$;
# Query Abstraction: map entities and relations to IDs
$q_τ = Abstract(q_τ)$;
$G = Abstract(G)$;
# Neighborhood Retrieval
$N_k(q_τ[Q_τ]) = \left\{(h,r,t)\right\}$, using Eq. (7);
# Query Decomposition
$q^d_τ = Decomp(q_τ)$;
# Initialize answer cache
$ans = \{\}$;
for $i ∈ 1:length(q^d_τ)$ do
&emsp;# Replace cached answers in the question
&emsp;$q^d_τ[i] = replace(q^d_τ[i], ans[i-1])$;
&emsp;$ans[i] = LLM(q^d_τ[i])$;
end for
return $ans[length(q^d_τ)]$
Algorithm 1: LARK Algorithm
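The steps of Algorithm 1 can be sketched as follows. All helper names here (`abstract_ids`, `retrieve_neighborhood`, `run_lark`, and the `[PP]` placeholder convention) are hypothetical stand-ins; LARK's actual prompt construction, k-hop retrieval of Eq. (7), and LLM backend follow the paper.

```python
# Hypothetical sketch of Algorithm 1 (LARK): abstraction, neighborhood
# retrieval, and chained LLM calls over decomposed sub-questions.

def abstract_ids(query_tokens, id_map):
    # Query Abstraction: replace entity/relation names with numeric IDs
    # so the LLM cannot rely on memorized surface forms.
    return [id_map.get(tok, tok) for tok in query_tokens]

def retrieve_neighborhood(triples, anchors, k=1):
    # Neighborhood Retrieval: keep (h, r, t) triples within k hops of
    # the query's anchor entities (a stand-in for Eq. (7)).
    frontier, kept = set(anchors), set()
    for _ in range(k):
        hop = {(h, r, t) for (h, r, t) in triples
               if h in frontier or t in frontier}
        kept |= hop
        frontier |= {h for h, _, _ in hop} | {t for _, _, t in hop}
    return kept

def run_lark(decomposed_questions, llm):
    # Chained reasoning: each sub-question may reference the previous
    # answer through a [PP] placeholder in its text.
    ans = None
    for question in decomposed_questions:
        if ans is not None:
            question = question.replace("[PP]", str(ans))
        ans = llm(question)
    return ans  # answer to the final sub-question
```

Here `llm` is any callable mapping a prompt to an answer; the placeholder substitution mirrors the answer-cache replacement loop of Algorithm 1.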
Table 9: Details of the token distribution for various query types in different datasets. The columns present the mean, median, minimum (Min), and maximum (Max) values of the number of tokens in the queries of each query type. Column ‘Cov’ presents the percentage of queries (coverage) that contain fewer than 4096 tokens, the token limit of the Llama2 model.
| Query Type | Mean | Median | Min | Max | Cov | Mean | Median | Min | Max | Cov | Mean | Median | Min | Max | Cov |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1p | 70.2 | 61 | 58 | 10338 | 100 | 82.1 | 61 | 58 | 30326 | 99.9 | 81.7 | 61 | 58 | 30250 | 99.9 |
| 2p | 331.2 | 106 | 86 | 27549 | 97.1 | 1420.9 | 140 | 83 | 130044 | 89.7 | 893.4 | 136 | 83 | 108950 | 90.9 |
| 3p | 785.2 | 165 | 103 | 80665 | 91 | 3579.8 | 329 | 103 | 208616 | 75.7 | 3052.6 | 389 | 100 | 164545 | 73.7 |
| 2i | 1136.7 | 276 | 119 | 20039 | 86.3 | 4482.8 | 636 | 119 | 60655 | 67.7 | 4469.3 | 680 | 119 | 54916 | 67.3 |
| 3i | 2575.4 | 860 | 145 | 29148 | 68.4 | 8760.2 | 2294 | 145 | 85326 | 48.3 | 8979.4 | 2856 | 145 | 76834 | 44.8 |
| ip | 1923.8 | 1235 | 135 | 21048 | 67.4 | 4035.8 | 2017 | 131 | 32795 | 50.5 | 4838 | 2676 | 131 | 33271 | 43.6 |
| pi | 1036.8 | 455 | 140 | 10937 | 85.8 | 1255.6 | 343 | 141 | 45769 | 83.4 | 1535.3 | 435 | 135 | 21125 | 79.9 |
| 2u | 1325.4 | 790 | 121 | 14703 | 80.8 | 2109.5 | 868 | 123 | 60655 | 68.9 | 2294.9 | 1138 | 125 | 23637 | 65.7 |
| up | 115.3 | 112 | 110 | 958 | 100 | 113.7 | 112 | 110 | 981 | 100 | 113.2 | 112 | 110 | 427 | 100 |
| 2in | 1169.1 | 548 | 123 | 18016 | 84.9 | 5264.7 | 1116 | 128 | 60281 | 61.8 | 3496 | 774 | 124 | 58032 | 71.6 |
| 3in | 4070.3 | 2230 | 159 | 28679 | 46.6 | 13695.8 | 8344 | 175 | 88561 | 25.9 | 12575.9 | 7061 | 164 | 88250 | 28.1 |
| inp | 629 | 112 | 110 | 73457 | 91.8 | 1949.4 | 394 | 110 | 115169 | 78.4 | 696.7 | 112 | 110 | 89660 | 93.8 |
| pin | 400.7 | 154 | 129 | 6802 | 95.8 | 1106.5 | 242 | 129 | 44010 | 87.2 | 418.1 | 131 | 129 | 24062 | 96.7 |
| pni | 345.9 | 129 | 127 | 7938 | 96.6 | 547.1 | 129 | 127 | 18057 | 95.1 | 289.3 | 129 | 127 | 17489 | 97.9 |
## Appendix E Query Token Distribution in Datasets
The quantitative details of query token lengths are provided in Table 9, and their complete distribution plots are provided in Figure 5. From the results, we observe that the distribution of token lengths is positively skewed for most query types, i.e., only a small fraction of samples have very high token lengths. Thus, small improvements in the LLMs' token limit can potentially lead to better coverage of most reasoning queries in standard KG datasets.
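The summary statistics in Table 9 can be reproduced from raw per-query token counts along these lines (a sketch; the helper name is ours, and the 4096-token limit is the Llama2 limit stated in the caption):

```python
from statistics import mean, median

def token_stats(token_counts, limit=4096):
    # Summarize query token lengths as in Table 9: mean, median, min,
    # max, and Cov = percentage of queries under the model's token limit.
    cov = 100 * sum(1 for n in token_counts if n < limit) / len(token_counts)
    return {
        "mean": mean(token_counts),
        "median": median(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "cov": round(cov, 1),
    }
```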
Figure 5: Probability distribution of the number of tokens in each query type. The figure contains 14 plots, one for each of the 14 query types. The x-axis and y-axis present the number of tokens in the query and their probability density, respectively.