# Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning
**Authors**: Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang
> Xingyu Tan (0009-0000-7232-7051), University of New South Wales; Data61, CSIRO, Sydney, Australia. xingyu.tan@unsw.edu.au
> Xiaoyang Wang (0000-0003-3554-3219), University of New South Wales, Sydney, Australia. xiaoyang.wang1@unsw.edu.au
> Qing Liu (0000-0001-7895-9551), Data61, CSIRO, Sydney, Australia. q.liu@data61.csiro.au
> Xiwei Xu (0000-0002-2273-1862), Data61, CSIRO, Sydney, Australia. xiwei.xu@data61.csiro.au
> Xin Yuan (0000-0002-9167-1613), Data61, CSIRO, Sydney, Australia. xin.yuan@data61.csiro.au
> Wenjie Zhang (0000-0001-6572-2600), University of New South Wales, Sydney, Australia. wenjie.zhang@unsw.edu.au
Abstract.
Large Language Models (LLMs) have achieved impressive results on various tasks but struggle with hallucination and a lack of relevant knowledge, especially in deep, complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges such as handling multi-hop reasoning, answering multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and then applies an efficient three-step pruning technique that incorporates graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate that PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
**Keywords:** Large Language Models; Knowledge Graph; Knowledge Graph Question Answering; Retrieval-Augmented Generation

> WWW ’25: Proceedings of the ACM Web Conference 2025, April 28-May 2, 2025, Sydney, NSW, Australia. Copyright 2025, ACM licensed. DOI: 10.1145/3696410.3714892. ISBN: 979-8-4007-1274-6/25/04. CCS: Information systems → Question answering.
1. Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Question Answering Approaches
### Overview
The image presents a comparison of different approaches to answering the question: "What country bordering France contains an airport that serves Nijmegen?". It showcases four methods: GPT-3.5/GPT-4 LLM only, LLM empowered KG exploration search, LLM empowered KG subgraph answering, and PoG (Paths-over-Graph) reasoning paths. Each method's process and final answer are displayed, along with an indication of whether the answer is correct or incorrect.
### Components/Axes
The image is divided into four sections, labeled (a), (b), (c), and (d). Each section represents a different approach to answering the question.
* **Section (a): GPT-3.5/GPT-4 LLM only**
* Input: Question mark icon connected to a transformer icon (representing the LLM).
* Output: "Belgium" (incorrect answer) marked with a red "X" icon.
* Intermediate steps: A text box containing the chain of thought reasoning.
* **Section (b): LLM empowered KG exploration search**
* Input: "France" and "Nijmegen" boxes, a question mark icon, a knowledge graph icon, and a transformer icon.
* Output: "Netherlands" (incorrect answer) marked with a red "X" icon.
* Intermediate steps: A text box containing the explored triples and the answering process.
* **Section (c): LLM empowered KG subgraph answering**
* Input: Question mark icon, a knowledge graph icon, and a transformer icon.
* Output: "Refuses to answer," marked with a red "X" icon.
* Intermediate steps: A text box stating that the MindMap cannot prompt LLM to construct a graph.
* **Section (d): PoG (Paths-over-Graph) reasoning paths**
* Input: Question mark icon, a knowledge graph icon, and a transformer icon.
* Output: "Germany" (correct answer) marked with a green checkmark icon.
* Intermediate steps: A diagram showing the reasoning path exploration and pruning process, along with a text box containing the reasoning paths.
### Detailed Analysis
**Section (a): GPT-3.5/GPT-4 LLM only**
* **Question:** What country bordering France contains an airport that serves Nijmegen?
* **Model:** GPT-3.5/GPT-4 LLM only
* **Process:** Chain of Thoughts prompt. The model reasons that Nijmegen is served by airports in neighboring countries, with Brussels Airport (BRU) in Belgium being one of the closest.
* **Answer:** Belgium (Incorrect)
**Section (b): LLM empowered KG exploration search**
* **Input:** France, Nijmegen, KG Triples
* **Process:** Explores triples related to France and Nijmegen.
* Triples: [France, location.location.containedby, Europe], [France, location.location.containedby, Western Europe], [France, location.location.geolocation, Unnamed Entity], [Nijmegen, second_level_division, Netherlands]
* **Reasoning:** Nijmegen is a city in the Netherlands, and the Netherlands is a country bordering France.
* **Answer:** Netherlands (Incorrect)
**Section (c): LLM empowered KG subgraph answering**
* **Process:** Attempts to construct a graph and generate a graph description document.
* **Result:** Refuses to answer due to the retrieved subgraph being extremely large and dense.
**Section (d): PoG (Paths-over-Graph) reasoning paths**
* **Process:**
* Subgraph Detection: Multiple knowledge graph icons are combined.
* Question Analysis: A question mark icon is combined with a transformer icon.
* Reasoning Path Exploration: A list icon is processed by a transformer icon.
* Reasoning Path Pruning: A list icon is processed by a transformer icon.
* **Reasoning Paths:**
* Nijmegen → nearby airports → Weeze Airport → containedby → Germany → continent → Europe, Western Europe ← contain ← France
* Nijmegen → nearby airports → Weeze Airport → containedby → Germany → adjoin_s → Unnamed Entity ← adjoin_s ← France
* **Response:** From the provided knowledge graph path, the entity {Germany} is the country that contains an airport serving Nijmegen and is also the country bordering France. Therefore, the answer to the main question "What country bordering France contains an airport that serves Nijmegen?" is {Germany}.
* **Answer:** Germany (Correct)
### Key Observations
* The GPT-3.5/GPT-4 LLM only approach and the LLM empowered KG exploration search both provide incorrect answers.
* The LLM empowered KG subgraph answering approach fails to provide an answer due to the complexity of the subgraph.
* The PoG reasoning paths approach provides the correct answer by leveraging a structured reasoning process and knowledge graph information.
### Interpretation
The image demonstrates the varying effectiveness of different approaches to question answering, particularly when dealing with complex queries that require reasoning over knowledge graphs. The GPT-3.5/GPT-4 LLM only approach, while capable of generating fluent text, lacks the structured knowledge and reasoning capabilities to arrive at the correct answer. The LLM empowered KG exploration search improves upon this by incorporating knowledge graph information, but still falls short due to its limited reasoning capabilities. The LLM empowered KG subgraph answering approach highlights the challenges of dealing with large and dense knowledge graphs. The PoG reasoning paths approach, which combines knowledge graph information with a structured reasoning process, proves to be the most effective in this case, successfully identifying the correct answer. This suggests that combining the strengths of LLMs with structured knowledge and reasoning techniques is crucial for achieving accurate and reliable question answering.
</details>
Figure 1. Representative workflow of four LLM reasoning paradigms.
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks (Brown, 2020; Chowdhery et al., 2023; Touvron et al., 2023; Besta et al., 2024; Huang et al., 2025). These models leverage pre-training techniques by scaling to billions of parameters and training on extensive, diverse, and unlabelled data (Touvron et al., 2023; Rawte et al., 2023). Despite these impressive capabilities, LLMs face two well-known challenges. First, they struggle with deep and responsible reasoning when tackling complex tasks (Petroni et al., 2020; Talmor et al., 2018; Khot et al., 2022). Second, the substantial cost of training makes it difficult to keep models updated with the latest knowledge (Sun et al., 2024; Wen et al., 2024), leading to errors when answering questions that require specialized information not included in their training data. For example, in Figure 1 (a), though models like GPT can generate reasonable answers for knowledge-specific questions, these answers may be incorrect due to outdated information or hallucinated reasoning over the LLM's inherent Knowledge Base (KB).
To deal with the problems of erroneous reasoning and knowledge gaps, the plan-retrieval-answering method has been proposed (Luo et al., 2024; Zhao et al., 2023; Li et al., 2023b). In this approach, LLMs are prompted to decompose complex reasoning tasks into a series of sub-tasks, forming a plan. Simultaneously, external KBs are retrieved to answer each step of the plan. However, this method still relies heavily on the reasoning abilities of LLMs rather than the faithfulness of the retrieved knowledge. The generated reasoning steps guide information selection, but answers are chosen based on the LLM’s interpretation of the retrieved knowledge rather than on whether the selection leads to a correct and faithful answer.
To address these challenges, incorporating external knowledge sources like Knowledge Graphs (KGs) is a promising solution to enhance LLM reasoning (Sun et al., 2024; Luo et al., 2024; Pan et al., 2024; Luo et al., 2023). KGs offer abundant factual knowledge in a structured format, serving as a reliable source to improve LLM capabilities. Knowledge Graph Question Answering (KGQA) serves as an approach for evaluating the integration of KGs with LLMs, which requires machines to answer natural language questions by retrieving relevant facts from KGs. These approaches typically involve: (1) identifying the initial entities from the question, and (2) iteratively retrieving and refining inference paths until sufficient evidence has been obtained. Despite their success, they still face challenges such as handling multi-hop reasoning problems, addressing questions with multiple topic entities, and effectively utilizing the structural information of graphs.
Challenge 1: Multi-hop reasoning problem. Current methods (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024), such as the ToG model presented in Figure 1 (b), begin by exploring from each topic entity, with LLMs selecting connected knowledge triples like (France, contained_by, Europe). This process relies on the LLM’s inherent understanding of these triples. However, focusing on one-hop neighbors can result in plausible but incorrect answers and prematurely exclude correct ones, especially when multi-hop reasoning is required. Additionally, multi-hop reasoning introduces significant computational overhead, making efficient pruning essential, especially in dense and large KGs.
Challenge 2: Multi-entity question. As shown in Figure 1 (b), existing work (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024) typically explores KG for each topic entity independently. When a question involves multiple entities, these entities are examined in separate steps without considering their interconnections. This approach can result in a large amount of irrelevant information in the candidate set that does not connect to the other entities in the question, leading to suboptimal results.
Challenge 3: Utilizing graph structure. Existing methods (Wen et al., 2024; Guo et al., 2023; Chen et al., 2024) often overlook the inherent graph structures when processing retrieved subgraphs. For example, the MindMap model in Figure 1 (c) utilizes LLMs to generate text-formatted subgraphs from KG triples, converting them into graph descriptions that are fed back into the LLM to produce answers. This textual approach overlooks the inherent structural information of graphs and can overwhelm the LLM when dealing with large graphs. Additionally, during KG information selection, most methods use in-context learning by feeding triples into the LLM, ignoring the overall graph structure.
Contributions. In this paper, we introduce a novel method, Paths-over-Graph (PoG). Unlike previous studies that utilize knowledge triples for retrieval (Sun et al., 2024; Ma et al., 2024), PoG employs knowledge reasoning paths, which contain all the topic entities along a long reasoning chain, as retrieval-augmented input for LLMs. The paths in KGs serve as logical reasoning chains, providing KG-supported, interpretable reasoning logic that addresses the lack of specific knowledge background and unfaithful reasoning paths.
To address the multi-hop reasoning problem, as shown in Figure 1 (d), PoG first performs question analysis to extract topic entities from the question. Utilizing these topic entities, it decomposes the complex question into sub-questions and generates an LLM thinking indicator termed "Planning". This planning not only serves as an answering strategy but also predicts the implied relationship depths between the answer and each topic entity. Multi-hop paths are then explored starting from a predicted depth, enabling a dynamic search process. Previous planning-based approaches usually retrieve information from scratch, which often confuses LLMs with source neighborhood-based semantic information. In contrast, our method ensures that LLMs follow accurate reasoning paths that directly lead to the answer.
To address multi-entity questions, PoG employs a three-phase exploration process to traverse reasoning paths from the retrieved question subgraph. All paths must contain all topic entities in the same order as they occur in the LLM thinking indicator. In terms of reasoning paths in KGs, all paths are inherently logical and faithful. Each path potentially contains one possible answer and serves as the interpretable reasoning logic. The exploration leverages the inherent knowledge of both LLM and KG.
To effectively utilize graph structure, PoG captures the question subgraph by expanding topic entities to their maximal-depth neighbors, applying graph clustering and reduction to reduce the graph search cost. In the path pruning phase, we select possible correct answers from numerous candidates. All explored paths undergo a three-step beam search pruning that integrates graph structures, LLM prompting, and a pre-trained language understanding model (e.g., BERT) to ensure effectiveness and efficiency. Additionally, inspired by the Graph of Thoughts (GoT) (Besta et al., 2024), to reduce LLM hallucination, PoG prompts LLMs to summarize the obtained top-$W_{\max}$ paths before evaluating the answer, where $W_{\max}$ is a user-defined maximum width in the path pruning phase. In summary, the advantages of PoG can be summarized as:
- Dynamic deep search: Guided by LLMs, PoG dynamically extracts multi-hop reasoning paths from KGs, enhancing LLM capabilities in complex knowledge-intensive tasks.
- Interpretable and faithful reasoning: By utilizing highly question-relevant knowledge paths, PoG improves the interpretability of LLM reasoning, enhancing the faithfulness and question-relatedness of LLM-generated content.
- Efficient pruning with graph structure integration: PoG incorporates efficient pruning techniques in both the KG and reasoning paths to reduce computational costs, mitigate LLM hallucinations caused by irrelevant noise, and effectively narrow down candidate answers.
- Flexibility and effectiveness: a) PoG is a plug-and-play framework that can be seamlessly applied to various LLMs and KGs. b) PoG allows frequent knowledge updates via the KG, avoiding the expensive and slow updates required for LLMs. c) PoG reduces LLM token usage by over 50% with only a ±2% difference in accuracy compared to the best-performing strategy. d) PoG achieves state-of-the-art results on all the tested KGQA datasets, outperforming the strong baseline ToG by an average of 18.9% accuracy using both GPT-3.5 and GPT-4. Notably, PoG with GPT-3.5 can outperform ToG with GPT-4 by up to 23.9%.
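The embedding-based step of the three-step pruning can be illustrated with a small sketch: rank candidate paths by semantic similarity to the question and keep only the top-$W_{\max}$. Here a toy bag-of-words embedding stands in for a model like SBERT, and all names (`prune_paths`, the example paths) are illustrative assumptions, not PoG's implementation.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use SBERT vectors.
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prune_paths(question, paths, w_max):
    # Keep the w_max paths most similar to the question.
    q = embed(question)
    return sorted(paths, key=lambda p: cosine(q, embed(p)), reverse=True)[:w_max]

question = "What country bordering France contains an airport that serves Nijmegen?"
paths = [
    "Nijmegen nearby_airports Weeze Airport containedby Germany adjoin_s France",
    "France user.topics Wired user.topics Ryanair",
    "Nijmegen containedby Kingdom of the Netherlands",
]
top = prune_paths(question, paths, w_max=2)  # the airport path ranks first
```

In PoG this similarity ranking is only one of three pruning steps; graph structure and LLM prompting handle the cases a purely lexical or embedding score would miss.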
2. Related Work
KG-based LLM reasoning. KGs provide structured knowledge valuable for integration with LLMs (Pan et al., 2024). Early studies (Peters et al., 2019; Luo et al., 2024; Zhang et al., 2021; Li et al., 2023b) embed KG knowledge into neural networks during pre-training or fine-tuning, but this can reduce explainability and hinder efficient knowledge updating (Pan et al., 2024). Recent methods combine KGs with LLMs by converting relevant knowledge into textual prompts, often ignoring structural information (Pan et al., 2024; Wen et al., 2024). Advanced works (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) involve LLMs directly exploring KGs, starting from an initial entity and iteratively retrieving and refining reasoning paths until the LLM decides the augmented knowledge is sufficient. However, by starting from a single vertex and ignoring the question’s position within the KG’s structure, these methods overlook multiple topic entities and the explainability provided by multi-entity paths.
Reasoning with LLM prompting. LLMs have shown significant potential in solving complex tasks through effective prompting strategies. Chain of Thought (CoT) prompting (Wei et al., 2022) enhances reasoning by following logical steps in few-shot learning. Extensions like Auto-CoT (Zhang et al., 2023), Complex-CoT (Fu et al., 2022), CoT-SC (Wang et al., 2022), Zero-Shot CoT (Kojima et al., 2022), ToT (Yao et al., 2024), and GoT (Besta et al., 2024) build upon this approach. However, these methods often rely solely on knowledge present in training data, limiting their ability to handle knowledge-intensive or deep reasoning tasks. To solve this, some studies integrate external KBs using plan-and-retrieval methods such as CoK (Li et al., 2023b), RoG (Luo et al., 2024), and ReAct (Yao et al., 2022), decomposing complex questions into subtasks to reduce hallucinations. However, they may focus on the initial steps of sub-problems and overlook further steps of final answers, leading to locally optimal solutions instead of globally optimal ones. To address these deep reasoning challenges, we introduce dynamic multi-hop question reasoning. By adaptively determining reasoning depths for different questions, we enable the model to handle varying complexities effectively.
KG information pruning. Graphs are widely used to model complex relationships among different entities (Tan et al., 2023; Sima et al., 2024; Li et al., 2024b, a). KGs contain vast amounts of facts (Guo et al., 2019; Wu et al., 2024; Wang et al., 2024a), making it impractical to include all relevant triples in the context of the LLM due to high costs and potential noise (Wang et al., 2024b). Existing methods (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) typically identify initial entities and iteratively retrieve reasoning paths until an answer is reached, often treating the LLM as a function executor and relying on in-context learning or fine-tuning, which is expensive. Some works attempt to reduce pruning costs. KAPING (Baek et al., 2023a) projects questions and triples into the same semantic space to retrieve relevant knowledge via similarity measures. KG-GPT (Kim et al., 2023) decomposes complex questions and matches and selects relevant relations for each sub-question to form evidence triples. However, these methods often overlook the overall graph structure and the interrelations among multiple topic entities, leading to suboptimal pruning and reasoning performance.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Knowledge Graph and Question Answering Diagram
### Overview
The image is a diagram illustrating a knowledge graph and a question-answering process. It depicts how a question is processed through various stages, including initialization, exploration, path pruning, and question answering, using a knowledge graph to find the answer.
### Components/Axes
**1. Knowledge Graph (Left Side):**
* Nodes: Represent entities (e.g., "Nijmegen," "Weeze Airport," "France," "Germany," "Kingdom of the Netherlands," "Europe, Western Europe," "Central European Time Zone," "Olympics," "Ryanair," "Wired," "Lyon-Saint Exupéry Airport").
* Edges: Represent relationships between entities (e.g., "nearby airports," "containedby," "country," "user.topics," "adjoin_s," "in_this_time_zone").
* Node Colors:
* Red: "Nijmegen," "Germany," "France"
* Blue: "Weeze Airport," "Public airport," "Europe, Western Europe"
* Gray: "Kingdom of the Netherlands," "Veghel, Strijen, Rhenen, Oostzaan," "Central European Time Zone," "Olympics," "Unnamed Entity"
**2. Question Answering Process (Right Side):**
* **Initialization:**
* Question: "What country bordering France contains an airport that serves Nijmegen?"
* Topic Entity Recognition
* Question Subgraph Detection
* Split Questions, LLM indicator, Ordered Entities
* **Exploration:**
* Topic Entity Path Exploration
* LLM Supplement Path Exploration
* Node Expand Exploration
* **Path Pruning:**
* Fuzzy Selection: Indicator (H_I), Paths_Set (H_Path)
* Precise Path Selection
* Branch Reduced Selection
* **Question Answering:**
* Path Summarizing
* Answer (Yes/No)
### Detailed Analysis
**Knowledge Graph Details:**
* "Nijmegen" is connected to "Weeze Airport" via "nearby airports."
* "Weeze Airport" is connected to "Public airport" via "airport type."
* "Weeze Airport" is "containedby" "Germany."
* "Nijmegen" is "location.administrative_division, containedby" "Kingdom of the Netherlands."
* "Kingdom of the Netherlands" is "country" connected to "Veghel, Strijen, Rhenen, Oostzaan."
* "Germany" is "continent" connected to "Europe, Western Europe."
* "Europe, Western Europe" is "in_this_time_zone" connected to "Central European Time Zone."
* "France" is connected to "Lyon-Saint Exupéry Airport" via "containedby."
* "France" is connected to "Europe, Western Europe" via "contain."
* "France" is connected to "Wired" via "user.topics."
* "France" is connected to "Olympics" via "participating countries."
* "Wired" is connected to "Ryanair" via "user.topics."
* "Olympics" is connected to "Unnamed Entity" via "olympic. athletes."
* "Unnamed Entity" is connected to "Olympics" via "athlete. affiliation."
* "Germany" is connected to "Unnamed Entity" via "adjoin_s."
* "France" is connected to "Unnamed Entity" via "adjoin_s."
**Question Answering Process Details:**
* The question "What country bordering France contains an airport that serves Nijmegen?" initiates the process.
* The question is processed through Topic Entity Recognition and Question Subgraph Detection.
* The question is split into smaller parts, and LLM indicators and ordered entities are identified.
* The Exploration phase involves Topic Entity Path Exploration, LLM Supplement Path Exploration, and Node Expand Exploration.
* Path Pruning involves Fuzzy Selection, Precise Path Selection, and Branch Reduced Selection.
* Path Summarizing leads to the final answer (Yes/No).
### Key Observations
* The diagram illustrates a complex system for question answering using a knowledge graph.
* The knowledge graph contains various entities and their relationships.
* The question-answering process involves multiple stages of exploration and pruning.
* The use of LLM (Large Language Model) is integrated into the process.
### Interpretation
The diagram demonstrates a sophisticated approach to question answering that leverages a knowledge graph and LLMs. The system aims to find relevant information within the knowledge graph to answer complex questions. The process involves breaking down the question, exploring potential paths within the graph, pruning irrelevant paths, and summarizing the remaining paths to arrive at an answer. The integration of LLMs suggests the use of natural language processing techniques to enhance the accuracy and efficiency of the question-answering process. The diagram highlights the importance of structured knowledge representation and advanced algorithms in building intelligent systems that can understand and respond to human queries.
</details>
Figure 2. Overview of the PoG architecture. Exploration: After initialization (detailed in Figure 3), the model retrieves entity paths from $\mathcal{G}_{q}$ through three exploration phases. Path Pruning: PoG applies a three-step beam search to prune paths after each exploration phase. Question Answering: The pruned paths are then evaluated for question answering. If these paths do not fully answer the question, the model explores deeper paths until $D_{max}$ is reached or moves on to the next exploration phase.
3. Preliminary
Consider a Knowledge Graph (KG) $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T})$, where $\mathcal{E}$, $\mathcal{R}$ and $\mathcal{T}$ represent the set of entities, relations, and knowledge triples, respectively. Each knowledge triple $T\in\mathcal{T}$ encapsulates the factual knowledge in $\mathcal{G}$, and is represented as $T=(e_{h},r,e_{t})$, where $e_{h},e_{t}\in\mathcal{E}$ and $r\in\mathcal{R}$. Given an entity set $\mathcal{E}_{S}\subseteq\mathcal{E}$, the induced subgraph of $\mathcal{E}_{S}$ is denoted as $\mathcal{S}=(\mathcal{E}_{S},\mathcal{R}_{S},\mathcal{T}_{S})$, where $\mathcal{T}_{S}=\{(e,r,e^{\prime})\in\mathcal{T}\mid e,e^{\prime}\in\mathcal{E}_{S}\}$ and $\mathcal{R}_{S}=\{r\in\mathcal{R}\mid(e,r,e^{\prime})\in\mathcal{T}_{S}\}$. Furthermore, we denote $\mathcal{D}(e)$ and $\mathcal{D}(r)$ as the sets of short textual descriptions for each entity $e\in\mathcal{E}$ and each relation $r\in\mathcal{R}$, respectively. For example, the text description of the entity “m.0f8l9c” is $\mathcal{D}(\text{“m.0f8l9c”})=$ “France”. For simplicity, in this paper, all entities and relations are referenced through their $\mathcal{D}$ representations and transformed into natural language.
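As a quick illustration of the induced subgraph $\mathcal{S}$ defined above, the following sketch filters a triple list down to an entity subset; the triples and function name are illustrative examples, not the datasets or code used in the paper.

```python
def induced_subgraph(triples, entity_subset):
    # T_S: triples whose head and tail both lie in the entity subset E_S.
    e_s = set(entity_subset)
    t_s = [(h, r, t) for (h, r, t) in triples if h in e_s and t in e_s]
    # R_S: relations appearing in T_S.
    r_s = {r for (_, r, _) in t_s}
    return e_s, r_s, t_s

triples = [
    ("Nijmegen", "nearby_airports", "Weeze Airport"),
    ("Weeze Airport", "containedby", "Germany"),
    ("France", "adjoin_s", "Germany"),
    ("France", "capital", "Paris"),
]
E_S, R_S, T_S = induced_subgraph(
    triples, {"Nijmegen", "Weeze Airport", "Germany", "France"}
)
# The "capital" triple is dropped: "Paris" lies outside the entity subset.
```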
**Definition 1 (Reasoning Path).**
*Given a KG $\mathcal{G}$, a reasoning path within $\mathcal{G}$ is defined as a connected sequence of knowledge triples, represented as $path_{\mathcal{G}}(e_{1},e_{l+1})=\{T_{1},T_{2},\dots,T_{l}\}=\{(e_{1},r_{1},e_{2}),(e_{2},r_{2},e_{3}),\dots,(e_{l},r_{l},e_{l+1})\}$, where $T_{i}\in\mathcal{T}$ denotes the $i$-th triple in the path and $l$ denotes the length of the path, i.e., $length(path_{\mathcal{G}}(e_{1},e_{l+1}))=l$.*
**Example 1.**
*Consider a reasoning path between “University” and “Student” in a KG: $path_{\mathcal{G}}(\text{University},\text{Student})=\{(\text{University},\text{employs},\text{Professor}),$ $(\text{Professor},\text{teaches},\text{Course}),(\text{Course},\text{enrolled\_in},\text{Student})\}$, which can be visualized as:*
$$
\text{University}\xrightarrow{\text{employs}}\text{Professor}\xrightarrow{\text{teaches}}\text{Course}\xrightarrow{\text{enrolled\_in}}\text{Student}.
$$
*It indicates that a “University” employs a “Professor,” who teaches a “Course,” in which a “Student” is enrolled. The length of the path is 3.*
For any entities $s$ and $t$ in $\mathcal{G}$, if there exists a reasoning path between $s$ and $t$, we say $s$ and $t$ can reach each other, denoted as $s\leftrightarrow t$. The distance between $s$ and $t$ in $\mathcal{G}$, denoted as $dist_{\mathcal{G}}(s,t)$, is the length of the shortest reasoning path between $s$ and $t$. For non-reachable vertices, the distance is infinite. Given a positive integer $h$, the $h$-hop neighbors of an entity $s$ in $\mathcal{G}$ are defined as $N_{\mathcal{G}}(s,h)=\{t\in\mathcal{E}\mid dist_{\mathcal{G}}(s,t)\leq h\}$.
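Treating each triple as an undirected edge, the $h$-hop neighborhood $N_{\mathcal{G}}(s,h)$ can be computed with a breadth-first search. The function and toy triples below are an illustrative sketch, not the paper's implementation.

```python
from collections import deque

def h_hop_neighbors(triples, s, h):
    # BFS over triples viewed as undirected edges; returns all entities
    # within distance h of s (including s itself, since dist(s, s) = 0).
    adj = {}
    for head, _, tail in triples:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # do not expand beyond depth h
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

triples = [
    ("Nijmegen", "nearby_airports", "Weeze Airport"),
    ("Weeze Airport", "containedby", "Germany"),
    ("Germany", "adjoin_s", "France"),
]
neighbors = h_hop_neighbors(triples, "Nijmegen", 2)
# Germany is reachable in 2 hops; France (3 hops away) is excluded.
```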
**Definition 2 (Entity Path).**
*Given a KG $\mathcal{G}$ and a list of entities $list_{e}=[e_{1},e_{2},e_{3},\dots,e_{l}]$, the entity path of $list_{e}$ is defined as a connected sequence of reasoning paths, denoted as $path_{\mathcal{G}}(list_{e})=\{path_{\mathcal{G}}(e_{1},e_{2}),path_{\mathcal{G}}(e_{2},e_{3}),\dots,path_{\mathcal{G}}(e_{l-1},e_{l})\}=\{(e_{s},r,e_{t})\mid(e_{s},r,e_{t})\in path_{\mathcal{G}}(e_{i},e_{i+1})\land 1\leq i<l\}$.*
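The entity-path definition above can be made concrete by chaining shortest reasoning paths between consecutive entities in the list. The helpers below are an illustrative sketch under simplifying assumptions (edges are traversed undirected, so triple orientation follows the traversal), not the paper's implementation.

```python
from collections import deque

def shortest_reasoning_path(triples, s, t):
    # BFS over triples as undirected edges; returns a shortest triple chain.
    adj = {}
    for h, r, tl in triples:
        adj.setdefault(h, []).append((r, tl))
        adj.setdefault(tl, []).append((r, h))
    prev = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for r, v in adj.get(u, ()):
            if v not in prev:
                prev[v] = (u, r)
                queue.append(v)
    if t not in prev:
        return None
    path, node = [], t
    while prev[node] is not None:
        u, r = prev[node]
        path.append((u, r, node))
        node = u
    return path[::-1]

def entity_path(triples, entity_list):
    # Concatenate reasoning paths between each consecutive entity pair.
    segments = [shortest_reasoning_path(triples, a, b)
                for a, b in zip(entity_list, entity_list[1:])]
    if any(seg is None for seg in segments):
        return None  # some pair is not reachable
    return [triple for seg in segments for triple in seg]

triples = [
    ("Nijmegen", "nearby_airports", "Weeze Airport"),
    ("Weeze Airport", "containedby", "Germany"),
    ("France", "adjoin_s", "Germany"),
]
p = entity_path(triples, ["Nijmegen", "Germany", "France"])
```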
Knowledge Graph Question Answering (KGQA) is a fundamental reasoning task based on KGs. Given a natural language question $q$ and a KG $\mathcal{G}$, the objective is to devise a function $f$ that predicts answers $a\in Answer(q)$ utilizing knowledge encapsulated in $\mathcal{G}$, i.e., $a=f(q,\mathcal{G})$. Consistent with previous research (Sun et al., 2019; Luo et al., 2024; Sun et al., 2024; Ma et al., 2024), we assume the topic entities $Topic(q)$ mentioned in $q$ and the answer entities $Answer(q)$ in the ground truth are linked to the corresponding entities in $\mathcal{G}$, i.e., $Topic(q)\subseteq\mathcal{E}$ and $Answer(q)\subseteq\mathcal{E}$.
4. Method
PoG implements “KG-based LLM Reasoning” by first exploring all possible faithful reasoning paths and then collaborating with the LLM to perform a three-step beam search selection on the retrieved paths. Compared to previous approaches (Sun et al., 2024; Ma et al., 2024), our model focuses on providing more accurate and question-relevant retrieval-augmented graph information. The framework of PoG is outlined in Figure 2, comprising four main components.
- Initialization. The process begins by identifying the set of topic entities from the question input, and then queries the source KG $\mathcal{G}$ by exploring up to $D_{\max}$ hops from each topic entity to construct the evidence subgraph $\mathcal{G}_{q}$, where $D_{\max}$ is the user-defined maximum exploration depth. Subsequently, we prompt the LLM to analyze the question and generate an indicator that serves as a strategy for the answer formulation process and predicts the exploration depth $D_{\text{predict}}$.
- Exploration. After initialization, the model retrieves topic entity paths from $\mathcal{G}_{q}$ through three exploration phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. All reasoning paths are constrained within the depth range $D∈[D_{\text{predict}},D_{\max}]$ .
- Path Pruning. Following each exploration phase, PoG employs a pre-trained LM, LLM prompting, and graph structural analysis to perform a three-step beam search. The pruned paths are then evaluated in the question answering.
- Question Answering. Finally, the LLM is prompted to assess whether the pruned reasoning paths sufficiently answer the question. If not, the model incrementally explores deeper paths until $D_{\max}$ is exceeded, or proceeds to the next exploration phase.
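The four components above can be sketched as a control loop. Every helper here is a simplified, deterministic stand-in with hypothetical names; in PoG itself, question analysis, path pruning, and answer evaluation involve LLM prompting rather than token overlap.

```python
def paths_between(kg, s, t, max_len):
    # Enumerate reasoning paths from s to t with at most max_len triples,
    # treating triples as undirected edges and forbidding entity revisits.
    adj = {}
    for h, r, tl in kg:
        adj.setdefault(h, []).append((r, tl))
        adj.setdefault(tl, []).append((r, h))
    found, stack = [], [(s, [])]
    while stack:
        node, path = stack.pop()
        if node == t and path:
            found.append(path)
            continue
        if len(path) == max_len:
            continue
        for r, v in adj.get(node, ()):
            if all(v not in (a, b) for a, _, b in path):
                stack.append((v, path + [(node, r, v)]))
    return found

def prune(paths, question, w_max=2):
    # Stand-in for the three-step pruning: score paths by token overlap.
    q = set(question.lower().replace("?", "").split())
    score = lambda p: len(q & {w.lower() for tr in p for w in (tr[0], tr[2])})
    return sorted(paths, key=score, reverse=True)[:w_max]

def pog_loop(question, kg, topics, d_predict, d_max):
    # Dynamic deepening: start at the predicted depth, stop at D_max.
    for depth in range(d_predict, d_max + 1):
        candidates = paths_between(kg, topics[0], topics[1], depth)
        best = prune(candidates, question)
        if best:  # stand-in for "the LLM judges the evidence sufficient"
            return best[0]
    return None

kg = [
    ("Nijmegen", "nearby_airports", "Weeze Airport"),
    ("Weeze Airport", "containedby", "Germany"),
    ("Germany", "adjoin_s", "France"),
    ("Nijmegen", "containedby", "Netherlands"),
]
question = "What country bordering France contains an airport that serves Nijmegen?"
evidence = pog_loop(question, kg, ["Nijmegen", "France"], d_predict=2, d_max=3)
# The surviving path passes through Germany, the correct answer.
```

Note how no path of depth 2 connects both topic entities, so the loop deepens to 3 before returning evidence, mirroring the dynamic search described above.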
Figure 3. Overview of the initialization phase. Output 1: from the input question, the model identifies topic entities and prompts the LLM to decompose questions into split questions $q_{split}$ and generate an indicator $I_{LLM}$ . The indicator outlines a strategy for formulating the answer and predicts the exploration depth $D_{predict}$ . Output 2: the model queries the source KG up to $D_{max}$ -hop from identified topic entities, constructing and pruning the evidence subgraph $\mathcal{G}_{q}$ .
4.1. Initialization
The initialization has two main stages, i.e., question subgraph detection and question analysis. The framework is shown in Figure 3.
Question subgraph detection. Given a question $q$ , PoG initially identifies the question subgraph, which includes all the topic entities of $q$ and their $D_{\max}$ -hop neighbors.
Topic entity recognition. To identify the relevant subgraph, PoG first employs LLMs to extract potential topic entities from the question. Following identification, the process applies BERT-based similarity matching to align these candidate entities with entities in the KG. Specifically, as shown in Figure 3, we encode both the keywords and all KG entities into dense vector embeddings, denoted $H_{T}$ and $H_{\mathcal{G}}$ . We then compute a cosine similarity matrix between these embeddings to determine the matches. For each keyword, the entities with the highest similarity scores are selected to form the set $Topic(q)$ . This set serves as the foundation for constructing the question subgraph in subsequent steps.
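As a minimal sketch of this matching step, toy 3-dimensional vectors stand in for the SBERT embeddings $H_T$ and $H_{\mathcal{G}}$; the entity names and vectors below are purely illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_topic_entities(keyword_emb, entity_emb):
    """For each extracted keyword, pick the KG entity with the
    highest cosine similarity, forming Topic(q)."""
    topic = {}
    for kw, h_t in keyword_emb.items():
        topic[kw] = max(entity_emb, key=lambda e: cosine(h_t, entity_emb[e]))
    return topic

# Toy embeddings standing in for SBERT vectors (illustrative only).
keywords = {"France": [0.9, 0.1, 0.0], "Nijmegen": [0.0, 0.8, 0.2]}
entities = {"m.0f8l9c (France)": [0.95, 0.05, 0.0],
            "m.05g2b (Nijmegen)": [0.05, 0.9, 0.1],
            "m.02j71 (Earth)": [0.3, 0.3, 0.3]}
topic = match_topic_entities(keywords, entities)
print(topic)  # → {'France': 'm.0f8l9c (France)', 'Nijmegen': 'm.05g2b (Nijmegen)'}
```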
Subgraph detection. Upon identifying the topic entities, PoG captures the induced subgraph $\mathcal{G}_{q}\subseteq\mathcal{G}$ by expanding around each entity $e$ in $Topic(q)$ . For each entity, we retrieve knowledge triples associated with its $D_{\max}$ -hop neighbors, thereby incorporating query-relevant and faithful KG information into $\mathcal{G}_{q}$ . Through this process, we update $\mathcal{E}_{q}$ with newly added intermediate nodes that serve as bridging pathways between the topic entities. The resulting subgraph $\mathcal{G}_{q}$ is defined as $(\mathcal{E}_{q},\mathcal{R}_{q},\mathcal{T}_{q})$ , where $\mathcal{E}_{q}$ encompasses $Topic(q)$ together with the set $\{N_{\mathcal{G}}(e,D_{\max})\mid e∈ Topic(q)\}$ , effectively linking all relevant entities and their connective paths within the defined hop distance. To interact with the KG, we utilize the pre-defined SPARQL queries as detailed in Appendix D.
Graph pruning. To efficiently manage information overhead and reduce computational cost, we implement graph pruning on the question subgraph $\mathcal{G}_{q}$ using node and relation clustering alongside graph reduction techniques. As illustrated in Figure 3, node and relation clustering is achieved by compressing multiple nodes and their relations into supernodes, which aggregate information from the original entities and connections. For graph reduction, we employ bidirectional BFS to identify all paths connecting the topic entities. Based on these paths, we regenerate induced subgraphs that involve only the relevant connections, effectively excluding nodes and relations that lack strong relevance to the topic entities.
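The graph-reduction idea can be sketched as follows; a plain BFS stands in for the paper's bidirectional variant, and the toy adjacency list is purely illustrative:

```python
from collections import deque
from itertools import combinations

def paths_within(adj, src, dst, d_max):
    """Enumerate simple paths with at most d_max hops between two
    topic entities (plain BFS as a stand-in for bidirectional BFS)."""
    found, queue = [], deque([(src, [src])])
    while queue:
        node, path = queue.popleft()
        if node == dst and len(path) > 1:
            found.append(path)
            continue
        if len(path) - 1 >= d_max:      # already used d_max hops
            continue
        for nxt in adj.get(node, []):
            if nxt not in path:         # keep paths simple
                queue.append((nxt, path + [nxt]))
    return found

def reduce_graph(adj, topics, d_max):
    """Keep only nodes lying on some topic-to-topic path,
    regenerating the induced subgraph over those nodes."""
    keep = set(topics)
    for a, b in combinations(topics, 2):
        for p in paths_within(adj, a, b, d_max):
            keep.update(p)
    return {u: [v for v in adj.get(u, []) if v in keep] for u in keep}

# Toy undirected graph (stored both ways); Belgium is irrelevant noise.
adj = {
    "France": ["Germany", "Belgium"],
    "Germany": ["France", "Weeze Airport"],
    "Belgium": ["France"],
    "Weeze Airport": ["Germany", "Nijmegen"],
    "Nijmegen": ["Weeze Airport"],
}
g = reduce_graph(adj, ["France", "Nijmegen"], 3)
```

Here `Belgium` lies on no France-to-Nijmegen path within 3 hops, so it is pruned from the reduced subgraph.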
Question analysis. To reduce hallucinations in LLMs, the question analysis phase is divided into two parts and executed within a single LLM call using an example-based prompt (shown in Appendix E). First, the complex question $q$ is decomposed into simpler questions based on the identified topic entities, each addressing their relationship to the potential answer. Addressing these simpler questions collectively guides the LLM to better answer the original query, thereby reducing hallucinations. Second, an LLM indicator is generated, encapsulating all topic entities and predicting the answer position within a single chain of thought derived from the original question. This indicator highlights the relationships and sequence among the entities and the answer. Based on this, a predicted depth $D_{\text{predict}}$ is calculated, defined as the maximum distance between the predicted answer and each topic entity. An example of question analysis is shown in Figure 3 with predicted depth 2.
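The depth prediction can be illustrated with a small sketch, assuming the indicator is represented as an ordered chain of entities with an explicit answer slot (this flat-chain representation is our assumption, not necessarily the exact indicator format):

```python
def predict_depth(indicator_chain, topics, answer="answer"):
    """D_predict = maximum hop distance between the answer slot and
    each topic entity along the LLM indicator's chain of thought."""
    pos = {e: i for i, e in enumerate(indicator_chain)}
    return max(abs(pos[answer] - pos[t]) for t in topics)

# Chain mirroring Figure 3: Nijmegen <- airport <- answer -> France.
chain = ["Nijmegen", "airport", "answer", "France"]
d_predict = predict_depth(chain, ["Nijmegen", "France"])
print(d_predict)  # → 2 (Nijmegen is 2 hops away, France is 1)
```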
4.2. Exploration
As discussed in Section 1, identifying reasoning paths that encompass all topic entities is essential to derive accurate answers. These paths serve as interpretable chains of thought, providing both the answer and the inference steps leading to it, a feature we refer to as interpretability. To discover such paths efficiently and accurately, the exploration process is divided into three phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. After each phase, we perform path pruning and question answering. If a sufficient path is found, the process terminates; otherwise, it advances to the next phase to explore additional paths. Due to space limitations, the pseudo-code of the exploration procedure is shown in Appendix A.1.
Topic entity path exploration. To reduce LLM usage and search space, PoG begins exploration from a predicted depth $D_{\text{predict}}$ rather than the maximum depth. Using the question subgraph $\mathcal{G}_{q}$ , topic entities $Topic(q)$ , LLM indicator $I_{\text{LLM}}$ , and $D_{\text{predict}}$ , PoG identifies reasoning paths containing all topic entities by iteratively adjusting the exploration depth $D$ . Entities in $Topic(q)$ are ordered according to $I_{\text{LLM}}$ to facilitate effective reasoning. Starting from the predicted depth $D=\min(D_{\text{predict}},D_{\max})$ , we employ a bidirectional BFS to derive all potential entity paths, defined as:
$$
Paths_{t}=\{p\mid|Topic(q)|\times(D-1)<length(p)\leq|Topic(q)|\times D\},
$$
where $p=Path_{\mathcal{G}_{q}}(Topic(q))$ . To reduce complexity, a pruning strategy is employed to select the top- $W_{\max}$ paths based on $Paths_{t}$ , $I_{\text{LLM}}$ , and the split questions from Section 4.1. These paths are then evaluated for sufficiency. If they are inadequate, $D$ is incremented until $D_{\max}$ is reached, after which the next phase commences.
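A skeleton of this phase can be sketched as below, where `explore`, `prune`, and `sufficient` are stand-ins for subgraph path enumeration, path pruning (Section 4.3), and the LLM sufficiency check:

```python
def topic_entity_exploration(explore, prune, sufficient, d_predict, d_max):
    """Iterative-deepening skeleton of the first exploration phase:
    start at the predicted depth, prune the candidate paths, ask
    whether they suffice, and deepen until D_max before handing
    over to the next phase."""
    d = min(d_predict, d_max)
    while d <= d_max:
        candidates = prune(explore(d))
        if sufficient(candidates):
            return candidates          # answer derivable at depth d
        d += 1                         # deepen and retry
    return None                        # trigger LLM supplement phase

# Demo with stub components: the "LLM" deems depth-3 paths sufficient.
result = topic_entity_exploration(
    explore=lambda d: [("path", d)],
    prune=lambda ps: ps,
    sufficient=lambda ps: ps[0][1] == 3,
    d_predict=2, d_max=3,
)
```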
LLM supplement path exploration. Traditional KG-based LLM reasoning often rephrases KG facts without utilizing the LLM’s inherent knowledge. To overcome this, PoG prompts the LLM to generate predictions based on path understanding and its implicit knowledge, providing additional relevant insights. This involves generating new LLM thinking indicators $I_{\text{Sup}}$ for predicted entities $e∈ Predict(q)$ , and then using text similarity to verify and align them with entities in $\mathcal{E}_{q}$ . The supplementary entity list $List_{S}(e)=Topic(q)+e$ is built and ranked by $I_{\text{Sup}}$ to facilitate effective reasoning. Next, supplementary paths $Paths_{s}$ are derived from $List_{S}(e)$ in the evidence KG $\mathcal{G}_{q}$ with a fixed depth $D_{\max}$ :
$$
Paths_{s}=\{p\mid\text{length}(p)\leq|Topic(q)|\times D_{\max}\},
$$
where $p=Path_{\mathcal{G}_{q}}(List_{S}(e))$ . These paths, with the new indicators, are evaluated in the same way as in the topic entity path exploration phase. The prompting template is shown in Appendix E.
Node expand exploration. If the previous phases cannot yield sufficient paths, PoG proceeds to node expansion. Unlike previous methods (Sun et al., 2024; Ma et al., 2024) that explore relations and entities separately, PoG explores both simultaneously, leveraging clearer semantic information for easier integration with existing paths. During this exploration, PoG expands unvisited entities by their 1-hop neighbors in $\mathcal{G}$ . New triples are merged into existing paths to form new candidate paths, followed by pruning and evaluation.
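A minimal sketch of the merge step, assuming paths are represented as lists of (head, relation, tail) triples (an assumed representation, not the paper's exact data structure):

```python
def expand_paths(paths, triples):
    """Extend each candidate path with any 1-hop triple whose head
    matches the path's frontier (last tail) entity, so relation and
    entity are explored in a single step."""
    new_paths = []
    for path in paths:
        frontier = path[-1][2]               # tail entity of the last triple
        for h, r, t in triples:
            if h == frontier:
                new_paths.append(path + [(h, r, t)])
    return new_paths

# Demo: one existing path, two freshly retrieved 1-hop triples.
paths = [[("France", "borders", "Germany")]]
triples = [("Germany", "contains", "Weeze Airport"),
           ("Belgium", "borders", "France")]
merged = expand_paths(paths, triples)
```

Only the triple anchored at the frontier entity `Germany` is merged; the `Belgium` triple does not extend any existing path.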
4.3. Path Pruning
As introduced in Section 2, KGs contain vast amounts of facts, making it impractical to involve all relevant triples in the LLM’s context due to high costs. To address this complexity and reduce LLM overhead, we utilize a three-step beam search for path pruning. The corresponding pseudo-code can be found in Appendix A.2.
Fuzzy selection. Since only a small subset of the generated paths is relevant, the initial step of our beam search performs fuzzy selection using a pre-trained language model (e.g., SentenceBERT (Reimers and Gurevych, 2019)) to quickly filter out irrelevant paths. As shown in Figure 2, we encode the LLM indicator $I_{\text{LLM}}$ (or $I_{\text{Sup}}$ ) and all reasoning paths into vector embeddings, denoted as $H_{I}$ and $H_{Paths}$ , and calculate cosine similarities between them. The top- $W_{1}$ paths with the highest similarity scores are selected for further evaluation.
Precise path selection. Following the initial fuzzy selection, the number of candidate paths is reduced to $W_{1}$ . At this stage, we prompt the LLM to select the top- $W_{\max}$ reasoning paths most likely to contain the correct answer. The specific prompt used to guide the LLM in this selection phase can be found in Appendix E.
Branch reduced selection. Since paths are often represented in natural language and can be lengthy, leading to high processing costs for LLMs, we implement a branch reduced selection method integrated with the graph structure. This method balances efficiency and accuracy by further refining path selection. Starting with $D=1$ , for each entity $e$ in the entity list, we extract the initial $D$ -step paths from every path in the candidate set $Paths_{c}$ into a new set $Paths_{e}$ . If the number of paths in $Paths_{e}$ exceeds the maximum designated width $W_{\max}$ , these paths are pruned using precise path selection. The process iterates, incrementing $D$ , until $D$ reaches $D_{\max}$ . For example, as illustrated in Figure 2, with $W_{\max}=1$ , only the initial-step paths (depicted in green) are extracted for further examination, while paths represented by dashed lines are pruned. This method enables efficient iterative selection by limiting the number of tokens while ensuring the relevance and conciseness of the reasoning paths.
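Under the assumption that paths are sequences of steps and `pick_top` abstracts the LLM-driven precise path selection, the procedure can be sketched as:

```python
def branch_reduced_select(paths, w_max, pick_top, d_max):
    """Prune long candidate paths step by step: at each depth D,
    truncate every path to its first D steps and, whenever more than
    w_max distinct prefixes remain, keep only those chosen by
    pick_top (a stand-in for precise path selection), restoring the
    surviving full paths before deepening."""
    survivors = paths
    for d in range(1, d_max + 1):
        prefixes = {tuple(p[:d]) for p in survivors}
        if len(prefixes) > w_max:
            kept = set(pick_top(sorted(prefixes), w_max))
            survivors = [p for p in survivors if tuple(p[:d]) in kept]
    return survivors

# Demo: with w_max = 1, only one branch survives each round
# (pick_top here just keeps the lexicographically first prefixes).
paths = [["a", "x"], ["a", "y"], ["b", "z"]]
kept = branch_reduced_select(paths, 1, lambda ps, k: ps[:k], 2)
```

Truncating to prefixes before invoking selection keeps the token count per LLM call small, which is the efficiency argument made above.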
Beam search strategy. Based on the three path pruning methods above, PoG can support various beam search strategies, ranging from non-reliant to fully reliant on LLMs. These strategies are selectable in a user-friendly manner, allowing flexibility based on the specific requirements of the task. We have defined four such strategies in Algorithm 2 of Appendix A.2.
4.4. Question Answering
Based on the pruned paths in Section 4.3, we introduce a two-step question-answering method.
Path Summarizing. To address hallucinations caused by paths with excessive or incorrect text, we develop a summarization strategy by prompting LLM to review and extract relevant triples from provided paths, creating a concise and focused path. Details of the prompts used are in Appendix E.
Question answering. Based on the current reasoning path derived from path pruning and summarizing, we prompt the LLM to first evaluate whether the paths are sufficient for answering the split questions and then the main question. If the evaluation is positive, the LLM is prompted to generate the answer using these paths, along with the question and question analysis results as inputs, as shown in Figure 2. The prompts for evaluation and generation are detailed in Appendix E. If the evaluation is negative, the exploration process is repeated until completion. If node expand exploration reaches its depth limit without yielding a satisfactory answer, the LLM leverages both the provided and its inherent knowledge to formulate a response.
5. Experiments
Table 1. Results of PoG across various datasets, compared with the state-of-the-art (SOTA) in Supervised Learning (SL) and In-Context Learning (ICL) methods. The highest scores for ICL methods are highlighted in bold, while the second-best results are underlined. The Prior FT (Fine-tuned) SOTA includes the best-known results achieved through supervised learning.
| Method | Class | LLM | CWQ (Multi-Hop) | WebQSP (Multi-Hop) | GrailQA (Multi-Hop) | Simple Questions (Single-Hop) | WebQuestions (Open-Domain) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without external knowledge | | | | | | | |
| IO prompt (Sun et al., 2024) | - | GPT-3.5-Turbo | 37.6 | 63.3 | 29.4 | 20.0 | 48.7 |
| CoT (Sun et al., 2024) | - | GPT-3.5-Turbo | 38.8 | 62.2 | 28.1 | 20.3 | 48.5 |
| SC (Sun et al., 2024) | - | GPT-3.5-Turbo | 45.4 | 61.1 | 29.6 | 18.9 | 50.3 |
| With external knowledge | | | | | | | |
| Prior FT SOTA | SL | - | 70.4 (Das et al., 2021) | 85.7 (Luo et al., 2024) | 75.4 (Gu et al., 2023) | 85.8 (Baek et al., 2023b) | 56.3 (Kedia et al., 2022) |
| KB-BINDER (Li et al., 2023a) | ICL | Codex | - | 74.4 | 58.5 | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-3.5-Turbo | 58.9 | 76.2 | 68.7 | 53.6 | 54.5 |
| ToG-2.0 (Ma et al., 2024) | ICL | GPT-3.5-Turbo | - | 81.1 | - | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-4 | 69.5 | 82.6 | 81.4 | 66.7 | 57.9 |
| PoG-E | ICL | GPT-3.5-Turbo | 71.9 | 90.9 | 87.6 | 78.3 | 76.9 |
| PoG | ICL | GPT-3.5-Turbo | 74.7 | 93.9 | 91.6 | 80.8 | 81.8 |
| PoG-E | ICL | GPT-4 | 78.5 | 95.4 | 91.4 | 81.2 | 82.0 |
| PoG | ICL | GPT-4 | 81.4 | 96.7 | 94.4 | 84.0 | 84.6 |
Experimental settings. We evaluate PoG on five KGQA datasets, i.e., CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), SimpleQuestions (Petrochuk and Zettlemoyer, 2018), and WebQuestions (Berant et al., 2013). PoG is tested against methods without external knowledge (IO, CoT (Wei et al., 2022), SC (Wang et al., 2022)) and the state-of-the-art (SOTA) approaches with external knowledge, including prompting-based and fine-tuning-based methods. Freebase (Bollacker et al., 2008) serves as the background knowledge graph for all datasets. Experiments are conducted using two LLMs, i.e., GPT-3.5 (GPT-3.5-Turbo) and GPT-4. Following prior studies, we use exact match accuracy (Hits@1) as the evaluation metric. Due to space limitations, detailed experimental settings, including dataset statistics, baselines, and implementation details, are provided in Appendix C.
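As a reference point, a common way to compute exact-match Hits@1 is sketched below; the paper's exact answer normalization may differ:

```python
def hits_at_1(predictions, gold):
    """Exact-match Hits@1: a question counts as correct when the
    top-ranked answer string matches any gold answer
    (case-insensitive, whitespace-stripped)."""
    correct = sum(
        1 for q, pred in predictions.items()
        if pred.strip().lower() in {g.lower() for g in gold[q]}
    )
    return correct / len(predictions)

# Demo with hypothetical question IDs and answers.
preds = {"q1": "Germany", "q2": "Paris"}
gold = {"q1": ["Germany"], "q2": ["Lyon"]}
score = hits_at_1(preds, gold)
print(score)  # → 0.5
```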
PoG setting. We adopt the Fuzzy + Precise Path Selection strategy in Algorithm 2 of Appendix A.2 for PoG, with $W_{1}=80$ for fuzzy selection. Additionally, we introduce PoG-E, which randomly selects one relation from each edge in the clustered question subgraph to evaluate the impact of graph structure on KG-based LLM reasoning. $W_{\max}$ and $D_{\max}$ are 3 by default for beam search.
5.1. Main Results
Since PoG leverages external knowledge to enhance LLM reasoning, we first compare it with other methods that utilize external knowledge. Although PoG is a training-free, prompting-based method with inherent disadvantages compared to fine-tuned methods trained on the evaluation data, PoG with GPT-3.5-Turbo still achieves new SOTA performance across most datasets, as shown in Table 1. Additionally, PoG with GPT-4 surpasses the fine-tuned SOTA across all the multi-hop and open-domain datasets by an average of 17.3% and up to 28.3% on the WebQuestions dataset. Among all in-context learning (ICL) methods, PoG with GPT-3.5-Turbo surpasses all previous SOTA methods. When comparing PoG with GPT-3.5-Turbo against the SOTA using GPT-4, PoG outperforms the SOTA by an average of 12.9% and up to 23.9%. When using the same LLM, PoG demonstrates substantial improvements: with GPT-3.5-Turbo, it outperforms the SOTA by an average of 21.2% and up to 27.3% on the WebQuestions dataset; with GPT-4, it outperforms the SOTA by 16.6% on average and up to 26.7% on the WebQuestions dataset. Additionally, PoG with GPT-3.5-Turbo outperforms methods without external knowledge (e.g., IO, CoT, SC prompting) by 62% on GrailQA and 60.5% on Simple Questions. These results show that incorporating external knowledge graphs significantly enhances reasoning tasks. PoG-E also achieves excellent results: under GPT-4, it surpasses all ICL SOTA by 14.1% on average and up to 24.1% on the WebQuestions dataset. These findings demonstrate that the graph structure is crucial for reasoning tasks, particularly for complex logical reasoning. By integrating the structural information of the question within the graph, PoG enhances the deep reasoning capabilities of LLMs, leading to superior performance.
5.2. Ablation Study
We perform various ablation studies to understand the importance of different factors in PoG. These studies are conducted with GPT-3.5-Turbo on two subsets of the CWQ and WebQSP test sets, each containing 500 randomly sampled questions.

Does search depth matter? As described, PoG’s dynamic deep search is bounded by $D_{\max}$ . To assess the impact of $D_{\max}$ on performance, we conduct experiments with depths from 1 to 4. The results, shown in Figures 4 (a) and (c), indicate that performance improves with increased depth, but the benefits diminish beyond a depth of 3. Figures 4 (b) and (d), which show the exploration phase from which each answer is generated, reveal that higher depths reduce the effectiveness of both LLM-based path supplementation and node exploration. Excessive depth leads to LLM hallucinations and difficulties in managing long reasoning paths. Therefore, we set the maximum depth to 3 in our experiments to balance performance and computational efficiency. Additionally, even at lower depths, PoG maintains strong performance by effectively combining the LLM’s inherent knowledge with the structured information from the KG.
(a) CWQ (Vary $D_{\max}$ )
(b) CWQ(PoG)
(c) WebQSP (Vary $D_{\max}$ )
(d) WebQSP (PoG)
Figure 4. The accuracy of PoG and PoG-E among CWQ and WebQSP datasets by varying different $D_{\max}$ .
5.3. Effectiveness Evaluation
Effectiveness evaluation on multi-entity questions. To evaluate PoG’s performance on multi-entity questions, we report the accuracy on all test sets by categorizing questions based on the number of topic entities. The results, shown in Table 2, demonstrate that, despite the increased complexity of multi-entity questions compared to single-entity ones, PoG maintains excellent accuracy, achieving up to 93.9% on the WebQSP dataset. This underscores the effectiveness of our structure-based model in handling complex multi-entity queries. Notably, the slightly lower performance on the GrailQA dataset can be attributed to some questions lacking matched topic entities, which prevents effective reasoning over the KG.
Table 2. Performance of PoG and PoG-E on multi-entity and single-entity questions of all datasets. The symbol ‘-’ indicates the dataset contains no multi-entity questions.
| Question Set | CWQ | WebQSP | GrailQA | WebQuestions | Simple Questions |
| --- | --- | --- | --- | --- | --- |
| PoG with GPT-3.5-Turbo | | | | | |
| Single-entity | 70.3 | 93.9 | 92.1 | 81.7 | 78.3 |
| Multi-entity | 80.2 | 93.1 | 70.7 | 82.8 | - |
| PoG-E with GPT-3.5-Turbo | | | | | |
| Single-entity | 67.5 | 91.0 | 88.2 | 76.8 | 80.8 |
| Multi-entity | 77.5 | 82.8 | 76.0 | 82.8 | - |
Effectiveness evaluation on multi-hop reasoning. To assess PoG’s performance on multi-hop reasoning tasks, we analyze accuracy by categorizing questions based on the length of their ground-truth SPARQL queries. We randomly sample 1,000 questions from the CWQ and WebQSP datasets and determine the reasoning length of each question by counting the number of relations in its ground-truth SPARQL query. The distribution of questions with varying reasoning lengths is illustrated in Figure 5. We evaluate the performance of PoG and PoG-E across different ground-truth lengths to understand their effectiveness under varying query complexities. As shown in Figure 6, the performance of PoG and PoG-E remains consistent across different reasoning lengths. Even at the highest length levels in the WebQSP dataset, PoG achieves excellent accuracy, reaching up to 90%. Notably, although some questions have ground-truth lengths of eight or more, PoG successfully addresses them without matching the ground-truth length, demonstrating its ability to explore novel paths by effectively combining the LLM’s inherent knowledge with the structured information from the KG. These results demonstrate the effectiveness of PoG in handling complex multi-hop reasoning tasks.
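As a concrete illustration of this setup, the reasoning length of a question can be estimated by counting relation occurrences in its ground-truth SPARQL query. Below is a minimal sketch assuming Freebase-style `ns:domain.type.property` predicates; the helper name and regex are illustrative, not the paper's implementation:

```python
import re

def reasoning_length(sparql: str) -> int:
    """Estimate reasoning length as the number of relation predicates
    (Freebase-style ns:domain.type.property) in the WHERE clause."""
    match = re.search(r"WHERE\s*\{(.*)\}", sparql, re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else sparql
    # A relation predicate has two dots (e.g. ns:people.person.spouse_s);
    # entity MIDs like ns:m.0d05w3 have only one and are not counted.
    return len(re.findall(r"ns:\w+\.\w+\.\w+", body))

query = """PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?x WHERE {
    ns:m.0d05w3 ns:people.person.spouse_s ?y .
    ?y ns:people.marriage.spouse ?x .
}"""
print(reasoning_length(query))  # 2 relations -> reasoning length 2
```

A query whose WHERE clause chains two relations thus counts as a length-2 (two-hop) question under this scheme.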
Table 3. The illustration of graph size reduction.
| | CWQ | WebQSP | GrailQA | WebQuestions |
| --- | --- | --- | --- | --- |
| Ave Entity Number | 3,540,267 | 243,826 | 62,524 | 240,863 |
| Ave Entity Number After Pruned | 1,621,055 | 182,673 | 30,267 | 177,822 |
| Ave Entity Reduction Proportion (%) | 54% | 25% | 52% | 26% |
<details>
<summary>x8.png Details</summary>

### Visual Description
## Bar Chart: Number of Questions vs. Length of Paths in SPARQL
### Overview
The image is a bar chart comparing the number of questions for different lengths of paths in SPARQL for two datasets: CWQ (white bars) and WebQSP (black bars). The y-axis (Number of questions) is on a logarithmic scale. The x-axis represents the length of paths in SPARQL, ranging from 1 to 8.
### Components/Axes
* **Title:** Implicit, but the chart shows the relationship between the number of questions and the length of paths in SPARQL.
* **X-axis:** Length of paths in SPARQL, with values 1, 2, 3, 4, 5, 6, 7, and '8+'.
* **Y-axis:** Number of questions, with a logarithmic scale. The axis markers are at 10^0 (1), 10^1 (10), and 10^2 (100).
* **Legend:** Located at the top-right of the chart.
* CWQ: Represented by white bars.
* WebQSP: Represented by black bars.
### Detailed Analysis
Here's a breakdown of the data for each path length:
* **Path Length 1:**
* CWQ: Approximately 2 questions.
* WebQSP: Approximately 3 questions.
* **Path Length 2:**
* CWQ: Approximately 3 questions.
* WebQSP: Approximately 30 questions.
* **Path Length 3:**
* CWQ: Approximately 250 questions.
* WebQSP: Approximately 400 questions.
* **Path Length 4:**
* CWQ: Approximately 200 questions.
* WebQSP: Approximately 180 questions.
* **Path Length 5:**
* CWQ: Approximately 120 questions.
* WebQSP: Approximately 20 questions.
* **Path Length 6:**
* CWQ: Approximately 20 questions.
* WebQSP: Approximately 3 questions.
* **Path Length 7:**
* CWQ: Approximately 20 questions.
* WebQSP: Not present.
* **Path Length 8+:**
* CWQ: Approximately 80 questions.
* WebQSP: Approximately 60 questions.
### Key Observations
* For path lengths 1 and 2, the number of questions is significantly lower compared to path lengths 3 and 4.
* Both CWQ and WebQSP peak at path lengths 3 and 4.
* WebQSP has a higher number of questions for path lengths 2 and 3 compared to CWQ.
* CWQ has a higher number of questions for path lengths 5, 7, and 8+ compared to WebQSP.
* The number of questions decreases significantly for both datasets after path length 4, except for CWQ at path length 8+.
### Interpretation
The chart illustrates the distribution of question complexity, as measured by the length of paths in SPARQL queries, across two datasets (CWQ and WebQSP). The data suggests that both datasets contain a significant number of questions that require paths of length 3 and 4. The difference in distributions between CWQ and WebQSP indicates that WebQSP contains more questions with shorter paths (2 and 3), while CWQ has a higher proportion of questions with longer paths (5, 7, and 8+). This could reflect differences in the types of questions or the structure of the knowledge graphs used in each dataset. The logarithmic scale emphasizes the relative differences in the number of questions, highlighting the dominance of path lengths 3 and 4.
</details>
Figure 5. The lengths of the ground-truth SPARQL queries within the CWQ and WebQSP datasets.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. Length of Paths in SPARQL
### Overview
The image is a bar chart comparing the accuracy of two methods, PoG and PoG-E, against the length of paths in SPARQL queries. The x-axis represents the length of paths, ranging from 1 to 8+, while the y-axis represents accuracy in percentage.
### Components/Axes
* **X-axis:** Length of paths in SPARQL. Categories: 1, 2, 3, 4, 5, 6, 7, 8+
* **Y-axis:** Accuracy (%). Scale: 0 to 100, with tick marks at intervals of 20.
* **Legend:** Located at the top-center of the chart.
* PoG: Light blue bars with a black outline.
* PoG-E: Dark gray bars with a black outline.
### Detailed Analysis
Here's a breakdown of the accuracy for each path length, for both PoG and PoG-E:
* **Path Length 1:**
* PoG: No bar present, implying 0% accuracy.
* PoG-E: No bar present, implying 0% accuracy.
* **Path Length 2:**
* PoG (Light Blue): Approximately 100% accuracy.
* PoG-E (Dark Gray): Approximately 75% accuracy.
* **Path Length 3:**
* PoG (Light Blue): Approximately 82% accuracy.
* PoG-E (Dark Gray): Approximately 80% accuracy.
* **Path Length 4:**
* PoG (Light Blue): Approximately 70% accuracy.
* PoG-E (Dark Gray): Approximately 69% accuracy.
* **Path Length 5:**
* PoG (Light Blue): Approximately 57% accuracy.
* PoG-E (Dark Gray): Approximately 55% accuracy.
* **Path Length 6:**
* PoG (Light Blue): Approximately 56% accuracy.
* PoG-E (Dark Gray): Approximately 56% accuracy.
* **Path Length 7:**
* PoG (Light Blue): Approximately 51% accuracy.
* PoG-E (Dark Gray): Approximately 47% accuracy.
* **Path Length 8+:**
* PoG (Light Blue): Approximately 63% accuracy.
* PoG-E (Dark Gray): Approximately 50% accuracy.
**Trends:**
* **PoG:** Accuracy starts at 0% for path length 1, peaks at 100% for path length 2, then generally decreases until path length 7, before increasing again at path length 8+.
* **PoG-E:** Accuracy starts at 0% for path length 1, peaks at 80% for path length 3, then generally decreases until path length 7, before increasing again at path length 8+.
### Key Observations
* PoG generally outperforms PoG-E in terms of accuracy, except at path length 6 where they are approximately equal.
* Both methods show a decrease in accuracy as the path length increases from 2 to 7.
* Both methods show an increase in accuracy when the path length is 8+.
* The most significant difference in accuracy between PoG and PoG-E is at path length 2.
### Interpretation
The chart suggests that the accuracy of both PoG and PoG-E is affected by the length of paths in SPARQL queries. Shorter paths (length 2) result in higher accuracy, particularly for PoG. As the path length increases, the accuracy tends to decrease, possibly due to increased complexity or ambiguity in the queries. The increase in accuracy at path length 8+ could indicate a different behavior or characteristic of very long paths. The data implies that PoG is generally more accurate than PoG-E, especially for shorter paths.
</details>
(a) CWQ
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. Length of Paths in SPARQL
### Overview
The image is a bar chart comparing the accuracy of two methods, PoG and PoG-E, based on the length of paths in SPARQL queries. The x-axis represents the length of paths, and the y-axis represents the accuracy in percentage.
### Components/Axes
* **Title:** (Inferred) Accuracy vs. Length of Paths in SPARQL
* **X-axis:**
* **Label:** Length of paths in SPARQL
* **Categories:** 1, 2, 3, 4, 5, 6, 7, 8+
* **Y-axis:**
* **Label:** Accuracy (%)
* **Scale:** 0 to 100, with tick marks at 20, 40, 60, 80, and 100.
* **Legend:** Located at the top of the chart.
* **PoG:** Represented by light blue bars.
* **PoG-E:** Represented by dark gray bars.
### Detailed Analysis
Here's a breakdown of the accuracy for each path length and method:
* **Path Length 1:**
* PoG (light blue): Approximately 50%
* PoG-E (dark gray): Approximately 50%
* **Path Length 2:**
* PoG (light blue): Approximately 90%
* PoG-E (dark gray): Approximately 60%
* **Path Length 3:**
* PoG (light blue): Approximately 96%
* PoG-E (dark gray): Approximately 95%
* **Path Length 4:**
* PoG (light blue): Approximately 83%
* PoG-E (dark gray): Approximately 82%
* **Path Length 5:**
* PoG (light blue): Approximately 75%
* PoG-E (dark gray): Approximately 74%
* **Path Length 6:**
* PoG (light blue): Approximately 67%
* PoG-E (dark gray): Approximately 66%
* **Path Length 7:**
* PoG (light blue): No data
* PoG-E (dark gray): No data
* **Path Length 8+:**
* PoG (light blue): Approximately 93%
* PoG-E (dark gray): Approximately 90%
### Key Observations
* PoG generally has higher accuracy than PoG-E, except at path length 1 and 3 where they are nearly equal.
* The accuracy of PoG peaks at path length 3, then decreases until path length 6, and increases again at path length 8+.
* The accuracy of PoG-E peaks at path length 3, then decreases until path length 6, and increases again at path length 8+.
* There is no data for path length 7.
### Interpretation
The chart suggests that the length of paths in SPARQL queries affects the accuracy of both PoG and PoG-E methods. PoG appears to be more accurate than PoG-E for most path lengths, especially at path length 2. The drop in accuracy between path lengths 3 and 6 could indicate a complexity threshold where longer paths become more difficult to process accurately. The increase in accuracy at path length 8+ is interesting and might suggest that very long paths have some properties that make them easier to process, or that the queries represented by 8+ paths are of a different nature than those with lengths between 4 and 6. The absence of data for path length 7 is a notable gap in the data.
</details>
(b) WebQSP
Figure 6. The accuracy of PoG and PoG-E on the CWQ and WebQSP datasets, categorized by the lengths of the ground-truth SPARQL queries for each question.
Graph structure pruning. To evaluate the effectiveness of the graph pruning method proposed in Section 4.1, we conduct experiments using 200 random samples from each dataset. We report the average number of entities per question before and after graph reduction, as well as the proportion of entities reduced, in Table 3. The results indicate that up to 54% of entities in the CWQ dataset can be pruned before path exploration. This demonstrates the effectiveness of eliminating irrelevant data from the outset.
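The reduction proportions in Table 3 follow directly from the reported average entity counts before and after pruning; a quick check in Python:

```python
# Reproduce the entity-reduction proportions in Table 3 from the
# reported average entity counts (before vs. after graph pruning).
table3 = {
    "CWQ":          (3_540_267, 1_621_055),
    "WebQSP":       (243_826, 182_673),
    "GrailQA":      (62_524, 30_267),
    "WebQuestions": (240_863, 177_822),
}

for dataset, (before, after) in table3.items():
    reduction = 1 - after / before
    print(f"{dataset}: {reduction:.0%} of entities pruned")
# CWQ: 54%, WebQSP: 25%, GrailQA: 52%, WebQuestions: 26%
```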
Case study: interpretable reasoning. To demonstrate the interpretability of PoG, we conduct a case study with three reasoning examples, presented in Table 9 of Appendix B.5. These examples feature questions with one, two, and three entities, respectively. Through the case study, we showcase PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing faithful and interpretable reasoning paths that lead to accurate answers.
To further evaluate the effectiveness and efficiency of PoG, we perform additional experiments, including pruning beam search strategy ablation and prompt setting ablation (Appendix B.1), reasoning faithfulness analysis (Appendix B.2), error analysis (Appendix B.3), LLM calls cost and running time analysis (Appendix B.4), and graph reduction and path pruning case study (Appendix B.5).
6. Conclusion
In this paper, we introduce Paths-over-Graph (PoG), a novel method that integrates LLMs with KGs to enable faithful and interpretable reasoning. PoG addresses complex reasoning tasks through a three-phase dynamic multi-hop path exploration, combining the inherent knowledge of LLMs with factual information from KGs. Efficiency is enhanced by graph-structured pruning and a three-step pruning process that effectively narrows down candidate paths. Extensive experiments on five public datasets demonstrate that PoG outperforms existing baselines, showcasing its superior reasoning capabilities and interpretability.
7. Acknowledgment
Xiaoyang Wang is supported by the Australian Research Council DP230101445 and DP240101322. Wenjie Zhang is supported by the Australian Research Council DP230101445 and FT210100303.
References
- Baek et al. (2023b) Jinheon Baek, Alham Fikri Aji, Jens Lehmann, and Sung Ju Hwang. 2023b. Direct Fact Retrieval from Knowledge Graphs without Entity Linking. In ACL.
- Baek et al. (2023a) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023a. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136 (2023).
- Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, Vol. 38. 17682–17690.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
- Brown (2020) Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
- Chen et al. (2024) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter 25, 2 (2024), 42–61.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
- Das et al. (2021) Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In EMNLP.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. In ICLR.
- Gu et al. (2023) Yu Gu, Xiang Deng, and Yu Su. 2023. Don’t Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments. In ACL.
- Gu et al. (2021) Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases. In WWW. 3477–3488.
- Guo et al. (2023) Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066 (2023).
- Guo et al. (2019) Lingbing Guo, Zequn Sun, and Wei Hu. 2019. Learning to exploit long-term relational dependencies in knowledge graphs. In International conference on machine learning. PMLR, 2505–2514.
- Guo et al. (2024) Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu, Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. 2024. Knowledgenavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex & Intelligent Systems 10, 5 (2024), 7063–7076.
- Huang et al. (2025) Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao. 2025. Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models. In COLING.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645 (2023).
- Kedia et al. (2022) Akhil Kedia, Mohd Abbas Zaidi, and Haejun Lee. 2022. FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering. In EMNLP.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
- Kim et al. (2023) Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. 2023. Kg-gpt: A general framework for reasoning on knowledge graphs using large language models. arXiv preprint arXiv:2310.11220 (2023).
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS (2022).
- Li et al. (2024a) Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024a. Hypergraph Self-supervised Learning with Sampling-efficient Signals. arXiv preprint arXiv:2404.11825 (2024).
- Li et al. (2024b) Fan Li, Zhiyu Xu, Dawei Cheng, and Xiaoyang Wang. 2024b. AdaRisk: Risk-adaptive Deep Reinforcement Learning for Vulnerable Nodes Detection. IEEE TKDE (2024).
- Li et al. (2023a) Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023a. Few-shot In-context Learning on Knowledge Base Question Answering. In ACL.
- Li et al. (2023b) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2023b. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. arXiv preprint arXiv:2305.13269 (2023).
- Luo et al. (2023) Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538 (2023).
- Luo et al. (2024) Linhao Luo, Yuanfang Li, Reza Haf, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In ICLR.
- Ma et al. (2024) Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. 2024. Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval. arXiv preprint arXiv:2407.10805 (2024).
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. TKDE (2024).
- Peters et al. (2019) Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
- Petrochuk and Zettlemoyer (2018) Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In EMNLP. 554–558.
- Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
- Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.
- Sima et al. (2024) Qing Sima, Jianke Yu, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024. Deep Overlapping Community Search via Subspace Embedding. arXiv preprint arXiv:2404.14692 (2024).
- Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019).
- Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In ICLR.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In NAACL.
- Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937 (2018).
- Tan et al. (2023) Xingyu Tan, Jingya Qian, Chen Chen, Sima Qing, Yanping Wu, Xiaoyang Wang, and Wenjie Zhang. 2023. Higher-Order Peak Decomposition. In CIKM.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Wang et al. (2024a) Jinghao Wang, Yanping Wu, Xiaoyang Wang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. 2024a. Efficient Influence Minimization via Node Blocking. Proc. VLDB Endow. (2024).
- Wang et al. (2024b) Kai Wang, Yuwei Xu, Zhiyong Wu, and Siqiang Luo. 2024b. LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs. In ACL.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS (2022).
- Wen et al. (2024) Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. In ACL.
- Wu et al. (2024) Yanping Wu, Renjie Sun, Xiaoyang Wang, Dong Wen, Ying Zhang, Lu Qin, and Xuemin Lin. 2024. Efficient Maximal Frequent Group Enumeration in Temporal Bipartite Graphs. Proc. VLDB Endow. (2024).
- Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS (2024).
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
- Ye et al. (2021) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 2021. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. arXiv preprint arXiv:2109.08678 (2021).
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In ACL. 201–206.
- Zhang et al. (2021) Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In ICML. 12437–12446.
- Zhang et al. (2023) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.
- Zhao et al. (2023) Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework. In ACL.
Appendix A Algorithm
A.1. Exploration
The complete exploration procedure described in Section 4.2 is summarized in Algorithm 1.
Input: Question subgraph ($\mathcal{G}_{q}$), source KG ($\mathcal{G}$), question and split question ($Q=q+q_{split}$), topic entities ($Topic(q)$), LLM indicator ($I_{\text{LLM}}$), predicted depth ($D_{\text{predict}}$), maximum depth ($D_{\max}$), maximum width ($W_{\max}$), and path pruning case ($case$)
Output: PoG answers ($a(q)$), final reasoning path ($Paths_{f}(q)$)
/* Topic entity path exploration */
$List_{T}\leftarrow\text{Reorder}(Topic(q),I_{\text{LLM}})$; $D\leftarrow\min(D_{\text{predict}},D_{\max})$;
while $D\leq D_{\max}$ do
  $Paths_{t}\leftarrow\text{EntityPathFind}(List_{T},D,\mathcal{G}_{q})$;
  $\text{PathPruning}(Paths_{t},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  $Answer,Paths_{T}\leftarrow\text{QuestionAnswering}(Paths_{t},Q,I_{\text{LLM}})$;
  if "{Yes}" in $Answer$ then return $Answer,Paths_{T}$;
  else $D\leftarrow D+1$;
/* LLM supplement path exploration */
$Paths_{s}\leftarrow[\,]$;
$Predict(q)\leftarrow\text{SupplementPrediction}(Paths_{T},Q,I_{\text{LLM}})$;
for each $e,I_{sup(e)}\in Predict(q)$ do
  $List_{S}\leftarrow\text{Reorder}(List_{T}+e,I_{sup(e)})$;
  $Paths_{s}^{\prime}\leftarrow\text{EntityPathFind}(List_{S},D_{\max},\mathcal{G}_{q})$;
  $Paths_{s}\leftarrow Paths_{s}+\text{FuzzySelect}(Paths_{s}^{\prime},I_{sup(e)},W_{\max})$;
$\text{PathPruning}(Paths_{s},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{S},case)$;
$Answer,Paths_{S}\leftarrow\text{QuestionAnswering}(Paths_{s},Q,I_{\text{LLM}})$;
if "{Yes}" in $Answer$ then return $Answer,Paths_{S}$;
/* Node expand exploration */
$Visited\leftarrow\emptyset$; $D\leftarrow 1$; $Paths_{e}\leftarrow Paths_{T}+Paths_{S}$;
$\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
while $D\leq D_{\max}$ do
  for each $e\in\text{ExtractEntity}(Paths_{e})\land e\notin Visited$ do
    $Related\_edges\leftarrow\text{Find\_1\_hop\_Edges}(\mathcal{G},e)$;
    $Paths_{e}\leftarrow\text{MergeTogether}(Paths_{e},Related\_edges)$;
  $\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  $Answer,Paths_{E}\leftarrow\text{QuestionAnswering}(Paths_{e},Q,I_{\text{LLM}})$;
  if "{Yes}" in $Answer$ then return $Answer,Paths_{E}$;
  else $Visited\leftarrow Visited\cup\{e\}$; $D\leftarrow D+1$;
/* Final answering over all explored paths */
$Paths_{l}\leftarrow Paths_{T}+Paths_{S}+Paths_{E}$;
$\text{PathPruning}(Paths_{l},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
$Answer,Paths_{L}\leftarrow\text{QuestionAnsweringFinal}(Paths_{l},Q,I_{\text{LLM}})$;
return $Answer,Paths_{L}$;
Algorithm 1 Exploration
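For intuition, the deepening loop at the core of Algorithm 1 can be sketched in Python. Here `kg_paths` and `llm_answer` are hypothetical stand-ins (for EntityPathFind and QuestionAnswering), and the pruning, supplement, and node-expansion phases are elided; this illustrates the control flow, not the paper's implementation:

```python
def explore(question, topic_entities, kg_paths, d_max, llm_answer):
    """Minimal sketch of the deepening loop behind Algorithm 1.
    kg_paths(entities, d) returns candidate paths of depth d;
    llm_answer(paths, q) returns (answer_or_None, kept_paths)."""
    explored = []
    # Phase 1: topic entity path exploration, deepening up to D_max.
    for d in range(1, d_max + 1):
        paths = kg_paths(topic_entities, d)
        answer, kept = llm_answer(paths, question)
        explored += kept
        if answer is not None:  # the LLM replied "{Yes}"
            return answer, kept
    # Phases 2-3 (LLM supplement, node expansion) would widen the search
    # before QuestionAnsweringFinal forces an answer over all paths.
    return llm_answer(explored, question)

# Toy demo: paths only reach the answer at depth 2.
def kg_paths(entities, d):
    return [(e,) + ("r",) * d for e in entities]

def llm_answer(paths, q):
    hits = [p for p in paths if len(p) >= 3]  # i.e. depth >= 2
    return ("answer" if hits else None), hits or list(paths)

answer, path = explore("toy question", ["e1"], kg_paths, 2, llm_answer)
```

In the toy run, depth 1 yields no confident answer, so the loop deepens to 2 and returns from there, mirroring how PoG stops exploring as soon as the LLM judges the current paths sufficient.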
A.2. Path Pruning
The complete path pruning procedure described in Section 4.3 is summarized in Algorithm 2.
Input: Candidate paths ($Paths_{c}$), question and split question ($Q=q+q_{split}$), indicator ($I$), maximum width ($W_{\max}$), maximum depth ($D_{\max}$), entity list ($list$)
Output: Pruned candidate paths ($Paths_{c}$)
if Case = Fuzzy Selection Only then $\text{FuzzySelect}(Paths_{c},Q,I,W_{\max})$ ;
else if Case = Fuzzy + Precise Path Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$ ;
  $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$ ;
else if Case = Fuzzy + Branch Reduced Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$ ;
  $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{\max},D_{\max},list)$ ;
else if Case = Fuzzy + Branch Reduced + Precise Path then /* 3-Step Beam Search */
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$ ;
  $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{2},D_{\max},list)$ ;
  $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$ ;
Procedure BranchReduceSelect $(Paths_{c},Q,I,W,D_{\max},list)$
  $D← 1$ , $Paths_{e}←\emptyset$ ;
  while $|Paths_{c}|≥ W\land D≤ D_{\max}$ do
    for each $e∈ list$ do
      $Paths_{e}← Paths_{e}\cup\text{ExtractHeadSteps}(Paths_{c},e,D)$ ;
      if $|Paths_{e}|>W$ then $\text{PrecisePathSelect}(Paths_{e},Q,I,W)$ ;
    $Paths_{c}←\text{IntersectMatchUpdate}(Paths_{e},Paths_{c})$ ;
    $Paths_{e}←\emptyset$ ;
    $D← D+1$ ;
  if $|Paths_{c}|>W$ then $\text{PrecisePathSelect}(Paths_{c},Q,I,W)$ ;
Algorithm 2 PathPruning
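To make the structure-based pruning concrete, the following is a minimal Python sketch of the BranchReduceSelect loop in Algorithm 2. It assumes a path is a tuple of (head, relation, tail) triples and uses a caller-supplied `select_top` function as a stand-in for PrecisePathSelect; both are illustrative simplifications, not the exact implementation.

```python
def branch_reduce_select(paths, entities, W, D_max, select_top):
    """Prune candidate paths using the graph structure (Algorithm 2 sketch).

    A path is a tuple of (head, relation, tail) triples; `select_top(ps, W)`
    stands in for PrecisePathSelect and returns the W most relevant items.
    """
    D = 1
    while len(paths) >= W and D <= D_max:
        head_prefixes = set()
        for e in entities:
            # D-step prefixes of paths rooted at topic entity e (ExtractHeadSteps).
            prefixes = {p[:D] for p in paths if p and p[0][0] == e}
            if len(prefixes) > W:
                prefixes = select_top(prefixes, W)
            head_prefixes |= set(prefixes)
        # Keep only full paths whose prefix survived (IntersectMatchUpdate).
        paths = [p for p in paths if p[:D] in head_prefixes]
        D += 1
    if len(paths) > W:
        paths = select_top(paths, W)
    return paths
```

Because pruning operates on shared prefixes, branches that diverge early from any surviving head step are discarded in bulk, which is what drives the token savings reported for the BranchReduced strategies.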
Appendix B Experiment
B.1. Additional Ablation Study
Compare the effect of different beam search strategies. As introduced in Section 4.3, PoG supports beam search strategies ranging from LLM-free to fully LLM-reliant, which users can select as needed. To evaluate computational cost and performance, we test the four cases outlined in Algorithm 2, setting $W_{2}=20$ for internal narrowing in the 3-Step Beam Search case. The Fuzzy Selection approach, as described in Section 4.3, encodes all candidate paths together with an LLM-generated indicator and compares them in embedding space. For each beam search strategy applied to PoG and PoG-E, we report accuracy, the average total number of LLM calls, and the average token input during path pruning in Table 4 and Table 5. The results indicate that PoG/PoG-E with Fuzzy and Precise Path Selection achieves the highest accuracy. The BranchReduced Selection method, which leverages the graph structure, also delivers strong results while reducing token usage by over 50% (65%) at a cost of only ±2% (±4.3%) accuracy on PoG (PoG-E) relative to the best-performing strategy. Furthermore, the Fuzzy Selection method, which employs lightweight models instead of relying solely on LLMs, likewise performs well. These results validate the effectiveness of our beam search strategies and underscore the importance of structure-based faithful path reasoning.
Table 4. Performance comparison of PoG with different beam search methods on CWQ and WebQSP.
| PoG | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 57.1 | 86.4 |
| | Token Input | - | - |
| | LLM Calls | 6.8 | 6.5 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 79.3 | 93.0 |
| | Token Input | 101,455 | 328,742 |
| | LLM Calls | 9.7 | 9.3 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 617,448 |
| | LLM Calls | 9.1 | 7.5 |
| w/ 3-Step Beam Search | Accuracy | 79.8 | 91.9 |
| | Token Input | 102,036 | 369,175 |
| | LLM Calls | 8.8 | 9.0 |
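The Fuzzy Selection step described above can be sketched as embedding-based ranking. In this sketch, the path serialization and the generic `encode` callback (e.g. an SBERT model's `encode` method) are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np

def fuzzy_select(paths, indicator, encode, W):
    """Keep the W candidate paths most similar to an LLM-generated indicator.

    `encode(texts)` is any sentence encoder returning one vector per text
    (e.g. an SBERT model); paths are tuples of (head, relation, tail) triples.
    """
    texts = [" ; ".join(f"{h} {r} {t}" for (h, r, t) in p) for p in paths]
    vecs = np.asarray(encode(texts + [indicator]), dtype=float)
    q, pv = vecs[-1], vecs[:-1]
    # Cosine similarity between each serialized path and the indicator.
    sims = pv @ q / (np.linalg.norm(pv, axis=1) * np.linalg.norm(q) + 1e-9)
    keep = np.argsort(-sims)[:W]
    return [paths[i] for i in keep]
```

Since the encoder is a small pre-trained model rather than an LLM, this step adds no LLM calls, which is consistent with the low call counts reported for the Fuzzy Selection rows.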
How do summary prompts affect performance? Inspired by GoT (Besta et al., 2024), we utilize summary prompts to reduce LLM hallucinations and decrease computational costs. To evaluate their impact, we conduct an ablation study comparing PoG and PoG-E with and without path summarization, measuring both accuracy and the average token input to the LLM API during the path pruning phase. The results, presented in Table 6, show that using path summaries increases accuracy by up to 10% on the CWQ dataset with PoG-E, while reducing token input by up to 36% on WebQSP. These results indicate that path summarization effectively minimizes LLM hallucinations, enhances the LLM’s understanding of the explored paths, facilitates answer retrieval, enables earlier termination of the reasoning process, and reduces costs.
Table 5. Performance comparison of PoG-E with different beam search methods among CWQ and WebQSP datasets.
| PoG-E | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 62.31 | 82.3 |
| | Token Input | - | - |
| | Avg LLM Calls | 6.0 | 6.3 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 71.9 | 88.4 |
| | Token Input | 128,407 | 371,083 |
| | Avg LLM Calls | 9.4 | 9.1 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 80.4 | 91.4 |
| | Token Input | 344,747 | 603,261 |
| | Avg LLM Calls | 8.3 | 7.4 |
| w/ 3-Step Beam Search | Accuracy | 73.87 | 89.4 |
| | Token Input | 120,159 | 411,283 |
| | Avg LLM Calls | 8.3 | 9.1 |
Table 6. Performance comparison of PoG and PoG-E with and without path summarizing on CWQ and WebQSP datasets.
| Method | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| PoG w/ Path Summarizing | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 297,359 |
| PoG w/o Path Summarizing | Accuracy | 74.7 | 91.9 |
| | Token Input | 273,447 | 458,545 |
| PoG-E w/ Path Summarizing | Accuracy | 80.4 | 91.4 |
| | Token Input | 314,747 | 273,407 |
| PoG-E w/o Path Summarizing | Accuracy | 70.4 | 90.4 |
| | Token Input | 419,679 | 428,545 |
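As a rough illustration of the summarization step, the sketch below builds a summary prompt over the explored paths before the final answering step. The prompt wording and the `call_llm` interface are hypothetical, not the paper's exact template.

```python
def summarize_paths(paths, question, call_llm):
    """Condense explored KG paths into a short, question-focused summary.

    `call_llm(prompt)` is any text-in/text-out LLM interface; the prompt
    wording is illustrative only.
    """
    # Serialize each path as a chain of labeled edges.
    listing = "\n".join(
        " -> ".join(f"{h} --{r}--> {t}" for (h, r, t) in p) for p in paths
    )
    prompt = (
        f"Question: {question}\n"
        f"Explored knowledge graph paths:\n{listing}\n"
        "Summarize only the facts relevant to answering the question."
    )
    return call_llm(prompt)
```

Passing the summary instead of the raw path listing to later prompts is what shrinks the token input in the "w/ Path Summarizing" rows of Table 6.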
B.2. Reasoning Faithfulness Analysis
Overlap ratio between explored paths and ground-truth paths. We analyze correctly answered samples from three datasets to investigate the overlap ratio between the paths $P$ explored by PoG and the ground-truth paths $S$ given by the SPARQL queries. The overlap ratio is defined as the proportion of overlapping relations to the total number of relations in the ground-truth SPARQL path:
$$
Ratio(P)=\frac{|Relation(P)\cap Relation(S)|}{|Relation(S)|},
$$
where $Relation(P)$ denotes the set of relations in path $P$ . Figure 7 illustrates the distribution of questions across different overlap ratios. On the WebQSP dataset, PoG achieves the highest proportion of paths fully overlapping with the ground truth, covering approximately 60% of the correctly answered questions. In contrast, PoG-E on the GrailQA dataset shows the highest proportion of paths sharing no relations with the ground truth, at roughly 70%, indicating that PoG-E explores novel paths to derive the answers. The difference between PoG and PoG-E stems from PoG-E’s strategy of randomly selecting one related edge from each clustered edge. These results highlight the effectiveness of our structure-based path exploration method in generating diverse yet accurate reasoning paths.
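The overlap ratio above translates directly into code; representing $Relation(P)$ and $Relation(S)$ as sets of relation labels is an assumption about the data representation.

```python
def overlap_ratio(path_relations, gold_relations):
    """Ratio(P) = |Relation(P) ∩ Relation(S)| / |Relation(S)|."""
    P, S = set(path_relations), set(gold_relations)
    return len(P & S) / len(S) if S else 0.0
```

A ratio of 1.0 means every ground-truth relation appears in the explored path; 0.0 means the path reached the answer through entirely different relations.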
<details>
<summary>x11.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(a) CWQ (PoG)
<details>
<summary>x12.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(b) CWQ (PoG-E)
<details>
<summary>x13.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(c) WebQSP (PoG)
<details>
<summary>x14.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(d) WebQSP (PoG-E)
<details>
<summary>x15.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(e) GrailQA (PoG)
<details>
<summary>x16.png Details</summary>

Bar chart: percentage of correctly answered questions per overlap-ratio bin (0, (0,25], (25,50], (50,75], (75,100], 100); y-axis: Percentage (%).
</details>
(f) GrailQA (PoG-E)
Figure 7. The path overlap ratio of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
Evidence of answer exploration sources. We conduct an analysis of correctly answered samples from three datasets to investigate the sources of evidence used by the LLM in generating answers, as illustrated in Figure 8. Specifically, we categorize all generated answers into three cases: KG only, LLM-inspired KG, and KG-inspired LLM. In the KG only scenario, answers are generated solely based on KG paths. The LLM-inspired KG case involves the LLM predicting an answer using its inherent knowledge and subsequently using the KG to verify its correctness. Conversely, in the KG-inspired LLM case, the paths generated by the KG are insufficient to reach the answer, and the LLM supplements the reasoning using its inherent knowledge. As shown in the figure, up to 14% of answers are generated through the KG-inspired LLM approach, and up to 9% involve LLM-inspired KG path supplementation. Compared to previous work that integrates LLM inherent knowledge with KG data (Sun et al., 2024), PoG more effectively leverages the strengths of both sources. These results demonstrate that PoG is a faithful reasoning method that primarily relies on KG-based reasoning while being supplemented by the LLM, ensuring both accuracy and interpretability in answer generation.
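The three evidence categories can be stated as a simple decision rule; the boolean inputs below are hypothetical names for signals that would be derived during reasoning, not variables from the implementation.

```python
def evidence_source(answer_supported_by_kg_path, answer_predicted_by_llm_first):
    """Classify how a correct answer was derived (taxonomy of Figure 8)."""
    if answer_supported_by_kg_path and not answer_predicted_by_llm_first:
        return "KG only"          # derived purely from explored KG paths
    if answer_predicted_by_llm_first and answer_supported_by_kg_path:
        return "LLM-inspired KG"  # LLM predicted first, KG verified it
    return "KG-inspired LLM"      # KG paths insufficient, LLM completed them
```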
<details>
<summary>x17.png Details</summary>

Pie charts of answer-evidence sources for PoG. CWQ: 78% KG only, 9% LLM-inspired KG, 12% KG-inspired LLM. WebQSP: 86% / 4% / 9%. GrailQA: 95% / 1% / 3%.
</details>
(a) PoG
<details>
<summary>x18.png Details</summary>

Pie charts of answer-evidence sources for PoG-E. CWQ: 77% KG only, 7% LLM-inspired KG, 14% KG-inspired LLM. WebQSP: 87% / 2% / 10%. GrailQA: 92% / 1% / 6%.
</details>
(b) PoG-E
Figure 8. The proportions of answer evidence of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
B.3. Error Analysis
To further analyze the integration of LLMs and KGs, we conduct an error analysis on the CWQ, WebQSP, and GrailQA datasets. We categorize errors into four types: (1) answer generation error, (2) refuse error, (3) format error, and (4) other hallucination errors. Note that an answer generation error occurs when PoG provides an accurate reasoning path, but the LLM fails to extract the correct answer from it. The distribution of these error types is illustrated in Figure 9. The results indicate that using more powerful LLMs reduces the number of “other hallucination errors,” “refuse errors,” and “answer generation errors,” as the model offers enhanced reasoning capabilities over the retrieved data. In particular, the reduction in “answer generation errors” shows that the reasoning paths provided by PoG are effectively utilized by more advanced LLMs. However, we observe an increase in “format errors” with more powerful LLMs, which may be attributed to their greater creative flexibility.
<details>
<summary>x19.png Details</summary>

Stacked bar chart of error counts (y-axis: Error Samples, 0–250) for PoG and PoG-E with GPT-3.5 and GPT-4 on CWQ, WebQSP, and GrailQA; categories: others hallucination error, answer generation error, refuse answer, and format error.
</details>
Figure 9. The error instances and categories of PoG and PoG-E in the CWQ, WebQSP, and GrailQA datasets.
B.4. Efficiency Analysis
LLM calls cost analysis. To evaluate the cost and efficiency of utilizing LLMs, we conduct an analysis of LLM calls on the CWQ, WebQSP, and GrailQA datasets. First, we examine the proportion of questions answered with varying numbers of LLM calls, as depicted in Figure 10. The results indicate that the majority of questions are answered within nine LLM calls across all datasets, with approximately 80% and 50% of questions being resolved within six calls on CWQ and WebQSP, respectively. These findings demonstrate PoG’s efficiency in minimizing LLM usage costs. Furthermore, we compare the average number of LLM calls required by PoG with the current SOTA method, ToG (Sun et al., 2024), as shown in Table 7. Since we use identical datasets for WebQSP, GrailQA, Simple Questions, and WebQuestions, we report the ToG results from their paper. The comparison reveals that PoG achieves comparable or superior accuracy while reducing the number of LLM calls by up to 40% on the GrailQA dataset compared to ToG. This improvement is attributed to PoG’s dynamic exploration strategy, which avoids starting from scratch, and its effective use of graph structures to prune irrelevant information, thereby significantly decreasing computational costs.
<details>
<summary>x20.png Details</summary>

Bar charts of the distribution of LLM calls per question (x-axis bins (0,3] through 21+; y-axis: Percentage %) for CWQ, WebQSP, and GrailQA.
</details>
Figure 10. The proportion of questions answered by PoG and PoG-E with different numbers of LLM calls on the CWQ, WebQSP, and GrailQA datasets.
Table 7. Average LLM calls per question of PoG and ToG among all datasets.
| Method | CWQ | WebQSP | GrailQA | Simple Questions | WebQuestions |
| --- | --- | --- | --- | --- | --- |
| PoG | 10.7 | 8.3 | 6.5 | 6.1 | 9.3 |
| ToG | - | 11.2 | 10.6 | 8.7 | 10.5 |
Running time analysis. To further evaluate the processing efficiency of PoG, we conducted extensive evaluations of average accuracy, LLM call frequency, and running time on the multi-hop question datasets GrailQA and CWQ. The results, presented in Table 8, compare the performance of ToG and PoG across various beam search strategies. The data indicate that while higher accuracy slightly increases runtime, PoG effectively balances accuracy with reduced LLM call costs. Specifically, PoG reduces LLM call costs by up to 53.5% while increasing accuracy by up to 33.4%, allowing users to tailor the PoG framework to their specific needs regarding accuracy, cost, and runtime. All PoG settings provide significantly lower costs. For instance, on the GrailQA dataset, PoG with 3-step beam search reduces LLM call costs by 39.6% and improves accuracy by 33.1%, with only a 1.14% increase in runtime. A similar trend is observed on the CWQ dataset.
Table 8. Running time evaluation of ToG and PoG with different beam search methods on CWQ and GrailQA.
| Model | Evaluation | CWQ | GrailQA |
| --- | --- | --- | --- |
| ToG | Accuracy | 53.1 | 59.3 |
| | Time (s) | 78.7 | 14.8 |
| | LLM Calls | 21.3 | 10.1 |
| PoG | Accuracy | 81.4 | 92.7 |
| | Time (s) | 118.9 | 21.4 |
| | LLM Calls | 9.1 | 6.5 |
| PoG w/ 3-Step Beam Search | Accuracy | 79.8 | 92.4 |
| | Time (s) | 87.5 | 15.0 |
| | LLM Calls | 8.8 | 6.1 |
| PoG w/ Fuzzy Selection | Accuracy | 57.1 | 83.0 |
| | Time (s) | 109.7 | 15.7 |
| | LLM Calls | 6.8 | 4.7 |
B.5. Case Study
Case study: graph reduction and path pruning. We conducted a case study using the example question presented in Figure 2 to illustrate the effects of graph pruning and path pruning on the graph structure. Figure 11 (a) shows the results of graph pruning, where vertices in blue are selected as part of the question subgraph and vertices in black are pruned. In this sample, the number of entities is reduced from 16,740 to 1,245, a 92% reduction in vertices. Figures 11 (b) and 11 (c) show the question subgraph induced by the blue vertices in Figure 11 (a) and the results after applying fuzzy and precise path selection. In these figures, vertices in blue represent the entities selected after each pruning step, vertices in yellow represent the topic entities, and the vertex in red denotes the final answer entity. From these graphs, we observe that utilizing the graph structure allows for the rapid pruning of irrelevant vertices. Because all vertices within the question subgraph are interconnected with all topic entities, the reasoning paths remain faithful and highly relevant to the question, maintaining the integrity and relevance of the reasoning process.
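The pruning rule just described, keeping only vertices that stay connected to every topic entity within the hop bound, can be sketched with plain BFS over an adjacency map. This is a toy illustration under our own naming, not the paper's released implementation:

```python
from collections import deque

def bounded_reachable(adj, source, max_hops):
    """Return all vertices reachable from `source` within `max_hops` edges (BFS)."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d == max_hops:
            continue
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append((u, d + 1))
    return seen

def prune_graph(adj, topic_entities, max_hops=3):
    """Keep only vertices that every topic entity reaches within max_hops,
    mirroring the subgraph-reduction idea: retained vertices remain
    connected to all topic entities."""
    keep = None
    for t in topic_entities:
        reach = bounded_reachable(adj, t, max_hops)
        keep = reach if keep is None else keep & reach
    return {v: [u for u in nbrs if u in keep]
            for v, nbrs in adj.items() if v in keep}

# Toy undirected graph: A-B-C-D plus an isolated vertex X
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "X": []}
pruned = prune_graph(adj, ["A", "C"], max_hops=2)
```

On the toy graph, vertex D (unreachable from A within two hops) and the isolated vertex X are pruned, leaving the subgraph connected to both topic entities A and C.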
<details>
<summary>x21.png Details</summary>

### Visual Description
## Network Diagram: Node Connectivity
### Overview
The image is a network diagram displaying interconnected nodes. The nodes are represented by blue and black circles, and the connections between them are shown as thin, gray lines. The diagram appears to be a visualization of a complex network, with a higher density of connections in the central region.
### Components/Axes
* **Nodes:** Represented by blue and black circles. The blue circles are larger than the black circles.
* **Edges:** Represented by thin, gray lines connecting the nodes.
* **Layout:** The nodes are arranged in a roughly square shape, with a higher density of nodes and edges in the center.
### Detailed Analysis or ### Content Details
* **Node Distribution:** The nodes are distributed throughout the diagram, with a higher concentration in the central area. The blue nodes appear to be more sparsely distributed than the black nodes.
* **Edge Density:** The density of edges is highest in the center of the diagram, indicating a higher degree of connectivity between nodes in this region. The edges appear to radiate outwards from the center.
* **Node Colors:** There are two distinct node colors: blue and black. The significance of these colors is not explicitly stated in the image.
* **Node Size:** The blue nodes are larger than the black nodes. The significance of the size difference is not explicitly stated in the image.
* **Edge Curvature:** The edges are curved, suggesting a force-directed layout algorithm was used to generate the diagram.
### Key Observations
* The network exhibits a core-periphery structure, with a densely connected core and a more sparsely connected periphery.
* The blue nodes may represent a different type of node or a node with different properties than the black nodes.
* The size of the blue nodes may indicate a higher degree of importance or influence within the network.
### Interpretation
The network diagram likely represents a complex system with interconnected entities. The blue and black nodes could represent different types of entities, and the edges represent the relationships between them. The higher density of connections in the center suggests that this region is a hub of activity or interaction. Without additional context, it is difficult to determine the specific meaning of the diagram, but it provides a visual representation of the network's structure and connectivity. The difference in size between the blue and black nodes could indicate a difference in importance or influence within the network.
</details>
(a) Graph pruning
<details>
<summary>x22.png Details</summary>

### Visual Description
## Network Diagram: Complex Interconnections
### Overview
The image is a network diagram illustrating complex interconnections between numerous nodes. The diagram features a large cluster of densely interconnected dark blue nodes, with connections extending to two distinct orange nodes located outside the main cluster. The density of connections varies, with some nodes acting as central hubs and others having fewer links.
### Components/Axes
* **Nodes:** Represented by dark blue and orange circles. The majority of nodes are dark blue, forming a dense cluster. Two nodes are orange and located outside the main cluster.
* **Edges:** Represented by thin lines connecting the nodes. The lines are primarily dark blue within the main cluster and transition to a tan color as they connect to the orange nodes.
* **Layout:** The main cluster of dark blue nodes is positioned towards the left and center of the image. The two orange nodes are located on the right and bottom-left of the image, respectively.
### Detailed Analysis
* **Dark Blue Nodes:** There are approximately 500-700 dark blue nodes. These nodes are densely interconnected, forming a central network. The connections within this cluster appear to be relatively uniform, with some nodes having a slightly higher degree of connectivity than others.
* **Orange Nodes:** There are two orange nodes. One is located on the right side of the diagram, and the other is located on the bottom-left. These nodes are connected to the main cluster via tan-colored edges.
* **Edges:** The edges within the main cluster are dark blue, matching the color of the nodes. As the edges extend from the main cluster to the orange nodes, they transition to a tan color. The orange node on the right has a significantly higher number of connections compared to the orange node on the bottom-left. The edges connecting to the right orange node are more concentrated, indicating a stronger relationship or higher frequency of interaction.
### Key Observations
* **Central Cluster:** The dark blue nodes form a highly interconnected central cluster, suggesting a strong degree of interaction or relationship among these entities.
* **Peripheral Nodes:** The orange nodes act as peripheral nodes, connected to the main cluster but distinct from it. The orange node on the right appears to be a significant point of interaction with the central cluster, while the orange node on the bottom-left has fewer connections.
* **Color Coding:** The color transition of the edges from dark blue to tan suggests a change in the nature of the relationship or interaction as it extends from the central cluster to the peripheral nodes.
### Interpretation
The network diagram likely represents a system where a large group of entities (dark blue nodes) are closely interconnected and interact frequently. The two orange nodes represent external entities that interact with the central cluster, but to varying degrees. The orange node on the right is a major point of interaction, potentially representing a key external partner or influence. The orange node on the bottom-left has a weaker connection, suggesting a less significant or less frequent interaction. The color change in the edges could indicate a different type of relationship or interaction between the central cluster and the peripheral nodes compared to the interactions within the central cluster itself. The diagram could be used to visualize social networks, communication networks, or any system where entities are interconnected and interact with each other.
</details>
(b) Question subgraph
<details>
<summary>x23.png Details</summary>

### Visual Description
## Network Diagram: Node Connectivity
### Overview
The image is a network diagram illustrating the connectivity between nodes. It features a dense cluster of gray nodes interconnected with blue lines, several larger blue nodes scattered throughout the cluster, and one prominent orange node located on the right side of the diagram. The orange node is connected to the gray node cluster via orange lines.
### Components/Axes
* **Nodes:** Represented by circles. There are three types of nodes:
* Gray nodes: Small, numerous, and densely clustered.
* Blue nodes: Larger than gray nodes, fewer in number, and distributed within the gray node cluster.
* Orange node: Single, located on the right, and connected to the gray node cluster.
* **Edges:** Represented by lines connecting the nodes. There are two types of edges:
* Blue lines: Connect the gray nodes to each other and to the blue nodes.
* Orange lines: Connect the orange node to the gray nodes.
### Detailed Analysis
* **Gray Nodes:** These nodes form the bulk of the network. They are interconnected with blue lines, creating a dense web of connections.
* **Blue Nodes:** These nodes are larger and fewer in number compared to the gray nodes. They are distributed throughout the gray node cluster and are also connected to the gray nodes via blue lines. There are approximately 20-30 blue nodes.
* **Orange Node:** This node is located on the right side of the diagram and is connected to the gray node cluster via orange lines. The orange lines appear to radiate from the orange node towards the gray node cluster.
* **Connectivity:** The gray nodes are highly interconnected, forming a dense network. The blue nodes appear to act as hubs within the gray node network. The orange node serves as a central point connecting to the gray node cluster.
### Key Observations
* The gray nodes form the core of the network, with a high degree of interconnectedness.
* The blue nodes may represent important nodes within the network, given their larger size and distribution.
* The orange node appears to be a significant point of connection to the gray node cluster.
### Interpretation
The network diagram likely represents a complex system where the gray nodes represent individual entities, the blue nodes represent key influencers or hubs, and the orange node represents a central authority or external connection. The density of the blue lines suggests a high degree of interaction and communication within the gray node cluster. The orange node's connection to the gray node cluster via orange lines indicates a flow of information or influence from the orange node to the gray nodes. The diagram could represent social networks, communication networks, or any other system where interconnectedness and influence are key factors.
</details>
(c) Fuzzy selection
<details>
<summary>x24.png Details</summary>

### Visual Description
## Network Diagram: Node Connectivity
### Overview
The image is a network diagram illustrating the connectivity between nodes. It features a large cluster of gray nodes, with connections to several highlighted nodes in orange, blue, and red. The density of connections varies, suggesting different levels of interaction between nodes.
### Components/Axes
* **Nodes:** Represented by circles. Most nodes are gray, while some are colored orange, blue, and red.
* **Edges:** Represented by lines connecting the nodes. The edges are colored to match the color of the node they connect to.
* **Colors:**
* Gray: Represents the majority of nodes in the network.
* Orange: Represents nodes with connections primarily to the gray cluster.
* Blue: Represents nodes with connections primarily to the gray cluster.
* Red: Represents a node with connections primarily to the gray cluster.
### Detailed Analysis
* **Gray Nodes:** There are approximately 500-700 gray nodes forming a central cluster. These nodes are interconnected with each other via gray edges.
* **Orange Nodes:** There are two orange nodes. One is located in the top-right corner, and the other is located in the bottom-left corner. These nodes are connected to the gray cluster via orange edges. The top-right orange node has a higher density of connections to the gray cluster than the bottom-left orange node.
* **Blue Nodes:** There are two blue nodes. One is located in the center-right of the gray cluster, and the other is located in the top-left of the gray cluster. These nodes are connected to the gray cluster via blue edges.
* **Red Node:** There is one red node located in the center of the gray cluster. This node is connected to the gray cluster via red edges.
### Key Observations
* The gray nodes form a dense, interconnected cluster.
* The orange nodes act as peripheral nodes with connections to the main cluster.
* The blue and red nodes are embedded within the gray cluster.
* The density of connections varies between the different colored nodes.
### Interpretation
The network diagram likely represents a system where the gray nodes are the primary components, and the colored nodes represent different types of interactions or roles within the system. The orange nodes could represent external entities interacting with the main system, while the blue and red nodes could represent key internal components or control points. The varying density of connections suggests different levels of influence or interaction between these components. The diagram could be used to visualize social networks, biological systems, or any other complex network of interconnected entities.
</details>
(d) Precise selection
Figure 11. Visualization of graph reduction and path selection.
Case study: interpretable reasoning. In this section, we present Table 9, which illustrates PoG’s interpretability through case studies involving questions with one, two, and three entities. These examples demonstrate PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing clear and understandable reasoning paths that lead to accurate answers.
Table 9. Examples of faithful and interpretable reasoning of PoG.
| Field | Content |
| --- | --- |
| Question | Lou Seal is the mascot for the team that last won the World Series when? |
| Answer | 2014 World Series |
| Topic Entity | Lou Seal |
| LLM indicator | "Lou Seal" - mascot for - team - last won - World Series - when |
| Split Question | split_question 1: What team is "Lou Seal" the mascot for? |
| | split_question 2: When did the last World Series win occur for the team with Lou Seal as the mascot? |
| Reasoning Path | {Lou Seal} $→$ sports.mascot.team $→$ {San Francisco Giants} $→$ sports.sports_team.championships $→$ {2014 World Series, 2012 World Series, 2010 World Series} |
| PoG answer | answer: {2014 World Series} |
| | reason: From the given path {Lou Seal} $→$ sports.mascot.team $→$ {San Francisco Giants}, {San Francisco Giants} is the team that "Lou Seal" serves as the mascot for. From {San Francisco Giants} $→$ sports.sports_team.championships $→$ {2014 World Series, 2012 World Series, 2010 World Series}, the last World Series won by the {San Francisco Giants} was in {2014}. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {2014 World Series}. |
| Question | What European Union country sharing borders with Germany contains the Lejre Municipality? |
| Answer | Denmark |
| Topic Entity | {Germany, Lejre Municipality} |
| LLM indicator | "Lejre Municipality" - contained by - European Union country (answer) - shares borders with - "Germany" |
| Split Question | split_question 1: What European Union country shares borders with "Germany"? |
| | split_question 2: What country contains the Lejre Municipality? |
| Reasoning Path | {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark} $→$ location.location.containedby $→$ {Eurasia} $→$ location.location.contains $→$ {Germany} |
| | {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark} $→$ base.locations.countries.continent $→$ {Europe} $→$ location.location.contains $→$ {Western Europe} $→$ location.location.contains $→$ {Germany} |
| PoG answer | answer: {Denmark} |
| | reason: From {Denmark} $→$ location.location.containedby $→$ {Eurasia} $→$ location.location.contains $→$ {Germany}, Denmark is a European Union country that shares borders with "Germany". From {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark}, Denmark is the country that contains the Lejre Municipality. Both paths lead to Denmark, which is the European Union country sharing borders with Germany and containing the Lejre Municipality. Therefore, the provided knowledge graph path is sufficient to answer the main question, and the answer is {Denmark}. |
| Question | What Central Time Zone continent do the Falkland Islands belong to? |
| Answer | Americas |
| Topic Entity | {Falkland Islands, Continent, Central Time Zone} |
| LLM indicator | answer("Central Time Zone") - owns - answer("Continent") - located contain - "Falkland Islands" |
| Split Question | split_question 1: What continent do the Falkland Islands belong to? |
| | split_question 2: What Central Time Zone does the continent belong to? |
| Reasoning Path | {Central Time Zone} $→$ time.time_zone.locations_in_this_time_zone $→$ {Americas} $→$ common.topic.notable_types $→$ {Continent} $→$ common.topic.notable_types $→$ {Americas} $→$ location.location.contains $→$ {Falkland Islands} |
| PoG answer | answer: {Americas} |
| | reason: From {Continent} $→$ common.topic.notable_types $→$ {Americas} $→$ location.location.contains $→$ {Falkland Islands}, the Americas is the continent the Falkland Islands belong to. From {Central Time Zone} $→$ time.time_zone.locations_in_this_time_zone $→$ {Americas} $→$ common.topic.notable_types $→$ {Continent}, the Americas lies in the Central Time Zone. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {Americas}. |
Appendix C Experiment Details
Experiment datasets. To evaluate PoG’s capability in multi-hop knowledge-intensive reasoning tasks, we assess it on four KBQA datasets: three multi-hop datasets (CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021)) and one single-hop dataset (SimpleQuestions (Petrochuk and Zettlemoyer, 2018)). Additionally, to examine PoG on more general tasks, we include an open-domain QA dataset, WebQuestions. For large datasets such as CWQ, GrailQA, and SimpleQuestions, we utilize a random sample of 1,000 test cases from CWQ and employ the 1,000 samples previously reported by ToG (Sun et al., 2024) to facilitate comparison with the SOTA while minimizing computational costs. Freebase serves as the background knowledge graph for all datasets, encompassing approximately 88 million entities, 20,000 relations, and 126 million triples (Luo et al., 2024; Bollacker et al., 2008). The statistics of the datasets utilized in this study are detailed in Table 10. The source code is publicly available at https://github.com/SteveTANTAN/PoG.
Baselines. Inspired by ToG (Sun et al., 2024), we compare our method with standard prompting (IO), Chain-of-Thought (CoT), and Self-Consistency (SC) prompting with six in-context exemplars and "step-by-step" reasoning chains. For each dataset, we also include previous SOTA works for comparison. For a fair comparison, we report both the previous SOTA among prompting-based methods and the previous SOTA among all methods. Since ToG is the current SOTA prompting-based method, we directly refer to their results and those of other baselines reported in their paper.
Experimental implementation. Leveraging the plug-and-play convenience of our framework, we experiment with two LLMs: GPT-3.5 and GPT-4. We use the OpenAI API to access GPT-3.5 (GPT-3.5-turbo) and GPT-4. Aligning with ToG, we set the temperature parameter to 0.4 during the exploration process (to increase diversity) and to 0 during the reasoning process (to ensure reproducibility). The maximum token length for generation is set to 256. In all experiments, we set both $W_{\max}$ and $D_{\max}$ to 3 for beam search. All experiments are conducted on a machine with an Intel Xeon Gold 6248R CPU, an Nvidia A5000 GPU, and 512GB of memory.
Table 10. Statistics of the datasets used in this paper. † denotes we randomly select 1,000 samples from the CWQ test set to create the experiment testing set due to the abundance of test samples. ∗ denotes that we utilize the 1,000 samples reported by ToG (Sun et al., 2024) to compare with the state-of-the-art.
| Dataset | Answer Format | Test | Train |
| --- | --- | --- | --- |
| ComplexWebQuestions (CWQ) † | Entity | 1,000 | 27,734 |
| WebQSP | Entity/Number | 1,639 | 3,098 |
| GrailQA ∗ | Entity/Number | 1,000 | 44,337 |
| Simple Question ∗ | Entity/Number | 1,000 | 14,894 |
| WebQuestions | Entity/Number | 2,032 | 3,778 |
Appendix D SPARQL
This section outlines the pre-defined SPARQL queries used for interacting with the knowledge graph and constructing graphs for our experiments.
D.1. 1-hop Entity and Relation Search
To facilitate the retrieval of 1-hop neighbors of entities within the Freebase Knowledge Graph, we have predefined a SPARQL query. This query is designed to be executed by simply substituting the appropriate ID for the query entity ID. It returns the connected entities’ IDs and their associated relations’ IDs, indicating whether the connected entity is at the tail or the head of the relation.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    BIND("head" AS ?direction)
  }
}
```
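In practice, this template is instantiated by substituting the entity ID and posting the query to a SPARQL endpoint. Below is a minimal stdlib-only sketch; the helper names, endpoint handling, and example ID are our assumptions, not the released implementation:

```python
import json
import urllib.parse
import urllib.request

# Mirrors the 1-hop query above; {eid} marks where the entity ID is substituted.
ONE_HOP_TEMPLATE = """PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {{
  {{ ns:{eid} ?relation ?connectedEntity . BIND("tail" AS ?direction) }}
  UNION
  {{ ?connectedEntity ?relation ns:{eid} . BIND("head" AS ?direction) }}
}}"""

def build_one_hop_query(entity_id):
    """Substitute a Freebase entity ID (e.g. 'm.0k8z') into the template."""
    return ONE_HOP_TEMPLATE.format(eid=entity_id)

def run_query(endpoint_url, query):
    """POST the query to a SPARQL endpoint and return the JSON bindings."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    req = urllib.request.Request(endpoint_url, data=data)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["results"]["bindings"]

q = build_one_hop_query("m.0k8z")  # illustrative entity ID
```

The same substitution pattern applies to the description and 1-hop subgraph queries that follow.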
D.2. Short Textual Description
The following predefined function implements the retrieval of short textual descriptions, $\mathcal{D}(.)$ , for converting the identifiers (IDs) of entities or relations into natural language descriptions.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?tailEntity
WHERE {
  {
    ?entity ns:type.object.name ?tailEntity .
    FILTER (?entity = ns:ID)
  }
  UNION
  {
    ?entity <http://www.w3.org/2002/07/owl#sameAs> ?tailEntity .
    FILTER (?entity = ns:ID)
  }
}
```
D.3. 1-hop Subgraph Search
To facilitate subgraph detection in Section 4.1, we implement the 1-hop subgraph detection feature by integrating the SPARQL functions described in Appendices D.1 and D.2. This function retrieves, in a single SPARQL query, the 1-hop neighbors of a given query entity with their IDs, natural language names, and connecting relations, specifying whether the connected entity is at the tail or the head of the relation.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?connectedEntityName ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = 'en')
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = 'en')
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("head" AS ?direction)
  }
}
```
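The JSON bindings returned by this combined query can be flattened into (relation, entity ID, name, direction) tuples before graph construction. A small parsing sketch follows; the helper name and sample binding are illustrative, and the variable names match the query above:

```python
def parse_one_hop_bindings(bindings):
    """Flatten SPARQL JSON bindings from the 1-hop subgraph query into
    (relation_id, entity_id, entity_name, direction) tuples, stripping
    the Freebase namespace prefix from returned URIs."""
    ns = "http://rdf.freebase.com/ns/"
    rows = []
    for b in bindings:
        rows.append((
            b["relation"]["value"].replace(ns, ""),
            b["connectedEntity"]["value"].replace(ns, ""),
            b["connectedEntityName"]["value"],
            b["direction"]["value"],
        ))
    return rows

# A mock binding in the SPARQL JSON results shape (IDs are illustrative)
sample = [{
    "relation": {"value": "http://rdf.freebase.com/ns/sports.mascot.team"},
    "connectedEntity": {"value": "http://rdf.freebase.com/ns/m.0713r"},
    "connectedEntityName": {"value": "San Francisco Giants"},
    "direction": {"value": "tail"},
}]
rows = parse_one_hop_bindings(sample)
```

Each tuple then corresponds to one labeled edge of the 1-hop neighborhood used when building the question subgraph.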
Appendix E Prompts
In this section, we detail the prompts required for our main experimental procedures.
Question Analysis Prompt Template
You will receive a multi-hop question, which is composed of several interconnected queries, along with a list of topic entities that serve as the main keywords for the question. Your task is to break the question into simpler parts, using each topic entity once and provide a Chain of Thought (CoT) that shows how the topic entities are related. Note: Each simpler question should explore how one topic entity connects to others or the answer. The goal is to systematically address each entity to derive the final answer. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} A:
LLM Supplement Prompt Template
Using the main question, a possibly uncertain chain of thought generated by a language model, some related split questions, paths from the ”Related_paths” section, and main topic entities: please first provide three predicted results, and second offer three possible chains of thought that could lead to these results, using the provided knowledge paths and your own knowledge. If any answers are unclear, suggest alternative answers to fill in the gaps in the chains of thought, following the same format as the provided examples. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} Think Indicator:{Think Indicator} Split Question:{Split Question} A:
where {Think Indicator} and {Split Question} are obtained in Section 4.1. An indicator example is shown in Figure 2.
Precise Path Select Prompt Template
Given a main question, an LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: please score and return the top three paths from the candidate path set that are most likely to answer the question. In-Context Few-shot Q: {Query} Think Indicator: {Think Indicator} Split Question: {Split Question} Candidate Paths: {Candidate Paths} A:
{Candidate Paths} denotes the retrieved reasoning paths $Final\_paths$ to be selected in this request, which are formatted as a series of structural sentences:
$\{e_{0x},...,e_{0z}\}→ r_{1_{i}}→\{e_{1x},...,e_{1z}\}→...$ $→ r_{l_{j}}→\{e_{lx},...,e_{lz}\}$
$...$
$\{e_{0x},...,e_{0z}\}→ r_{1_{i}}→\{e_{1x},...,e_{1z}\}→...$ $→ r_{l_{j}}→\{e_{lx},...,e_{lz}\}$ ,
where $i$ and $j$ in $r_{1_{i}}$ and $r_{l_{j}}$ represent the $i$-th and $j$-th relations from each relation edge in the clustered question subgraph, and each entity $e$ is represented by its ID together with its natural language name $\mathcal{D}(ID)$.
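As an illustration of this format, the structural sentences can be serialized from alternating entity sets and relations. The sketch below is ours (the helper name and entity IDs are hypothetical; the released code may format paths differently):

```python
def format_path(hops):
    """Serialize a reasoning path given as alternating entity sets and
    relation names into the structural-sentence form
    {e, ...} -> r -> {e, ...} used in the prompt."""
    parts = []
    for hop in hops:
        if isinstance(hop, set):  # an entity set: braces, sorted for determinism
            parts.append("{" + ", ".join(sorted(hop)) + "}")
        else:                     # a relation name
            parts.append(hop)
    return " -> ".join(parts)

# Hypothetical IDs paired with their natural-language descriptions D(ID)
path = format_path([{"m.04n3p: Lou Seal"}, "sports.mascot.team",
                    {"m.0713r: San Francisco Giants"}])
```

One such line is produced per candidate path, and the resulting block is pasted into the {Candidate Paths} slot of the template.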
Path Summarizing Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, the associated accurately retrieved knowledge paths from the Related paths section, and the main topic entities. Your task is to summarize the provided knowledge triples in the Related paths section and generate a chain of thought from the knowledge triples related to the main topic entities of the question, which will be used for generating the answer to the main question and split questions. You have to make sure you summarize correctly using the provided knowledge triples; you can only use entities with IDs from the given path, and you cannot skip steps. In-Context Few-shot Q: {Query} Think Indicator: {Think Indicator} Split Question: {Split Question} Related Paths: {Related Paths} A:
{Related_Paths} has the same format as {Candidate_Paths} above.
Question Answering Evaluation Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use to finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}. Your task is to determine whether this knowledge graph path is sufficient to answer the given split questions first and then the main question. If it is sufficient, respond {Yes} and provide the answer to the main question. If the answer is obtained from the given knowledge path, it should be an entity name from the path. Otherwise, respond {No} and explain the reason. In-Context Few-shot Q: {Query} Think Indicator: {Think Indicator} Split Question: {Split Question} Related Paths: {Related Paths} A:
Question Answering Generation Prompt Template
Given a main question, an uncertain LLM-generated thinking Cot that considers all the entities, a few split questions that you can use one by one and finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id start with ”m.”)} -¿ {set of relationships} -¿ {set of entities(with id start with ”m.”)}, Your task is to generate the answer based on the given knowledge graph path and your own knowledge. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A: