# Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning
**Authors**: Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang
> 0009-0000-7232-7051, University of New South Wales; Data61, CSIRO; Sydney, Australia
> 0000-0003-3554-3219, University of New South Wales; Sydney, Australia
> 0000-0001-7895-9551, Data61, CSIRO; Sydney, Australia
> 0000-0002-2273-1862, Data61, CSIRO; Sydney, Australia
> 0000-0002-9167-1613, Data61, CSIRO; Sydney, Australia
> 0000-0001-6572-2600, University of New South Wales; Sydney, Australia
(2025)
## Abstract
Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and a lack of relevant knowledge, especially in deep, complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and then introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate that PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
**Keywords:** Large Language Models; Knowledge Graph; Knowledge Graph Question Answering; Retrieval-Augmented Generation

> journalyear: 2025; copyright: acmlicensed; booktitle: Proceedings of the ACM Web Conference 2025 (WWW ’25), April 28–May 2, 2025, Sydney, NSW, Australia; doi: 10.1145/3696410.3714892; isbn: 979-8-4007-1274-6/25/04; ccs: Information systems, Question answering
## 1. Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: LLM Reasoning Process for Question Answering
### Overview
This diagram illustrates a multi-step reasoning process employed by a Large Language Model (LLM) – specifically GPT-3.5/GPT-4 – to answer a complex question: "What country bordering France contains an airport that serves Nijmegen?". The diagram is divided into four sections (a, b, c, d) representing different stages of the reasoning process, from initial response generation to final answer selection. Each stage utilizes different techniques, including chain-of-thought prompting, knowledge graph (KG) exploration, and subgraph answering.
### Components/Axes
The diagram consists of several key components:
* **Question Box:** Top-left corner, containing the initial question.
* **LLM Blocks:** Representing the LLM's processing steps.
* **Knowledge Graph (KG) Visualizations:** Showing entities and relationships.
* **Flow Arrows:** Indicating the direction of information flow.
* **Answer Boxes:** Highlighting the LLM's responses at each stage.
* **Checkmarks/Crosses:** Indicating success or failure of a reasoning step.
* **Stage Labels:** (a), (b), (c), (d) denoting the different stages.
* **Sub-component Labels:** "Question Analysis", "Reasoning Path Exploration", "Reasoning Path Pruning".
### Detailed Analysis or Content Details
**Stage (a): GPT-3.5/GPT-4 LLM only**
* **Question:** "What country bordering France contains an airport that serves Nijmegen?"
* **LLM Response:** "Nijmegen is served by airports in the neighboring countries, and one of the closest major ones is Brussels Airport (BRU) in Belgium, which is relatively near Nijmegen compared to other major airports. The answer is Belgium."
* **Output:** Belgium.
**Stage (b): LLM empowered KG exploration search**
* **Entities:** France, Nijmegen.
* **KG Triples (ToG):** “[France, location.location.containedby, Europe]”, “[France, location.location.geolocation, Unnamed Entity]”, “[Nijmegen, second_level_division, Netherlands]”.
* **LLM Response:** "First, Nijmegen is a city in the Netherlands. Second, the Netherlands is a country bordering France. The answer is {Netherlands}."
* **Output:** Netherlands.
**Stage (c): LLM empowered KG subgraph answering**
* **Text:** "(MindMap): MindMap cannot prompt the LLM to construct and generate the graph description document, since the retrieved subgraph is extremely large and dense."
* **Output:** Refuses to answer.
**Stage (d): PoG (Path of Graph)**
* **Reasoning Paths:**
* Nijmegen → nearby → Weeze Airport → contain by → Germany
* Weeze Airport → contain by → Germany
* Germany → adjoins → Unnamed Entity → adjoins → France
* **LLM Response:** "From the provided knowledge graph path, the entity {Germany} is the country that contains an airport serving Nijmegen and also the country bordering France. Therefore, the answer to the main question 'What country bordering France contains an airport that serves Nijmegen?' is {Germany}."
* **Output:** Germany.
### Key Observations
* The LLM initially provides an incorrect answer (Belgium) based on proximity to an airport.
* KG exploration leads to a second answer (Netherlands), which is also incorrect.
* Subgraph answering fails due to the complexity of the retrieved graph.
* The final stage, utilizing a Path of Graph (PoG) approach, successfully identifies the correct answer (Germany) by tracing relationships within the knowledge graph.
* The diagram highlights the importance of structured knowledge representation (KG) and reasoning paths for accurate question answering.
* The diagram demonstrates the iterative nature of the LLM's reasoning process, with multiple attempts and refinements.
### Interpretation
The diagram illustrates the challenges and advancements in LLM-based question answering. The initial response demonstrates the LLM's ability to leverage general knowledge but also its susceptibility to errors. The subsequent stages showcase the benefits of integrating external knowledge sources (KG) and employing structured reasoning techniques (PoG). The failure of subgraph answering underscores the limitations of current LLMs in handling complex knowledge graphs. The final success with PoG suggests a promising direction for improving the accuracy and reliability of LLM-based question answering systems. The diagram is a visual representation of the LLM's "thought process," revealing the steps taken to arrive at a solution and the potential pitfalls encountered along the way. The use of checkmarks and crosses provides a clear indication of the success or failure of each reasoning step, allowing for a detailed analysis of the LLM's performance.
</details>
Figure 1. Representative workflow of four LLM reasoning paradigms.
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks (Brown, 2020; Chowdhery et al., 2023; Touvron et al., 2023; Besta et al., 2024; Huang et al., 2025). These models leverage pre-training techniques by scaling to billions of parameters and training on extensive, diverse, and unlabelled data (Touvron et al., 2023; Rawte et al., 2023). Despite these impressive capabilities, LLMs face two well-known challenges. First, they struggle with deep and responsible reasoning when tackling complex tasks (Petroni et al., 2020; Talmor et al., 2018; Khot et al., 2022). Second, the substantial cost of training makes it difficult to keep models updated with the latest knowledge (Sun et al., 2024; Wen et al., 2024), leading to errors when answering questions that require specialized information not included in their training data. For example, in Figure 1 (a), though models like GPT can generate reasonable answers for knowledge-specific questions, these answers may be incorrect due to outdated information or hallucinated reasoning over the LLM's inherent knowledge base (KB).
To deal with the problems of erroneous reasoning and knowledge gaps, the plan-retrieval-answering method has been proposed (Luo et al., 2024; Zhao et al., 2023; Li et al., 2023b). In this approach, LLMs are prompted to decompose complex reasoning tasks into a series of sub-tasks, forming a plan. Simultaneously, external KBs are retrieved to answer each step of the plan. However, this method still relies heavily on the reasoning abilities of LLMs rather than on the faithfulness of the retrieved knowledge. The generated reasoning steps guide information selection, but answers are chosen based on the LLM’s interpretation of the retrieved knowledge rather than on whether the selection leads to a correct and faithful answer.
To address these challenges, incorporating external knowledge sources like Knowledge Graphs (KGs) is a promising solution to enhance LLM reasoning (Sun et al., 2024; Luo et al., 2024; Pan et al., 2024; Luo et al., 2023). KGs offer abundant factual knowledge in a structured format, serving as a reliable source to improve LLM capabilities. Knowledge Graph Question Answering (KGQA) serves as an approach for evaluating the integration of KGs with LLMs, which requires machines to answer natural language questions by retrieving relevant facts from KGs. These approaches typically involve: (1) identifying the initial entities from the question, and (2) iteratively retrieving and refining inference paths until sufficient evidence has been obtained. Despite their success, they still face challenges such as handling multi-hop reasoning problems, addressing questions with multiple topic entities, and effectively utilizing the structural information of graphs.
Challenge 1: Multi-hop reasoning problem. Current methods (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024), such as the ToG model presented in Figure 1 (b), begin by exploring from each topic entity, with LLMs selecting connected knowledge triples like (France, contained_by, Europe). This process relies on the LLM’s inherent understanding of these triples. However, focusing on one-hop neighbors can result in plausible but incorrect answers and prematurely exclude correct ones, especially when multi-hop reasoning is required. Additionally, multi-hop reasoning introduces significant computational overhead, making efficient pruning essential, especially in dense and large KGs.
Challenge 2: Multi-entity question. As shown in Figure 1 (b), existing work (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024) typically explores KG for each topic entity independently. When a question involves multiple entities, these entities are examined in separate steps without considering their interconnections. This approach can result in a large amount of irrelevant information in the candidate set that does not connect to the other entities in the question, leading to suboptimal results.
Challenge 3: Utilizing graph structure. Existing methods (Wen et al., 2024; Guo et al., 2023; Chen et al., 2024) often overlook the inherent graph structures when processing retrieved subgraphs. For example, the MindMap model in Figure 1 (c) utilizes LLMs to generate text-formatted subgraphs from KG triples, converting them into graph descriptions that are fed back into the LLM to produce answers. This textual approach overlooks the inherent structural information of graphs and can overwhelm the LLM when dealing with large graphs. Additionally, during KG information selection, most methods use in-context learning by feeding triples into the LLM, ignoring the overall graph structure.
Contributions. In this paper, we introduce a novel method, Paths-over-Graph (PoG). Unlike previous studies that utilize knowledge triples for retrieval (Sun et al., 2024; Ma et al., 2024), PoG employs knowledge reasoning paths, which contain all the topic entities along an extended reasoning chain, as retrieval-augmented input for LLMs. The paths in KGs serve as logical reasoning chains, providing KG-supported, interpretable reasoning logic that addresses issues related to the lack of specific knowledge background and unfaithful reasoning paths.
To address the multi-hop reasoning problem, as shown in Figure 1 (d), PoG first performs question analysis to extract topic entities from questions. Utilizing these topic entities, it decomposes the complex question into sub-questions and generates an LLM thinking indicator termed "Planning". This planning not only serves as an answering strategy but also predicts the implied relationship depths between the answer and each topic entity. The multi-hop paths are then explored starting from a predicted depth, enabling a dynamic search process. Previous approaches using planning usually retrieve information from scratch, which often confuses LLMs with semantic information drawn from the source entity's neighborhood. In contrast, our method ensures that LLMs follow accurate reasoning paths that directly lead to the answer.
To address multi-entity questions, PoG employs a three-phase exploration process to traverse reasoning paths from the retrieved question subgraph. All paths must contain all topic entities in the same order as they occur in the LLM thinking indicator. Because these reasoning paths exist in the KG, they are inherently logical and faithful. Each path potentially contains one possible answer and serves as interpretable reasoning logic. The exploration leverages the inherent knowledge of both the LLM and the KG.
To effectively utilize graph structure, PoG captures the question subgraph by expanding topic entities to their maximal-depth neighbors, applying graph clustering and reduction to reduce graph search costs. In the path pruning phase, we select possible correct answers from numerous candidates. All explored paths undergo a three-step beam search pruning, integrating graph structures, LLM prompting, and a pre-trained language understanding model (e.g., SBERT) to ensure effectiveness and efficiency. Additionally, inspired by Graph of Thoughts (GoT) (Besta et al., 2024), to reduce LLM hallucination, PoG prompts LLMs to summarize the obtained Top- $W_{\max}$ paths before evaluating the answer, where $W_{\max}$ is a user-defined maximum width in the path pruning phase. In summary, the advantages of PoG can be summarized as follows:
- Dynamic deep search: Guided by LLMs, PoG dynamically extracts multi-hop reasoning paths from KGs, enhancing LLM capabilities in complex knowledge-intensive tasks.
- Interpretable and faithful reasoning: By utilizing highly question-relevant knowledge paths, PoG improves the interpretability of LLM reasoning, enhancing the faithfulness and question-relatedness of LLM-generated content.
- Efficient pruning with graph structure integration: PoG incorporates efficient pruning techniques in both the KG and reasoning paths to reduce computational costs, mitigate LLM hallucinations caused by irrelevant noise, and effectively narrow down candidate answers.
- Flexibility and effectiveness: a) PoG is a plug-and-play framework that can be seamlessly applied to various LLMs and KGs. b) PoG allows frequent knowledge updates via the KG, avoiding the expensive and slow updates required for LLMs. c) PoG reduces LLM token usage by over 50% with only a ±2% difference in accuracy compared to the best-performing strategy. d) PoG achieves state-of-the-art results on all the tested KGQA datasets, outperforming the strong baseline ToG by an average of 18.9% accuracy using both GPT-3.5 and GPT-4. Notably, PoG with GPT-3.5 can outperform ToG with GPT-4 by up to 23.9%.
## 2. Related Work
KG-based LLM reasoning. KGs provide structured knowledge valuable for integration with LLMs (Pan et al., 2024). Early studies (Peters et al., 2019; Luo et al., 2024; Zhang et al., 2021; Li et al., 2023b) embed KG knowledge into neural networks during pre-training or fine-tuning, but this can reduce explainability and hinder efficient knowledge updating (Pan et al., 2024). Recent methods combine KGs with LLMs by converting relevant knowledge into textual prompts, often ignoring structural information (Pan et al., 2024; Wen et al., 2024). Advanced works (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) involve LLMs directly exploring KGs, starting from an initial entity and iteratively retrieving and refining reasoning paths until the LLM decides the augmented knowledge is sufficient. However, by starting from a single vertex and ignoring the question’s position within the KG’s structure, these methods overlook multiple topic entities and the explainability provided by multi-entity paths.
Reasoning with LLM prompting. LLMs have shown significant potential in solving complex tasks through effective prompting strategies. Chain of Thought (CoT) prompting (Wei et al., 2022) enhances reasoning by following logical steps in few-shot learning. Extensions like Auto-CoT (Zhang et al., 2023), Complex-CoT (Fu et al., 2022), CoT-SC (Wang et al., 2022), Zero-Shot CoT (Kojima et al., 2022), ToT (Yao et al., 2024), and GoT (Besta et al., 2024) build upon this approach. However, these methods often rely solely on knowledge present in training data, limiting their ability to handle knowledge-intensive or deep reasoning tasks. To address this, some studies integrate external KBs using plan-and-retrieval methods such as CoK (Li et al., 2023b), RoG (Luo et al., 2024), and ReAct (Yao et al., 2022), decomposing complex questions into subtasks to reduce hallucinations. However, they may focus on the initial steps of sub-problems and overlook the later steps leading to final answers, producing locally optimal solutions instead of globally optimal ones. To address these deep reasoning challenges, we introduce dynamic multi-hop question reasoning. By adaptively determining reasoning depths for different questions, we enable the model to handle varying complexities effectively.
KG information pruning. Graphs are widely used to model complex relationships among different entities (Tan et al., 2023; Sima et al., 2024; Li et al., 2024b, a). KGs contain vast amounts of facts (Guo et al., 2019; Wu et al., 2024; Wang et al., 2024a), making it impractical to include all relevant triples in the context of the LLM due to high costs and potential noise (Wang et al., 2024b). Existing methods (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) typically identify initial entities and iteratively retrieve reasoning paths until an answer is reached, often treating the LLM as a function executor and relying on in-context learning or fine-tuning, which is expensive. Some works attempt to reduce pruning costs. KAPING (Baek et al., 2023a) projects questions and triples into the same semantic space to retrieve relevant knowledge via similarity measures. KG-GPT (Kim et al., 2023) decomposes complex questions and matches sub-questions with relevant relations to form evidence triples. However, these methods often overlook the overall graph structure and the interrelations among multiple topic entities, leading to suboptimal pruning and reasoning performance.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Exploration and Question Answering
### Overview
This diagram illustrates a system for exploring a knowledge graph to answer a natural language question. The process begins with question initialization, proceeds through knowledge graph exploration, path pruning, and culminates in question answering. The diagram depicts a flow of information through several stages, utilizing various techniques like LLM supplementation and fuzzy/precise selection.
### Components/Axes
The diagram is segmented into four main sections: Initialization (top-right), Exploration (right), Path Pruning (bottom-center), and Question Answering (bottom-right). Nodes represent entities or concepts within the knowledge graph. Edges represent relationships between these entities. The diagram uses color-coding to differentiate paths and processes.
* **Initialization:** Contains "Question" box with the query: "What country bordering France contains an airport that serves Nijmegen?". Also includes "Topic Entity Recognition", "Question Subgraph Detection", and "Split Questions".
* **Exploration:** Includes "Topic Entity Path Exploration", "LLM Supplement Path Exploration", "Node Expand Exploration".
* **Path Pruning:** Contains "Fuzzy Selection", "Precise Path Selection", "Branch Reduced Selection", and "Path Summarizing".
* **Question Answering:** Contains a decision node ("Yes"/"No") leading to "Answer".
* **Entities:** Netherlands, Nijmegen, Weeze Airport, Ryanair, Germany, France, Kingdom of the Netherlands, Europe, Western Europe, Central European Time Zone, Lyon-Saint Exupéry Airport, Unnamed Entity (appears multiple times), Olympics (2000, 2002, 1924).
* **Relationships:** olympic_athletes, athlete_affiliation, nearby, airports, second_level_division, airport_type, public_airport, airports_of_this_type, containedby, adjoin_s, contain, country, time_zones, in_this_time_zone, user_topics, participating_countries.
* **Colors:** Paths are represented by different colors (red, green, blue, purple) to indicate different exploration routes. The "Fuzzy Selection" component uses a gradient from light to dark yellow.
### Detailed Analysis or Content Details
The diagram shows a question ("What country bordering France contains an airport that serves Nijmegen?") being processed through a knowledge graph.
1. **Initialization:** The question is parsed, and key entities (Nijmegen, France) are identified. A subgraph related to the question is detected.
2. **Exploration:** The system explores paths originating from the identified entities.
* **Netherlands** is connected to "olympic_athletes" and "Unnamed Entity" via "athlete_affiliation". It's also connected to "Nijmegen" via "nearby" and "Weeze Airport" via "airports".
* **Nijmegen** is connected to "Weeze Airport" via "airports".
* **Weeze Airport** is a "Public Airport" of type "airport".
* **Germany** contains "Weeze Airport" via "containedby".
* **France** is connected to "participating countries" via "user_topics". It also contains "Lyon-Saint Exupéry Airport" via "containedby".
* **Kingdom of the Netherlands** contains "Netherlands" via "country".
* **Europe** contains "Germany" and "Kingdom of the Netherlands" via "containedby".
* **Central European Time Zone** is in "Europe, Western Europe" via "in_this_time_zone".
3. **Path Pruning:** Multiple paths are explored, and a pruning process is applied.
* **Fuzzy Selection:** Paths are evaluated based on an "Indicator" (H1) and "Paths_Set" (H Path). The color gradient suggests a degree of confidence or relevance.
* **Precise Path Selection:** Paths are refined based on more specific criteria.
* **Branch Reduced Selection:** Redundant or less promising branches are eliminated.
4. **Question Answering:** The remaining paths are summarized, and a decision is made (Yes/No) to determine the answer.
### Key Observations
* The diagram highlights the iterative nature of knowledge graph exploration. Multiple paths are explored simultaneously.
* The "Fuzzy Selection" stage suggests the use of probabilistic or approximate matching techniques.
* The diagram emphasizes the importance of path pruning to reduce computational complexity and improve accuracy.
* The question is complex, requiring multiple hops through the knowledge graph to find the answer.
* The use of "Unnamed Entity" suggests that some entities in the knowledge graph are not explicitly labeled.
### Interpretation
This diagram represents a sophisticated system for question answering over a knowledge graph. The system leverages the power of LLMs to supplement path exploration and employs a combination of fuzzy and precise matching techniques to identify relevant information. The path pruning process is crucial for managing the complexity of the knowledge graph and ensuring efficient query processing. The diagram suggests a system designed to handle complex, multi-hop questions that require reasoning over a large and interconnected knowledge base. The presence of "Unnamed Entity" indicates a potential area for improvement in the knowledge graph's completeness or labeling. The system appears to be designed for a scenario where the knowledge graph is not perfectly curated and requires robust techniques for handling ambiguity and uncertainty.
</details>
Figure 2. Overview of the PoG architecture. Exploration: After initialization (detailed in Figure 3), the model retrieves entity paths from $\mathcal{G}_{q}$ through three exploration phases. Path Pruning: PoG applies a three-step beam search to prune paths after each exploration phase. Question Answering: The pruned paths are then evaluated for question answering. If these paths do not fully answer the question, the model explores deeper paths until $D_{max}$ is reached or moves on to the next exploration phase.
## 3. Preliminary
Consider a Knowledge Graph (KG) $\mathcal{G(E,R,T)}$ , where $\mathcal{E}$ , $\mathcal{R}$ and $\mathcal{T}$ represent the set of entities, relations, and knowledge triples, respectively. Each knowledge triple $T\in\mathcal{T}$ encapsulates the factual knowledge in $\mathcal{G}$ , and is represented as $T=(e_{h},r,e_{t})$ , where $e_{h},e_{t}\in\mathcal{E}$ and $r\in\mathcal{R}$ . Given an entity set $\mathcal{E_{S}\subseteq E}$ , the induced subgraph of $\mathcal{E_{S}}$ is denoted as $\mathcal{S=(E_{S},R_{S},T_{S})}$ , where $\mathcal{T}_{S}=\{(e,r,e^{\prime})\in\mathcal{T}\mid e,e^{\prime}\in\mathcal{E}_{S}\}$ , and $\mathcal{R}_{S}=\{r\in\mathcal{R}\mid(e,r,e^{\prime})\in\mathcal{T}_{S}\}$ . Furthermore, we denote $\mathcal{D}(e)$ and $\mathcal{D}(r)$ as the sets of short textual descriptions for each entity $e\in\mathcal{E}$ and each relation $r\in\mathcal{R}$ , respectively. For example, the text description of the entity “m.0f8l9c” is $\mathcal{D}$ (“m.0f8l9c”)= “France”. For simplicity, in this paper, all entities and relations are referenced through their $\mathcal{D}$ representations and transformed into natural language.
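The notation above can be made concrete with a small sketch in plain Python. All identifiers except "m.0f8l9c" (France, given in the text) are invented toy MIDs, and the triples are illustrative, not drawn from a real KG.

```python
# Toy triple set T and description map D.
triples = {
    ("m.0f8l9c", "location.adjoins", "m.0345h"),   # France adjoins Germany
    ("m.0345h", "location.contains", "m.03x1x"),   # Germany contains Weeze Airport
    ("m.059j2", "location.contains", "m.0jcx1"),   # Netherlands contains Nijmegen
}
desc = {"m.0f8l9c": "France", "m.0345h": "Germany",
        "m.03x1x": "Weeze Airport", "m.059j2": "Netherlands",
        "m.0jcx1": "Nijmegen"}

def induced_subgraph(triples, entity_subset):
    """T_S: triples whose head and tail both lie in the entity subset;
    R_S: the relations appearing in T_S."""
    t_s = {(h, r, t) for (h, r, t) in triples
           if h in entity_subset and t in entity_subset}
    r_s = {r for (_, r, _) in t_s}
    return t_s, r_s

t_s, r_s = induced_subgraph(triples, {"m.0f8l9c", "m.0345h", "m.03x1x"})
```

The subgraph induced by {France, Germany, Weeze Airport} keeps the two triples among those entities and drops the Netherlands/Nijmegen triple.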
**Definition 1 (Reasoning Path)**
*Given a KG $\mathcal{G}$ , a reasoning path within $\mathcal{G}$ is defined as a connected sequence of knowledge triples, represented as: $path_{\mathcal{G}}(e_{1},e_{l+1})=\{T_{1},T_{2},\ldots,T_{l}\}=\{(e_{1},r_{1},e_{2}),(e_{2},r_{2},e_{3}),\ldots,(e_{l},r_{l},e_{l+1})\}$ , where $T_{i}\in\mathcal{T}$ denotes the $i$ -th triple in the path and $l$ denotes the length of the path, i.e., $length(path_{\mathcal{G}}(e_{1},e_{l+1}))=l$ .*
**Example 1**
*Consider a reasoning path between "University" and "Student" in a KG: $path_{\mathcal{G}}(\text{University},\text{Student})=\{(\text{University},\text{employs},\text{Professor}),(\text{Professor},\text{teaches},\text{Course}),(\text{Course},\text{enrolled\_in},\text{Student})\}$ , and it can be visualized as:
$$
\text{University}\xrightarrow{\text{employs}}\text{Professor}\xrightarrow{
\text{teaches}}\text{Course}\xrightarrow{\text{enrolled\_in}}\text{Student}.
$$
It indicates that a "University" employs a "Professor," who teaches a "Course," in which a "Student" is enrolled. The length of the path is 3.*
For any entity $s$ and $t$ in $\mathcal{G}$ , if there exists a reasoning path between $s$ and $t$ , we say $s$ and $t$ can reach each other, denoted as $s\leftrightarrow t$ . The distance between $s$ and $t$ in $\mathcal{G}$ , denoted as $dist_{\mathcal{G}}(s,t)$ , is the shortest reasoning path distance between $s$ and $t$ . For the non-reachable vertices, their distance is infinite. Given a positive integer $h$ , the $h$ -hop neighbors of an entity $s$ in $\mathcal{G}$ is defined as $N_{\mathcal{G}}(s,h)=\{t\in\mathcal{E}|dist_{\mathcal{G}}(s,t)\leq h\}$ .
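The $h$-hop neighborhood $N_{\mathcal{G}}(s,h)$ defined above can be computed with a breadth-first search. The sketch below is an illustrative implementation, assuming triples are followed in either direction since reachability in the text is symmetric ($s\leftrightarrow t$).

```python
from collections import deque

def h_hop_neighbors(triples, s, h):
    """N_G(s, h): entities within shortest-path distance <= h of s.
    BFS over an undirected view of the triples; s itself is included
    since dist(s, s) = 0 <= h."""
    adj = {}
    for head, _, tail in triples:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == h:
            continue  # do not expand beyond depth h
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen
```

On a chain A-B-C-D, the 2-hop neighbors of A are {A, B, C}, matching the shortest-distance definition.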
**Definition 2 (Entity Path)**
*Given a KG $\mathcal{G}$ and a list of entities $list_{e}=[e_{1},e_{2},e_{3},\ldots,e_{l}]$ , the entity path of $list_{e}$ is defined as a connected sequence of reasoning paths, denoted as $path_{\mathcal{G}}(list_{e})=\{path_{\mathcal{G}}(e_{1},e_{2}),path_{\mathcal{G}}(e_{2},e_{3}),\ldots,path_{\mathcal{G}}(e_{l-1},e_{l})\}=\{(e_{s},r,e_{t})\mid(e_{s},r,e_{t})\in path_{\mathcal{G}}(e_{i},e_{i+1})\land 1\leq i<l\}$ .*
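An entity path can be assembled by chaining reasoning paths between consecutive entities in the list. The sketch below is an illustrative version that picks one shortest reasoning path per segment via BFS; it is a simplification, not the paper's actual exploration procedure.

```python
from collections import deque

def shortest_reasoning_path(triples, s, t):
    """One shortest reasoning path from s to t, as a list of triples.
    Triples may be traversed in either direction; returns None if
    s and t are not reachable."""
    adj = {}
    for h, r, e in triples:
        adj.setdefault(h, []).append(((h, r, e), e))
        adj.setdefault(e, []).append(((h, r, e), h))
    prev, frontier = {s: None}, deque([s])
    while frontier:
        node = frontier.popleft()
        if node == t:
            break
        for triple, nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = (node, triple)
                frontier.append(nxt)
    if t not in prev:
        return None
    path, node = [], t
    while prev[node] is not None:
        node, triple = prev[node]
        path.append(triple)
    return path[::-1]

def entity_path(triples, entity_list):
    """path_G(list_e): concatenation of reasoning paths between
    consecutive entities (Definition 2)."""
    segments = [shortest_reasoning_path(triples, a, b)
                for a, b in zip(entity_list, entity_list[1:])]
    if any(seg is None for seg in segments):
        return None
    return [triple for seg in segments for triple in seg]
```

For the running example, the entity path through [Nijmegen, Germany, France] chains the Nijmegen-to-Germany segment (via Weeze Airport) with the Germany-to-France segment, so every topic entity lies on one connected path.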
Knowledge Graph Question Answering (KGQA) is a fundamental reasoning task based on KGs. Given a natural language question $q$ and a KG $\mathcal{G}$ , the objective is to devise a function $f$ that predicts answers $a\in Answer(q)$ utilizing knowledge encapsulated in $\mathcal{G}$ , i.e., $a=f(q,\mathcal{G})$ . Consistent with previous research (Sun et al., 2019; Luo et al., 2024; Sun et al., 2024; Ma et al., 2024), we assume the topic entities $Topic(q)$ mentioned in $q$ and answer entities $Answer(q)$ in ground truth are linked to the corresponding entities in $\mathcal{G}$ , i.e., $Topic(q)\subseteq\mathcal{E}\text{ and }Answer(q)\subseteq\mathcal{E}$ .
## 4. Method
PoG implements “KG-based LLM reasoning” by first exploring all possible faithful reasoning paths and then collaborating with the LLM to perform a three-step beam search selection on the retrieved paths. Compared to previous approaches (Sun et al., 2024; Ma et al., 2024), our model focuses on providing more accurate and question-relevant retrieval-augmented graph information. The framework of PoG is outlined in Figure 2, comprising four main components.
- Initialization. The process begins by identifying the set of topic entities from the question input, then queries the source KG $\mathcal{G}$ by exploring up to $D_{\max}$ hops from each topic entity to construct the evidence subgraph $\mathcal{G}_{q}$ , where $D_{\max}$ is the user-defined maximum exploration depth. Subsequently, we prompt the LLM to analyze the question and generate an indicator that serves as a strategy for answer formulation and predicts the exploration depth $D_{\text{predict}}$ .
- Exploration. After initialization, the model retrieves topic entity paths from $\mathcal{G}_{q}$ through three exploration phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. All reasoning paths are constrained within the depth range $D\in[D_{\text{predict}},D_{\max}]$ .
- Path Pruning. Following each exploration phase, PoG employs a pre-trained LM, LLM prompting, and graph structural analysis to perform a three-step beam search. The pruned paths are then evaluated during question answering.
- Question Answering. Finally, the LLM is prompted to assess whether the pruned reasoning paths sufficiently answer the question. If not, the model incrementally explores deeper paths until $D_{\max}$ is exceeded, then proceeds to the next exploration phase.
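The four components above can be sketched as a single control loop. Everything here is an illustrative stand-in: the toy triple set, the keyword-overlap `prune`, and the `sufficient` check are simple heuristics playing the roles of the paper's pruning and LLM-evaluation modules, not their actual implementations.

```python
# Toy KG for the running example.
TRIPLES = [
    ("Nijmegen", "nearby", "Weeze Airport"),
    ("Weeze Airport", "contained_by", "Germany"),
    ("Germany", "adjoins", "France"),
    ("Nijmegen", "second_level_division", "Netherlands"),
]

def paths_from(topic, depth):
    """Enumerate reasoning paths of exactly `depth` triples from a topic."""
    frontier = [([t], t[2]) for t in TRIPLES if t[0] == topic]
    for _ in range(depth - 1):
        frontier = [(p + [t], t[2]) for p, end in frontier
                    for t in TRIPLES if t[0] == end]
    return [p for p, _ in frontier]

def prune(paths, question, width=3):
    """Toy stand-in for the three-step pruning: rank by keyword overlap."""
    score = lambda p: sum(w.lower() in question.lower()
                          for t in p for w in t)
    return sorted(paths, key=score, reverse=True)[:width]

def sufficient(paths, question):
    """Toy answer check: a path ending at an entity named in the
    question connects both topic entities."""
    return any(p[-1][2] in question for p in paths)

def pog_answer(question, topic, d_predict, d_max):
    # Dynamic deep search: start at the predicted depth, go deeper
    # until D_max is exceeded.
    for depth in range(d_predict, d_max + 1):
        pruned = prune(paths_from(topic, depth), question)
        if pruned and sufficient(pruned, question):
            # The head of the final triple links airport and border facts.
            return pruned[0][-1][0]
    return None
```

Starting from depth 2, the path Nijmegen → Weeze Airport → Germany does not yet reach a question entity; extending to depth 3 reaches France, and the linking entity Germany is returned, mirroring Figure 1 (d).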
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Reasoning Pipeline
### Overview
This diagram illustrates a pipeline for answering complex questions using a knowledge graph (G). The process involves question analysis, graph reduction, node and relation clustering, and graph detection to arrive at an answer. The diagram shows the flow of information from an input question to a final output.
### Components/Axes
The diagram is structured horizontally, depicting a sequence of stages. Key components include:
* **Input:** A question posed in natural language.
* **Knowledge Graph (G):** A visual representation of entities and their relationships.
* **Question Analysis:** A stage where the question is broken down into sub-questions.
* **Graph Reduction:** A process of simplifying the knowledge graph.
* **Node and Relation Clustering:** Grouping similar nodes and relations.
* **Graph Detection:** Identifying relevant subgraphs within the clustered graph.
* **Output:** The answer to the original question.
The diagram also includes:
* **LLM Indicator:** A representation of the Large Language Model's role in the process.
* **Topic Entity:** Entities identified as central to the question.
* **Labels:** "Question", "Country", "France", "Airport", "Nijmegen", "LLM Indictor", "Output1", "Output2", "Split_question1", "Split_question2", "Graph Reduction", "Node and Relation Clustering", "Graph Detection".
* **Variables:** G, G', Dmax, LLM, αsplit.
* **Relationships:** "serves", "own", "borders", "answer".
### Detailed Analysis or Content Details
The diagram depicts the following flow:
1. **Input:** The input question is: "What country bordering France contains an airport that serves Nijmegen?". The input also includes variables G (Knowledge Graph), G' (Reduced Knowledge Graph), and Dmax (maximum distance).
2. **Question Analysis:** The question is analyzed and split into two sub-questions:
* Split_question1: "What country contains an airport that serves Nijmegen?"
* Split_question2: "What country borders France?"
3. **Knowledge Graph (G):** The knowledge graph shows entities "France" and "Nijmegen" connected by relationships. Dotted circles around "France" and "Nijmegen" indicate a search radius of Dmax. The relationship "borders" connects France to other countries.
4. **LLM Indicator:** The LLM indicator shows the relationships "Nijmegen serves airport", "airport own answer (country)", and "France borders".
5. **Graph Reduction:** The initial graph (G) is reduced to a simplified graph (G'). This is represented by a network of nodes and edges.
6. **Node and Relation Clustering:** The reduced graph (G') undergoes node and relation clustering, resulting in a more organized network.
7. **Graph Detection:** Relevant subgraphs are detected within the clustered graph.
8. **Output1:** The LLM provides an intermediate output.
9. **Output2:** The final output is a boolean value (represented by a green checkmark), indicating a positive answer.
The diagram uses arrows to indicate the flow of information between stages. The networks in the "Graph Reduction", "Node and Relation Clustering", and "Graph Detection" stages are visually similar, consisting of interconnected nodes.
### Key Observations
* The pipeline breaks down a complex question into simpler sub-questions.
* The knowledge graph is iteratively refined through reduction and clustering.
* The LLM plays a role in identifying relevant relationships and providing intermediate outputs.
* The final output is a binary indication of whether the answer is found.
* The diagram emphasizes the visual representation of the knowledge graph and its transformation throughout the process.
### Interpretation
This diagram illustrates a method for knowledge graph reasoning, leveraging the power of LLMs to decompose complex queries and navigate a knowledge graph to find answers. The pipeline demonstrates a structured approach to information retrieval, starting with a natural language question and culminating in a definitive answer. The iterative refinement of the graph (reduction, clustering) suggests an attempt to manage complexity and focus on relevant information. The LLM indicator highlights the integration of language models into the reasoning process, potentially for relationship extraction or answer generation. The use of Dmax suggests a spatial or relational constraint in the search process. The overall design suggests a system aimed at providing accurate and efficient answers to complex questions by combining the strengths of knowledge graphs and LLMs. The diagram is a high-level overview and does not provide details on the specific algorithms or techniques used in each stage.
</details>
Figure 3. Overview of the initialization phase. Output 1: from the input question, the model identifies topic entities and prompts the LLM to decompose questions into split questions $q_{split}$ and generate an indicator $I_{LLM}$ . The indicator outlines a strategy for formulating the answer and predicts the exploration depth $D_{predict}$ . Output 2: the model queries the source KG up to $D_{max}$ -hop from identified topic entities, constructing and pruning the evidence subgraph $\mathcal{G}_{q}$ .
### 4.1. Initialization
The initialization has two main stages, i.e., question subgraph detection and question analysis. The framework is shown in Figure 3.
Question subgraph detection. Given a question $q$ , PoG initially identifies the question subgraph, which includes all the topic entities of $q$ and their $D_{\max}$ -hop neighbors.
Topic entity recognition. To identify the relevant subgraph, PoG first employs LLMs to extract potential topic entities from the question. Following identification, the process applies BERT-based similarity matching to align these potential entities with entities in the KG. Specifically, as shown in Figure 3, we encode both the keywords and all KG entities into dense vector embeddings, $H_{T}$ and $H_{\mathcal{G}}$ . We then compute a cosine similarity matrix between these embeddings to determine the matches. For each keyword, the entities with the highest similarity scores are selected to form the set $Topic(q)$ . This set serves as the foundation for constructing the question subgraph in subsequent steps.
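The matching step can be illustrated with a small sketch. Here a toy bag-of-words vector stands in for the BERT encoder (which is an assumption for brevity; only the cosine-matching logic mirrors the description above), and the entity names are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for the BERT encoder;
    # only the cosine-matching logic is faithful here.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_topic_entities(keywords, kg_entities):
    """Align each LLM-extracted keyword with its most similar KG entity."""
    return {kw: max(kg_entities, key=lambda e: cosine(embed(kw), embed(e)))
            for kw in keywords}
```

For example, `match_topic_entities(["Nijmegen airport"], ["Nijmegen", "Nijmegen Airport"])` aligns the keyword with "Nijmegen Airport" rather than the bare city entity.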
Subgraph detection. Upon identifying the topic entities, PoG captures the induced subgraph $\mathcal{G}_{q}\subseteq\mathcal{G}$ by expanding around each entity $e$ in $Topic(q)$ . For each entity, we retrieve the knowledge triples associated with its $D_{\max}$ -hop neighbors, thereby incorporating query-relevant and faithful KG information into $\mathcal{G}_{q}$ . Through this process, we update $\mathcal{E}_{q}$ with newly added intermediate nodes that serve as bridging pathways between the topic entities. The resulting subgraph $\mathcal{G}_{q}$ is defined as $(\mathcal{E}_{q},\mathcal{R}_{q},\mathcal{T}_{q})$ , where $\mathcal{E}_{q}$ encompasses $Topic(q)$ together with the set $\{N_{\mathcal{G}}(e,D_{\max})\mid e\in Topic(q)\}$ , effectively linking all relevant entities and their connective paths within the defined hop distance. To interact with the KG, we utilize pre-defined SPARQL queries, as detailed in Appendix D.
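The $D_{\max}$-hop expansion can be sketched as a breadth-first collection of triples. The toy in-memory triple list below stands in for the SPARQL endpoint over Freebase; entity and relation names are illustrative assumptions, not actual Freebase identifiers.

```python
# Toy in-memory triple store standing in for the SPARQL endpoint;
# entity and relation names here are illustrative only.
TRIPLES = [
    ("Nijmegen", "servedBy", "Weeze Airport"),
    ("Weeze Airport", "locatedIn", "Germany"),
    ("Germany", "borders", "France"),
]

def d_hop_subgraph(topic_entities, triples, d_max):
    """Collect every triple within d_max hops of any topic entity,
    treating the KG as undirected for neighbourhood expansion."""
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((h, r, t))
        adj.setdefault(t, []).append((h, r, t))
    kept = set()
    for e in topic_entities:
        frontier, visited = {e}, {e}
        for _ in range(d_max):
            nxt = set()
            for node in frontier:
                for (h, r, t) in adj.get(node, []):
                    kept.add((h, r, t))
                    other = t if node == h else h
                    if other not in visited:
                        visited.add(other)
                        nxt.add(other)
            frontier = nxt
    return kept
```

With `d_max=2` from "Nijmegen", the subgraph reaches "Germany" via the airport but does not yet include the "borders" triple, which lies one hop further out.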
Graph pruning. To efficiently manage information overhead and reduce computational cost, we implement graph pruning on the question subgraph $\mathcal{G}_{q}$ using node and relation clustering alongside graph reduction techniques. As illustrated in Figure 3, node and relation clustering is achieved by compressing multiple nodes and their relations into supernodes, which aggregate information from the original entities and connections. For graph reduction, we employ bidirectional BFS to identify all paths connecting the topic entities. Based on these paths, we regenerate induced subgraphs that involve only the relevant connections, effectively excluding nodes and relations that lack strong relevance to the topic entities.
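The graph-reduction step above can be sketched as follows: enumerate the paths connecting the topic entities and keep only the triples lying on some path. As an assumption for brevity, a plain DFS replaces the bidirectional BFS that PoG actually uses, and the toy triples (including the irrelevant "Poland" branch) are illustrative.

```python
# Toy triple list; the "Poland" branch is connected but lies on no path
# between the two topic entities, so graph reduction discards it.
TRIPLES = [
    ("Nijmegen", "servedBy", "Weeze Airport"),
    ("Weeze Airport", "locatedIn", "Germany"),
    ("Germany", "borders", "France"),
    ("Germany", "borders", "Poland"),
]

def build_adj(triples):
    adj = {}
    for tr in triples:
        adj.setdefault(tr[0], []).append(tr)
        adj.setdefault(tr[2], []).append(tr)
    return adj

def paths_between(adj, src, dst, max_len):
    """Enumerate simple paths (as triple sequences) of length <= max_len.
    PoG uses a bidirectional BFS; a plain DFS is shown here for brevity."""
    out = []
    def dfs(node, path, visited):
        if node == dst and path:
            out.append(tuple(path))
            return
        if len(path) == max_len:
            return
        for (h, r, t) in adj.get(node, []):
            nxt = t if node == h else h
            if nxt not in visited:
                dfs(nxt, path + [(h, r, t)], visited | {nxt})
    dfs(src, [], {src})
    return out

def reduce_graph(triples, topic_a, topic_b, max_len):
    """Keep only triples that lie on some path connecting the topic entities."""
    kept = set()
    for p in paths_between(build_adj(triples), topic_a, topic_b, max_len):
        kept.update(p)
    return kept
```

Running `reduce_graph(TRIPLES, "Nijmegen", "France", max_len=3)` retains the three-triple chain and drops the "Poland" triple, mirroring how reduction excludes nodes without strong relevance to the topic entities.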
Question analysis. To reduce hallucinations in LLMs, the question analysis phase is divided into two parts and executed within a single LLM call using an example-based prompt (shown in Appendix E). First, the complex question $q$ is decomposed into simpler questions based on the identified topic entities, each addressing their relationship to the potential answer. Addressing these simpler questions collectively guides the LLM to better answer the original query, thereby reducing hallucinations. Second, an LLM indicator is generated, encapsulating all topic entities and predicting the answer's position within a single chain of thought derived from the original question. This indicator highlights the relationships and ordering among the entities and the answer. Based on this, a predicted depth $D_{\text{predict}}$ is calculated, defined as the maximum distance between the predicted answer and each topic entity. An example of question analysis with predicted depth 2 is shown in Figure 3.
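The depth calculation can be made concrete with a small sketch. As a simplifying assumption, the indicator is treated as a linear chain of entity slots; the chain below follows the running example of Figure 3 (Nijmegen → airport → answer(country) ← France).

```python
def predicted_depth(chain, answer_slot, topic_entities):
    """D_predict = the maximum distance between the predicted answer
    position and each topic entity along the indicator's chain."""
    i = chain.index(answer_slot)
    return max(abs(i - chain.index(e)) for e in topic_entities)

# Chain from the running example: Nijmegen is 2 hops from the answer
# slot, France is 1 hop away, so D_predict = 2.
chain = ["Nijmegen", "airport", "answer", "France"]
```

Here `predicted_depth(chain, "answer", ["Nijmegen", "France"])` yields 2, matching the example's predicted depth.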
### 4.2. Exploration
As discussed in Section 1, identifying reasoning paths that encompass all topic entities is essential to deriving accurate answers. These paths serve as interpretable chains of thought, providing both the answer and the inference steps leading to it, a feature we refer to as interpretability. To optimize the discovery of such paths efficiently and accurately, the exploration process is divided into three phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. After each phase, we perform path pruning and question answering. If a sufficient path is found, the process terminates; otherwise, it advances to the next phase to explore additional paths. Due to space limitations, the pseudo-code of the exploration section is given in Appendix A.1.
Topic entity path exploration. To reduce LLM usage and the search space, PoG begins exploration from the predicted depth $D_{\text{predict}}$ rather than the maximum depth. Using the question subgraph $\mathcal{G}_{q}$ , topic entities $Topic(q)$ , LLM indicator $I_{\text{LLM}}$ , and $D_{\text{predict}}$ , PoG identifies reasoning paths containing all topic entities by iteratively adjusting the exploration depth $D$ . Entities in $Topic(q)$ are ordered according to $I_{\text{LLM}}$ to facilitate reasoning. Starting from the predicted depth $D=\min(D_{\text{predict}},D_{\max})$ , we employ a bidirectional BFS to derive all potential entity paths, defined as:
$$
Paths_{t}=\{p\mid|Topic(q)|\times(D-1)<length(p)\leq|Topic(q)|\times D\},
$$
where $p=Path_{\mathcal{G}_{q}}(Topic(q))$ . To reduce the complexity, a pruning strategy is employed to select the top- $W_{\max}$ paths based on $Paths_{t}$ , $I_{\text{LLM}}$ , and the split questions from Section 4.1. These paths are then evaluated for sufficiency. If they are inadequate, $D$ is incremented until $D_{\max}$ is reached, after which the next phase commences.
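The depth window and the iterative deepening loop can be sketched as follows. The `sufficient` callback is a hypothetical stand-in for the pruning-plus-LLM sufficiency check; paths are represented abstractly as sequences of steps.

```python
def depth_window(paths, n_topics, d):
    """Paths_t: paths whose length lies in (|Topic(q)|*(D-1), |Topic(q)|*D]."""
    lo, hi = n_topics * (d - 1), n_topics * d
    return [p for p in paths if lo < len(p) <= hi]

def explore_topic_paths(all_paths, n_topics, d_predict, d_max, sufficient):
    """Iterative deepening: start at min(D_predict, D_max) and increment D
    until a depth window yields paths the evaluator accepts."""
    for d in range(min(d_predict, d_max), d_max + 1):
        cand = depth_window(all_paths, n_topics, d)
        if cand and sufficient(cand):
            return cand
    return []
```

Starting from $D_{\text{predict}}$ rather than $D=1$ means the very first window already targets the depth the LLM expects the answer at, saving LLM calls on typical questions.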
LLM supplement path exploration. Traditional KG-based LLM reasoning often rephrases KG facts without utilizing the LLM's inherent knowledge. To overcome this, PoG prompts the LLM to generate predictions based on path understanding and its implicit knowledge, providing additional relevant insights. This involves generating new LLM thinking indicators $I_{\text{Sup}}$ for predicted entities $e\in Predict(q)$ , and then using text similarity to verify and align them with entities in $\mathcal{E}_{q}$ of $\mathcal{G}_{q}$ . The supplementary entity list $List_{S}(e)=Topic(q)\cup\{e\}$ is built and ranked by $I_{\text{Sup}}$ to facilitate reasoning. Next, supplementary paths $Paths_{s}$ are derived from $List_{S}(e)$ in the evidence KG $\mathcal{G}_{q}$ with a fixed depth $D_{\max}$ :
$$
Paths_{s}=\{p\mid\text{length}(p)\leq|Topic(q)|\times D_{\max}\},
$$
where $p=Path_{\mathcal{G}_{q}}(List_{S}(e))$ . These paths, together with the new indicators, are evaluated in the same manner as in the topic entity path exploration phase. The prompting template is shown in Appendix E.
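The verification-and-alignment step can be sketched with the standard library. As an assumption, `difflib` fuzzy matching stands in for the paper's text-similarity check, and the entity names are illustrative.

```python
import difflib

def align_predictions(predicted, kg_entities, cutoff=0.6):
    """Keep only LLM-predicted entities that align with a real entity in
    E_q; stdlib difflib stands in for the text-similarity check."""
    aligned = []
    for p in predicted:
        hit = difflib.get_close_matches(p, kg_entities, n=1, cutoff=cutoff)
        if hit:
            aligned.append(hit[0])
    return aligned

def supplementary_lists(topic_entities, aligned):
    # List_S(e): the topic entities plus one verified supplementary entity e
    return [topic_entities + [e] for e in aligned]
```

A hallucinated prediction such as "Atlantis" matches nothing in $\mathcal{E}_{q}$ and is dropped, so only grounded supplements seed new paths.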
Node expand exploration. If the previous phases cannot yield sufficient paths, PoG proceeds to node expansion. Unlike previous methods (Sun et al., 2024; Ma et al., 2024) that explore relations and entities separately, PoG explores both simultaneously, leveraging clearer semantic information for easier integration with existing paths. During this exploration, PoG expands unvisited entities by their 1-hop neighbors in $\mathcal{G}$ . New triples are merged into existing paths to form new paths, followed by pruning and evaluation.
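The merge step can be sketched as follows; as a simplifying assumption, each path is extended only at its tail entity, and triples are matched head-to-tail.

```python
def expand_paths(paths, kg_triples):
    """Extend each candidate path by the 1-hop neighbours of its tail
    entity, merging relation and entity in a single step."""
    new_paths = []
    for p in paths:
        tail = p[-1][2]  # last entity on the path
        visited = {h for h, _, _ in p} | {t for _, _, t in p}
        for (h, r, t) in kg_triples:
            if h == tail and t not in visited:
                new_paths.append(p + [(h, r, t)])
    return new_paths
```

Because the relation and the entity arrive as one triple, each extension carries complete semantic context, unlike relation-then-entity exploration in prior work.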
### 4.3. Path Pruning
As introduced in Section 2, KGs contain vast amounts of facts, making it impractical to include all relevant triples in the LLM's context due to high costs. To address this complexity and reduce LLM overhead, we utilize a three-step beam search for path pruning. The corresponding pseudo-code can be found in Appendix A.2.
Fuzzy selection. Considering that only a small subset of the generated paths is relevant, the initial step of our beam search involves fuzzy selection, integrating a pre-trained language model (e.g., Sentence-BERT (Reimers and Gurevych, 2019)) to quickly filter out irrelevant paths. As shown in Figure 2, we encode the LLM indicator $I_{\text{LLM}}$ (or $I_{\text{Sup}}$ ) and all reasoning paths into vector embeddings, denoted as $H_{I}$ and $H_{Paths}$ , and calculate the cosine similarities between them. The top- $W_{1}$ paths with the highest similarity scores are selected for further evaluation.
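The top-$W_{1}$ selection can be sketched as below. A toy bag-of-words vector stands in for Sentence-BERT embeddings (an assumption so the sketch is self-contained); only the rank-and-truncate logic mirrors the step above.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for Sentence-BERT embeddings.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def fuzzy_select(indicator, paths, w1):
    """Rank verbalized reasoning paths by similarity to the LLM indicator
    and keep the top-W1 for the LLM's precise selection step."""
    h_i = embed(indicator)
    return sorted(paths, key=lambda p: cosine(h_i, embed(p)), reverse=True)[:w1]
```

Because this step never calls the LLM, it cheaply discards the bulk of irrelevant paths before any prompting happens.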
Precise path selection. Following the initial fuzzy selection, the number of candidate paths is reduced to $W_{1}$ . At this stage, we prompt the LLM to select the top- $W_{\max}$ reasoning paths most likely to contain the correct answer. The specific prompt used to guide the LLM in this selection phase can be found in Appendix E.
Branch reduced selection. Since paths are often represented in natural language and can be extensive, leading to high processing costs for LLMs, we implement a branch reduced selection method integrated with the graph structure. This method balances efficiency and accuracy by further refining path selection. Starting with $D=1$ , for each entity $e$ in the entity list, we extract the initial $D$ -step paths from every path in the candidate set $Paths_{c}$ into a new set $Paths_{e}$ . If the number of paths in $Paths_{e}$ exceeds the maximum designated width $W_{\max}$ , these paths are pruned using precise path selection. The process iterates with increasing $D$ until $D_{\max}$ is reached. For example, as illustrated in Figure 2, with $W_{\max}=1$ , only the initial-step paths (depicted in green) are extracted for further examination, while the paths represented by dashed lines are pruned. This method enables efficient iterative selection by limiting the number of tokens and ensuring the relevance and conciseness of the reasoning paths.
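The prefix-by-prefix pruning can be sketched as follows. The `choose` callback is a hypothetical stand-in for the LLM's precise path selection; paths are abstract step tuples.

```python
def branch_reduced_select(paths, w_max, d_max, choose):
    """Iteratively examine path prefixes one step at a time: whenever the
    distinct D-step prefixes exceed W_max, prune them with `choose` (a
    stand-in for the LLM's precise path selection), then keep only the
    paths whose prefix survived."""
    survivors = list(paths)
    for d in range(1, d_max + 1):
        prefixes = {p[:d] for p in survivors}
        if len(prefixes) > w_max:
            prefixes = choose(prefixes, w_max)
        survivors = [p for p in survivors if p[:d] in prefixes]
    return survivors
```

Pruning whole branches by their shared prefix means the LLM only ever sees short prefix strings, not full verbalized paths, which is where the token savings come from.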
Beam search strategy. Based on the three path pruning methods above, PoG can support various beam search strategies, ranging from non-reliant to fully reliant on LLMs. These strategies are selectable in a user-friendly manner, allowing flexibility based on the specific requirements of the task. We have defined four such strategies in Algorithm 2 of Appendix A.2.
### 4.4. Question Answering
Based on the pruned paths in Section 4.3, we introduce a two-step question-answering method.
Path Summarizing. To address hallucinations caused by paths with excessive or incorrect text, we develop a summarization strategy that prompts the LLM to review and extract the relevant triples from the provided paths, creating a concise and focused path. Details of the prompts used are in Appendix E.
Question answering. Based on the current reasoning paths derived from path pruning and summarizing, we prompt the LLM to first evaluate whether the paths are sufficient for answering the split questions and then the main question. If the evaluation is positive, the LLM is prompted to generate the answer using these paths, along with the question and question analysis results as inputs, as shown in Figure 2. The prompts for evaluation and generation are detailed in Appendix E. If the evaluation is negative, the exploration process is repeated until completion. If node expand exploration reaches its depth limit without yielding a satisfactory answer, the LLM leverages both the provided and its inherent knowledge to formulate a response.
## 5. Experiments
Table 1. Results of PoG across various datasets, compared with the state-of-the-art (SOTA) in Supervised Learning (SL) and In-Context Learning (ICL) methods. The highest scores for ICL methods are highlighted in bold, while the second-best results are underlined. The Prior FT (Fine-tuned) SOTA includes the best-known results achieved through supervised learning.
| Method | Class | LLM | CWQ (Multi-Hop) | WebQSP (Multi-Hop) | GrailQA (Multi-Hop) | SimpleQuestions (Single-Hop) | WebQuestions (Open-Domain) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without external knowledge | | | | | | | |
| IO prompt (Sun et al., 2024) | - | GPT-3.5-Turbo | 37.6 | 63.3 | 29.4 | 20.0 | 48.7 |
| CoT (Sun et al., 2024) | - | GPT-3.5-Turbo | 38.8 | 62.2 | 28.1 | 20.3 | 48.5 |
| SC (Sun et al., 2024) | - | GPT-3.5-Turbo | 45.4 | 61.1 | 29.6 | 18.9 | 50.3 |
| With external knowledge | | | | | | | |
| Prior FT SOTA | SL | - | 70.4 (Das et al., 2021) | 85.7 (Luo et al., 2024) | 75.4 (Gu et al., 2023) | 85.8 (Baek et al., 2023b) | 56.3 (Kedia et al., 2022) |
| KB-BINDER (Li et al., 2023a) | ICL | Codex | - | 74.4 | 58.5 | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-3.5-Turbo | 58.9 | 76.2 | 68.7 | 53.6 | 54.5 |
| ToG-2.0 (Ma et al., 2024) | ICL | GPT-3.5-Turbo | - | 81.1 | - | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-4 | 69.5 | 82.6 | 81.4 | 66.7 | 57.9 |
| PoG-E | ICL | GPT-3.5-Turbo | 71.9 | 90.9 | 87.6 | 78.3 | 76.9 |
| PoG | ICL | GPT-3.5-Turbo | 74.7 | 93.9 | 91.6 | 80.8 | 81.8 |
| PoG-E | ICL | GPT-4 | 78.5 | 95.4 | 91.4 | 81.2 | 82.0 |
| PoG | ICL | GPT-4 | 81.4 | 96.7 | 94.4 | 84.0 | 84.6 |
Experimental settings. We evaluate PoG on five KGQA datasets, i.e., CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), SimpleQuestions (Petrochuk and Zettlemoyer, 2018), and WebQuestions (Berant et al., 2013). PoG is tested against methods without external knowledge (IO, CoT (Wei et al., 2022), SC (Wang et al., 2022)) and the state-of-the-art (SOTA) approaches with external knowledge, including prompting-based and fine-tuning-based methods. Freebase (Bollacker et al., 2008) serves as the background knowledge graph for all datasets. Experiments are conducted using two LLMs, i.e., GPT-3.5 (GPT-3.5-Turbo) and GPT-4. Following prior studies, we use exact match accuracy (Hits@1) as the evaluation metric. Due to the space limitation, detailed experimental settings, including dataset statistics, baselines, and implementation details, are provided in Appendix C.
PoG setting. We adopt the Fuzzy + Precise Path Selection strategy in Algorithm 2 of Appendix A.2 for PoG, with $W_{1}=80$ for fuzzy selection. Additionally, we introduce PoG-E, which randomly selects one relation from each edge in the clustered question subgraph to evaluate the impact of graph structure on KG-based LLM reasoning. $W_{\max}$ and $D_{\max}$ are 3 by default for beam search.
### 5.1. Main Results
Since PoG leverages external knowledge to enhance LLM reasoning, we first compare it with other methods that utilize external knowledge. Although PoG is a training-free, prompting-based method and is therefore at a natural disadvantage against fine-tuned methods trained on the evaluation data, as shown in Table 1, PoG with GPT-3.5-Turbo still achieves new SOTA performance across most datasets. Additionally, PoG with GPT-4 surpasses the fine-tuned SOTA across all the multi-hop and open-domain datasets by an average of 17.3% and up to 28.3% on the WebQuestions dataset. Comparing all the in-context learning (ICL) methods, PoG with GPT-3.5-Turbo surpasses all previous SOTA methods. When comparing PoG with GPT-3.5-Turbo against the SOTA using GPT-4, PoG outperforms the SOTA by an average of 12.9% and up to 23.9%. When using the same LLM, PoG demonstrates substantial improvements: with GPT-3.5-Turbo, it outperforms the SOTA by an average of 21.2% and up to 27.3% on the WebQuestions dataset; with GPT-4, it outperforms the SOTA by 16.6% on average and up to 26.7% on the WebQuestions dataset. Additionally, PoG with GPT-3.5-Turbo outperforms methods without external knowledge (e.g., IO, CoT, SC prompting) by 62% on GrailQA and 60.5% on SimpleQuestions. These results show that incorporating external knowledge graphs significantly enhances reasoning tasks. PoG-E also achieves excellent results. Under GPT-4, PoG-E surpasses all SOTA ICL methods by 14.1% on average and up to 24.1% on the WebQuestions dataset. These findings demonstrate that the graph structure is crucial for reasoning tasks, particularly for complex logical reasoning. By integrating the structural information of the question within the graph, PoG enhances the deep reasoning capabilities of LLMs, leading to superior performance.
### 5.2. Ablation Study
We perform various ablation studies to understand the importance of different factors in PoG. These studies are conducted with GPT-3.5-Turbo on two subsets of the CWQ and WebQSP test sets, each containing 500 randomly sampled questions.

Does search depth matter? As described, PoG's dynamic deep search is limited by $D_{\max}$ . To assess the impact of $D_{\max}$ on performance, we conduct experiments with depths from 1 to 4. The results, shown in Figures 4 (a) and (c), indicate that performance improves with increased depth, but the benefits diminish beyond a depth of 3. Figures 4 (b) and (d), which show the exploration phase from which the answer is generated, reveal that greater depths reduce the effectiveness of both LLM-based path supplementation and node expansion. Excessive depth leads to LLM hallucinations and difficulties in managing long reasoning paths. Therefore, we set the maximum depth to 3 in our experiments to balance performance and computational efficiency. Additionally, even at lower depths, PoG maintains strong performance by effectively combining the LLM's inherent knowledge with the structured information from the KG.
(a) CWQ (Vary $D_{\max}$ )
(b) CWQ (PoG)
(c) WebQSP (Vary $D_{\max}$ )
(d) WebQSP (PoG)
### Key Observations
* The total accuracy increases significantly as the maximum depth (D<sub>max</sub>) increases from 1 to 3, but the improvement diminishes beyond D<sub>max</sub> = 3.
* "Topic Entity Path Exploration" is the dominant contributor to the overall accuracy.
* The contribution of "LLM Supplement Path Exploration" is minimal across all depth values.
* "Node Expand Exploration" shows a slight decrease in contribution as depth increases.
### Interpretation
The data suggests that increasing the maximum depth of exploration improves accuracy up to a certain point (D<sub>max</sub> = 3 in this case). Beyond that point, the gains in accuracy become marginal. The "Topic Entity Path Exploration" method is the most effective component of the system, indicating that leveraging topic entities and paths is crucial for achieving high accuracy. The limited impact of "LLM Supplement Path Exploration" suggests that the current implementation of LLM supplementation does not significantly enhance performance. The slight decrease in "Node Expand Exploration" contribution with increasing depth could indicate that expanding nodes becomes less relevant as the exploration focuses more on paths and entities. This chart demonstrates the importance of balancing exploration depth with the effectiveness of different exploration strategies. The plateauing of the accuracy curve suggests diminishing returns from further increasing the depth, indicating a potential optimization point for the system.
</details>
(d) WebQSP (PoG)
Figure 4. The accuracy of PoG and PoG-E on the CWQ and WebQSP datasets when varying $D_{\max}$ .
### 5.3. Effectiveness Evaluation
Effectiveness evaluation on multi-entity questions. To evaluate PoG’s performance on multi-entity questions, we report accuracy on all test sets, categorizing questions by the number of topic entities. The results, shown in Table 2, demonstrate that, despite the increased complexity of multi-entity questions compared to single-entity ones, PoG maintains excellent accuracy, achieving up to 93.9% on the WebQSP dataset. This underscores the effectiveness of our structure-based model in handling complex multi-entity queries. Notably, the slightly lower performance on the GrailQA dataset is attributable to questions lacking matched topic entities, which prevents effective reasoning over the KG.
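The categorization behind Table 2 amounts to bucketing each test question by its number of topic entities and averaging correctness per bucket. A minimal sketch of this bookkeeping, with an illustrative data layout (not the authors' evaluation code):

```python
from collections import defaultdict

def accuracy_by_entity_count(results):
    """Bucket (num_topic_entities, is_correct) records into single- vs.
    multi-entity questions and compute accuracy per bucket."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_entities, correct in results:
        bucket = "single-entity" if n_entities == 1 else "multi-entity"
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: hits[b] / totals[b] for b in totals}

# Toy records: (number of topic entities, whether the answer was correct)
acc = accuracy_by_entity_count([(1, True), (1, False), (2, True), (3, True)])
# acc == {'single-entity': 0.5, 'multi-entity': 1.0}
```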
Table 2. Performance of PoG and PoG-E on multi-entity and single-entity questions across all datasets. The symbol ‘-’ indicates that the dataset contains no multi-entity questions.
| Question Set | CWQ | WebQSP | GrailQA | WebQuestions | Simple Questions |
| --- | --- | --- | --- | --- | --- |
| PoG with GPT-3.5-Turbo | | | | | |
| Single-entity | 70.3 | 93.9 | 92.1 | 81.7 | 78.3 |
| Multi-entity | 80.2 | 93.1 | 70.7 | 82.8 | - |
| PoG-E with GPT-3.5-Turbo | | | | | |
| Single-entity | 67.5 | 91.0 | 88.2 | 76.8 | 80.8 |
| Multi-entity | 77.5 | 82.8 | 76.0 | 82.8 | - |
Effectiveness evaluation on multi-hop reasoning. To assess PoG’s performance on multi-hop reasoning tasks, we analyze accuracy by categorizing questions according to the length of their ground-truth SPARQL queries. We randomly sample 1,000 questions from the CWQ and WebQSP datasets and determine each question’s reasoning length by counting the number of relations in its ground-truth SPARQL query. The distribution of reasoning lengths is illustrated in Figure 5. We then evaluate PoG and PoG-E across these ground-truth lengths to understand their effectiveness under varying query complexities. As shown in Figure 6, the performance of PoG and PoG-E remains consistent across different reasoning lengths. Even at the highest length levels in the WebQSP dataset, PoG achieves excellent accuracy, reaching up to 90%. Notably, although some questions have ground-truth lengths of eight or more, PoG successfully answers them without matching the ground-truth length, demonstrating its ability to discover novel paths by combining the LLM’s inherent knowledge with the structured information in the KG. These results demonstrate the effectiveness of PoG on complex multi-hop reasoning tasks.
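For reference, one rough way to derive such reasoning lengths is to count predicate occurrences in each ground-truth SPARQL query's WHERE clause. The sketch below is a heuristic over Freebase-style `ns:domain.type.property` predicates, not the authors' exact procedure, and the query shown is a hypothetical example:

```python
import re

def reasoning_length(sparql: str) -> int:
    """Heuristic reasoning length: count relation (predicate) occurrences in
    the WHERE clause. Freebase predicates have three dot-separated parts
    (ns:domain.type.property), while entity ids (ns:m.xxxx) have only two,
    so a three-part pattern picks out predicates only."""
    body = sparql.split("WHERE", 1)[-1]
    return len(re.findall(r"ns:\w+\.\w+\.\w+", body))

# A hypothetical 2-hop query: the capital of a country, then its container.
query = """SELECT ?x WHERE {
  ns:m.0chghy ns:location.country.capital ?c .
  ?c ns:location.location.containedby ?x .
}"""
# reasoning_length(query) == 2
```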
Table 3. The illustration of graph size reduction.
| | CWQ | WebQSP | GrailQA | WebQuestions |
| --- | --- | --- | --- | --- |
| Avg. Entity Number | 3,540,267 | 243,826 | 62,524 | 240,863 |
| Avg. Entity Number After Pruning | 1,621,055 | 182,673 | 30,267 | 177,822 |
| Avg. Entity Reduction Proportion (%) | 54% | 25% | 52% | 26% |
<details>
<summary>x8.png Details</summary>

### Visual Description
## Bar Chart: Distribution of SPARQL Path Lengths for Question Datasets
### Overview
This bar chart displays the distribution of SPARQL path lengths for two question datasets: CWQ and WebQSP. The x-axis represents the length of paths in SPARQL, ranging from 1 to 8. The y-axis represents the number of questions, displayed on a logarithmic scale. The chart uses paired bars for each path length, one for CWQ (white) and one for WebQSP (black).
### Components/Axes
* **X-axis Title:** "Length of paths in SPARQL"
* **X-axis Markers:** 1, 2, 3, 4, 5, 6, 7, 8
* **Y-axis Title:** "Number of questions"
* **Y-axis Scale:** Logarithmic, ranging from 1.0 to 10^2 (100).
* **Legend:** Located at the top-right of the chart.
* **CWQ:** Represented by a white bar.
* **WebQSP:** Represented by a black bar.
### Detailed Analysis
The chart presents paired bars for each SPARQL path length. I will analyze each path length individually, noting the approximate values based on the logarithmic scale.
* **Path Length 1:** CWQ has approximately 2 questions, WebQSP has approximately 1 question.
* **Path Length 2:** CWQ has approximately 4 questions, WebQSP has approximately 8 questions.
* **Path Length 3:** CWQ has approximately 18 questions, WebQSP has approximately 32 questions.
* **Path Length 4:** CWQ has approximately 14 questions, WebQSP has approximately 25 questions.
* **Path Length 5:** CWQ has approximately 7 questions, WebQSP has approximately 16 questions.
* **Path Length 6:** CWQ has approximately 4 questions, WebQSP has approximately 3 questions.
* **Path Length 7:** CWQ has approximately 3 questions, WebQSP has approximately 0 questions.
* **Path Length 8:** CWQ has approximately 10 questions, WebQSP has approximately 10 questions.
**Trend Analysis:**
* **CWQ:** The number of questions generally increases from path length 1 to 3, then decreases from path length 3 to 6, and then increases again at path length 8.
* **WebQSP:** The number of questions increases from path length 1 to 3, then decreases from path length 3 to 6, and then increases again at path length 8.
### Key Observations
* WebQSP generally has a higher number of questions than CWQ for path lengths 1-6.
* Both datasets show a peak in the number of questions around path length 3.
* Both datasets show a similar trend of increasing questions up to path length 3, then decreasing, and then increasing again at path length 8.
* The logarithmic scale is crucial for interpreting the differences in the number of questions, as the differences appear larger for smaller path lengths.
### Interpretation
The chart illustrates the distribution of SPARQL path lengths required to answer questions from the CWQ and WebQSP datasets. The peak at path length 3 suggests that many questions in both datasets can be answered with relatively short SPARQL queries. The differences between the datasets indicate that WebQSP questions, on average, require slightly more complex SPARQL queries (longer paths) than CWQ questions, especially for path lengths 1-6. The increase in questions at path length 8 for both datasets could indicate a subset of questions that require significantly more complex reasoning or involve multiple relationships. The use of a logarithmic scale highlights the relative differences in question counts across different path lengths, revealing that the number of questions drops off significantly as the path length increases. This suggests that the datasets are biased towards questions that can be answered with shorter, simpler SPARQL queries.
</details>
Figure 5. The lengths of the ground-truth SPARQL queries within the CWQ and WebQSP datasets.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. Path Length in SPARQL
### Overview
This bar chart compares the accuracy of two methods, PoG and PoG-E, across different lengths of paths in SPARQL queries. The x-axis represents the length of the paths, and the y-axis represents the accuracy in percentage. Each path length has two bars, one for PoG and one for PoG-E.
### Components/Axes
* **X-axis Title:** "Length of paths in SPARQL"
* Markers: 1, 2, 3, 4, 5, 6, 7, 8+
* **Y-axis Title:** "Accuracy (%)"
* Scale: 0 to 100, with increments of 20.
* **Legend:** Located at the top-center of the chart.
* PoG: Light blue
* PoG-E: Dark gray
### Detailed Analysis
The chart consists of eight groups of bars, each representing a different path length.
* **Path Length 1:**
* PoG: Approximately 95% accuracy.
* PoG-E: Approximately 75% accuracy.
* **Path Length 2:**
* PoG: Approximately 98% accuracy.
* PoG-E: Approximately 78% accuracy.
* **Path Length 3:**
* PoG: Approximately 82% accuracy.
* PoG-E: Approximately 80% accuracy.
* **Path Length 4:**
* PoG: Approximately 68% accuracy.
* PoG-E: Approximately 65% accuracy.
* **Path Length 5:**
* PoG: Approximately 58% accuracy.
* PoG-E: Approximately 52% accuracy.
* **Path Length 6:**
* PoG: Approximately 54% accuracy.
* PoG-E: Approximately 50% accuracy.
* **Path Length 7:**
* PoG: Approximately 52% accuracy.
* PoG-E: Approximately 46% accuracy.
* **Path Length 8+:**
* PoG: Approximately 57% accuracy.
* PoG-E: Approximately 59% accuracy.
**Trends:**
* **PoG:** The accuracy of PoG generally decreases as the path length increases from 1 to 7, then slightly increases for 8+. The initial drop is significant, from approximately 98% at length 2 to around 52% at length 7.
* **PoG-E:** The accuracy of PoG-E also generally decreases as the path length increases, but the decrease is less pronounced than for PoG. It starts at approximately 75% at length 1 and reaches around 46% at length 7, then increases to 59% at 8+.
### Key Observations
* PoG consistently outperforms PoG-E for path lengths 1, 2, 3, and 4.
* For path lengths 5, 6, and 7, the performance difference between PoG and PoG-E is smaller.
* At path length 8+, PoG-E slightly outperforms PoG.
* Both methods exhibit a decline in accuracy as path length increases, suggesting a challenge in maintaining accuracy with longer SPARQL paths.
### Interpretation
The data suggests that both PoG and PoG-E methods are effective for short SPARQL paths (length 1 and 2), achieving high accuracy. However, as the path length increases, the accuracy of both methods decreases. This could be due to the increased complexity of reasoning over longer paths, leading to more potential errors.
The initial superior performance of PoG suggests it may be better suited for simpler queries. However, PoG-E's relative stability and slight outperformance at path length 8+ indicate it might be more robust for complex queries involving longer paths. The crossover point where PoG-E becomes competitive is around path length 5.
The slight increase in accuracy for both methods at path length 8+ could be due to a variety of factors, such as the nature of the queries used for evaluation or the limitations of the evaluation dataset. Further investigation would be needed to understand this trend. The data highlights a trade-off between simplicity and robustness in SPARQL query processing.
</details>
(a) CWQ
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart: Accuracy vs. SPARQL Path Length
### Overview
This bar chart compares the accuracy of two methods, PoG and PoG-E, across different lengths of paths in SPARQL queries. The x-axis represents the length of the paths, ranging from 1 to 8+, while the y-axis represents the accuracy in percentage (%). Each path length has two bars, one for PoG and one for PoG-E.
### Components/Axes
* **X-axis Title:** "Length of paths in SPARQL"
* **X-axis Markers:** 1, 2, 3, 4, 5, 6, 7, 8+
* **Y-axis Title:** "Accuracy (%)"
* **Y-axis Scale:** 0 to 100, with increments of 20.
* **Legend:**
* PoG (Light Blue)
* PoG-E (Dark Gray)
### Detailed Analysis
The chart consists of paired bars for each path length.
* **Path Length 1:**
* PoG: Approximately 50% accuracy.
* PoG-E: Approximately 52% accuracy.
* **Path Length 2:**
* PoG: Approximately 90% accuracy.
* PoG-E: Approximately 58% accuracy.
* **Path Length 3:**
* PoG: Approximately 97% accuracy.
* PoG-E: Approximately 93% accuracy.
* **Path Length 4:**
* PoG: Approximately 83% accuracy.
* PoG-E: Approximately 80% accuracy.
* **Path Length 5:**
* PoG: Approximately 74% accuracy.
* PoG-E: Approximately 68% accuracy.
* **Path Length 6:**
* PoG: Approximately 68% accuracy.
* PoG-E: Approximately 65% accuracy.
* **Path Length 7:**
* PoG: Approximately 70% accuracy.
* PoG-E: Approximately 68% accuracy.
* **Path Length 8+:**
* PoG: Approximately 93% accuracy.
* PoG-E: Approximately 90% accuracy.
**Trends:**
* **PoG:** The accuracy of PoG generally increases with path length up to path length 3, then decreases for path lengths 4 and 5, remains relatively stable for path lengths 6 and 7, and then increases again for path length 8+.
* **PoG-E:** The accuracy of PoG-E generally increases with path length, but remains lower than PoG for most path lengths.
### Key Observations
* PoG consistently outperforms PoG-E for path lengths 1, 2, 3, 4, 5, 6, 7, and 8+.
* The largest difference in accuracy between PoG and PoG-E occurs at path length 2, where PoG has a significantly higher accuracy.
* Both methods show a dip in accuracy around path length 4 and 5.
* Both methods achieve high accuracy for path length 8+.
### Interpretation
The data suggests that PoG is generally more accurate than PoG-E across various SPARQL path lengths. The initial increase in accuracy for both methods as path length increases could be due to the methods becoming more confident with longer, more defined query patterns. The dip in accuracy around path lengths 4 and 5 might indicate a point where the complexity of the queries starts to introduce more ambiguity or challenges for both methods. The resurgence in accuracy at path length 8+ suggests that the methods can handle even more complex queries effectively. The consistent outperformance of PoG suggests it is a more robust method for handling SPARQL queries, particularly as the path length increases. The chart highlights the importance of considering query complexity (path length) when evaluating the performance of different query methods.
</details>
(b) WebQSP
Figure 6. The accuracy of PoG and PoG-E on the CWQ and WebQSP datasets, categorized by the length of each question’s ground-truth SPARQL query.
Graph structure pruning. To evaluate the effectiveness of the graph pruning method proposed in Section 4.1, we conduct experiments using 200 random samples from each dataset. We report the average number of entities per question before and after graph reduction, as well as the proportion of entities reduced, in Table 3. The results indicate that up to 54% of entities in the CWQ dataset can be pruned before path exploration. This demonstrates the effectiveness of eliminating irrelevant data from the outset.
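As a sanity check, the reduction proportions in Table 3 follow directly from the before/after entity counts; a minimal helper (names illustrative):

```python
def reduction_proportion(before: int, after: int) -> float:
    """Fraction of entities removed by graph-structure pruning."""
    return (before - after) / before

# Average entity counts for CWQ from Table 3:
cwq = reduction_proportion(3_540_267, 1_621_055)
print(f"CWQ reduction: {cwq:.0%}")  # CWQ reduction: 54%
```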
Case study: interpretable reasoning. To demonstrate the interpretability of PoG, we conduct a case study presenting three reasoning examples in Table 9 of Appendix B.5. These examples feature questions with one, two, and three topic entities, respectively. The case study showcases PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing faithful and interpretable reasoning paths that lead to accurate answers.
To further evaluate the effectiveness and efficiency of PoG, we perform additional experiments, including pruning beam search strategy ablation and prompt setting ablation (Appendix B.1), reasoning faithfulness analysis (Appendix B.2), error analysis (Appendix B.3), LLM calls cost and running time analysis (Appendix B.4), and graph reduction and path pruning case study (Appendix B.5).
## 6. Conclusion
In this paper, we introduce Paths-over-Graph (PoG), a novel method that integrates LLMs with KGs to enable faithful and interpretable reasoning. PoG addresses complex reasoning tasks through a three-phase dynamic multi-hop path exploration, combining the inherent knowledge of LLMs with factual information from KGs. Efficiency is enhanced by graph-structured pruning and a three-step pruning process that effectively narrows down candidate paths. Extensive experiments on five public datasets demonstrate that PoG outperforms existing baselines, showcasing its superior reasoning capabilities and interpretability.
## 7. Acknowledgment
Xiaoyang Wang is supported by the Australian Research Council DP230101445 and DP240101322. Wenjie Zhang is supported by the Australian Research Council DP230101445 and FT210100303.
## References
- Baek et al. (2023b) Jinheon Baek, Alham Fikri Aji, Jens Lehmann, and Sung Ju Hwang. 2023b. Direct Fact Retrieval from Knowledge Graphs without Entity Linking. In ACL.
- Baek et al. (2023a) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023a. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136 (2023).
- Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, Vol. 38. 17682–17690.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
- Brown (2020) Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
- Chen et al. (2024) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter 25, 2 (2024), 42–61.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
- Das et al. (2021) Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In EMNLP.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. In ICLR.
- Gu et al. (2023) Yu Gu, Xiang Deng, and Yu Su. 2023. Don’t Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments. In ACL.
- Gu et al. (2021) Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases. In WWW. 3477–3488.
- Guo et al. (2023) Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066 (2023).
- Guo et al. (2019) Lingbing Guo, Zequn Sun, and Wei Hu. 2019. Learning to exploit long-term relational dependencies in knowledge graphs. In International conference on machine learning. PMLR, 2505–2514.
- Guo et al. (2024) Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu, Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. 2024. Knowledgenavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex & Intelligent Systems 10, 5 (2024), 7063–7076.
- Huang et al. (2025) Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao. 2025. Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models. In COLING.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645 (2023).
- Kedia et al. (2022) Akhil Kedia, Mohd Abbas Zaidi, and Haejun Lee. 2022. FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering. In EMNLP.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
- Kim et al. (2023) Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. 2023. Kg-gpt: A general framework for reasoning on knowledge graphs using large language models. arXiv preprint arXiv:2310.11220 (2023).
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS (2022).
- Li et al. (2024a) Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024a. Hypergraph Self-supervised Learning with Sampling-efficient Signals. arXiv preprint arXiv:2404.11825 (2024).
- Li et al. (2024b) Fan Li, Zhiyu Xu, Dawei Cheng, and Xiaoyang Wang. 2024b. AdaRisk: Risk-adaptive Deep Reinforcement Learning for Vulnerable Nodes Detection. IEEE TKDE (2024).
- Li et al. (2023a) Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023a. Few-shot In-context Learning on Knowledge Base Question Answering. In ACL.
- Li et al. (2023b) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2023b. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. arXiv preprint arXiv:2305.13269 (2023).
- Luo et al. (2023) Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538 (2023).
- Luo et al. (2024) Linhao Luo, Yuanfang Li, Reza Haf, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In ICLR.
- Ma et al. (2024) Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. 2024. Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval. arXiv preprint arXiv:2407.10805 (2024).
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. TKDE (2024).
- Peters et al. (2019) Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
- Petrochuk and Zettlemoyer (2018) Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In EMNLP. 554–558.
- Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
- Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.
- Sima et al. (2024) Qing Sima, Jianke Yu, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024. Deep Overlapping Community Search via Subspace Embedding. arXiv preprint arXiv:2404.14692 (2024).
- Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019).
- Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In ICLR.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In NAACL.
- Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937 (2018).
- Tan et al. (2023) Xingyu Tan, Jingya Qian, Chen Chen, Sima Qing, Yanping Wu, Xiaoyang Wang, and Wenjie Zhang. 2023. Higher-Order Peak Decomposition. In CIKM.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Wang et al. (2024a) Jinghao Wang, Yanping Wu, Xiaoyang Wang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. 2024a. Efficient Influence Minimization via Node Blocking. Proc. VLDB Endow. (2024).
- Wang et al. (2024b) Kai Wang, Yuwei Xu, Zhiyong Wu, and Siqiang Luo. 2024b. LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs. In ACL.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS (2022).
- Wen et al. (2024) Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. In ACL.
- Wu et al. (2024) Yanping Wu, Renjie Sun, Xiaoyang Wang, Dong Wen, Ying Zhang, Lu Qin, and Xuemin Lin. 2024. Efficient Maximal Frequent Group Enumeration in Temporal Bipartite Graphs. Proc. VLDB Endow. (2024).
- Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS (2024).
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
- Ye et al. (2021) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 2021. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. arXiv preprint arXiv:2109.08678 (2021).
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In ACL. 201–206.
- Zhang et al. (2021) Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In ICML. 12437–12446.
- Zhang et al. (2023) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.
- Zhao et al. (2023) Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework. In ACL.
## Appendix A Algorithm
### A.1. Exploration
We summarize the comprehensive algorithmic procedure for exploration detailed in Section 4.2 as presented in Algorithm 1.
**Input:** question subgraph ( $\mathcal{G}_{q}$ ), source KG ( $\mathcal{G}$ ), question and split question ( $Q=q+q_{split}$ ), topic entities ( $Topic(q)$ ), LLM indicator ( $I_{\text{LLM}}$ ), predicted depth ( $D_{\text{predict}}$ ), maximum depth ( $D_{\max}$ ), maximum width ( $W_{\max}$ ), and path pruning case ( $case$ )

**Output:** PoG answer ( $a(q)$ ), final reasoning path ( $Paths_{f}(q)$ )

- /* Topic entity path exploration */
- $List_{T}\leftarrow\text{Reorder}(Topic(q),I_{\text{LLM}})$; $D\leftarrow\min(D_{\text{predict}},D_{\max})$;
- **while** $D\leq D_{\max}$ **do**
  - $Paths_{t}\leftarrow\text{EntityPathFind}(List_{T},D,\mathcal{G}_{q})$;
  - $\text{PathPruning}(Paths_{t},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  - $Answer,Paths_{T}\leftarrow\text{QuestionAnswering}(Paths_{t},Q,I_{\text{LLM}})$;
  - **if** "{Yes}" in $Answer$ **then return** $Answer,Paths_{T}$; **else** $D\leftarrow D+1$;
- /* LLM supplement path exploration */
- $Paths_{s}\leftarrow[\,]$; $Predict(q)\leftarrow\text{SupplementPrediction}(Paths_{T},Q,I_{\text{LLM}})$;
- **for each** $e,I_{sup(e)}\in Predict(q)$ **do**
  - $List_{S}\leftarrow\text{Reorder}(List_{T}+e,I_{sup(e)})$;
  - $Paths_{s}^{\prime}\leftarrow\text{EntityPathFind}(List_{S},D_{\max},\mathcal{G}_{q})$;
  - $Paths_{s}\leftarrow Paths_{s}+\text{FuzzySelect}(Paths_{s}^{\prime},I_{sup(e)},W_{\max})$;
- $\text{PathPruning}(Paths_{s},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{S},case)$;
- $Answer,Paths_{S}\leftarrow\text{QuestionAnswering}(Paths_{s},Q,I_{\text{LLM}})$;
- **if** "{Yes}" in $Answer$ **then return** $Answer,Paths_{S}$;
- /* Node expand exploration */
- $Visited\leftarrow\emptyset$; $D\leftarrow 1$; $Paths_{e}\leftarrow Paths_{T}+Paths_{S}$;
- $\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
- **while** $D\leq D_{\max}$ **do**
  - **for each** $e\in\text{ExtractEntity}(Paths_{e})$ with $e\notin Visited$ **do**
    - $Related\_edges\leftarrow\text{Find\_1\_hop\_Edges}(\mathcal{G},e)$;
    - $Paths_{e}\leftarrow\text{MergeTogether}(Paths_{e},Related\_edges)$;
  - $\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  - $Answer,Paths_{E}\leftarrow\text{QuestionAnswering}(Paths_{e},Q,I_{\text{LLM}})$;
  - **if** "{Yes}" in $Answer$ **then return** $Answer,Paths_{E}$; **else** $Visited\leftarrow Visited\cup\{e\}$; $D\leftarrow D+1$;
- /* Final answering over all explored paths */
- $Paths_{l}\leftarrow Paths_{T}+Paths_{S}+Paths_{E}$;
- $\text{PathPruning}(Paths_{l},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
- $Answer,Paths_{L}\leftarrow\text{QuestionAnsweringFinal}(Paths_{l},Q,I_{\text{LLM}})$;
- **return** $Answer,Paths_{L}$;
Algorithm 1 Exploration
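To make the control flow of Algorithm 1 concrete, the sketch below re-implements its first phase (topic-entity path exploration) in plain Python. The helper functions (`entity_path_find`, `path_pruning`, `question_answering`) are hypothetical stubs standing in for the components of Section 4.2, not the authors' implementation; the deepening loop is the part being illustrated.

```python
# Illustrative sketch of Algorithm 1, phase 1 (topic-entity path exploration).
# All helpers are simplified stand-ins for the components in Section 4.2.

def entity_path_find(topic_entities, depth, graph):
    """Stub: return candidate paths of length `depth` that start from a
    topic entity (paths are modelled as tuples of entities)."""
    return [p for p in graph if len(p) - 1 == depth and p[0] in topic_entities]

def path_pruning(paths, question, width):
    """Stub for Algorithm 2: keep at most `width` candidate paths."""
    return paths[:width]

def question_answering(paths, question):
    """Stub for the LLM answering step: returns (answer_found, answer)."""
    for p in paths:
        if p[-1] == question["answer"]:
            return True, p[-1]
    return False, None

def explore(graph, question, topic_entities, d_predict, d_max, w_max):
    """Deepen topic-entity paths until an answer is confirmed or the depth
    budget runs out (the first while-loop of Algorithm 1)."""
    depth = min(d_predict, d_max)
    while depth <= d_max:
        paths = path_pruning(entity_path_find(topic_entities, depth, graph),
                             question, w_max)
        found, answer = question_answering(paths, question)
        if found:
            return answer, paths
        depth += 1  # no answer yet: explore one hop deeper
    return None, []  # hand over to the supplement / node-expansion phases

# Toy usage: a two-hop question answered at depth 2.
graph = [("A", "B"), ("A", "B", "C")]
answer, paths = explore(graph, {"answer": "C"}, {"A"}, 1, 3, 5)
```

In the full algorithm, a failed phase-1 search falls through to the LLM-supplement and node-expansion phases rather than returning empty-handed.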
### A.2. Path Pruning
Algorithm 2 presents the path pruning procedure detailed in Section 4.3.
Input: Candidate paths ($Paths_{c}$), question and split question ($Q=q+q_{split}$), indicator ($I$), maximum width ($W_{\max}$), maximum depth ($D_{\max}$), entity list ($list$)
Output: Pruned candidate paths ($Paths_{c}$)
if $case$ = Fuzzy Selection Only then $\text{FuzzySelect}(Paths_{c},Q,I,W_{\max})$;
else if $case$ = Fuzzy + Precise Path Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$;
  $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$;
else if $case$ = Fuzzy + Branch Reduced Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$;
  $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{\max},D_{\max},list)$;
else if $case$ = Fuzzy + Branch Reduced + Precise Path Selection then /* 3-Step Beam Search */
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$;
  $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{2},D_{\max},list)$;
  $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$;
Procedure BranchReduceSelect($Paths_{c},Q,I,W,D_{\max},list$)
  $D\leftarrow 1$, $Paths_{e}\leftarrow\emptyset$;
  while $|Paths_{c}|\geq W\land D\leq D_{\max}$ do
    for each $e\in list$ do
      $Paths_{e}\leftarrow Paths_{e}\cup\text{ExtractHeadSteps}(Paths_{c},e,D)$;
      if $|Paths_{e}|>W$ then $\text{PrecisePathSelect}(Paths_{e},Q,I,W)$;
    $Paths_{c}\leftarrow\text{IntersectMatchUpdate}(Paths_{e},Paths_{c})$;
    $Paths_{e}\leftarrow\emptyset$;
    $D\leftarrow D+1$;
  if $|Paths_{c}|>W$ then $\text{PrecisePathSelect}(Paths_{c},Q,I,W)$;
Algorithm 2: PathPruning
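The following is one possible reading of the `BranchReduceSelect` procedure in Python, under the assumptions that paths are tuples of entities/relations and that `PrecisePathSelect` can be approximated by a generic scoring function; `ExtractHeadSteps` and `IntersectMatchUpdate` are folded into prefix-set operations. This is an illustrative sketch, not the released code.

```python
# Illustrative reading of BranchReduceSelect (Algorithm 2): prune candidate
# paths by their shared head steps, one depth at a time.

def precise_select(paths, width, score):
    """Stand-in for PrecisePathSelect: keep the top-`width` paths by score
    (the paper scores paths with the LLM; any key function works here)."""
    return sorted(paths, key=score, reverse=True)[:width]

def branch_reduce_select(paths, width, d_max, entities, score):
    """At each depth D, gather the length-D head steps (prefixes) rooted at
    the given entities, keep at most `width` of them, and discard every path
    whose prefix was eliminated (IntersectMatchUpdate in the paper)."""
    depth = 1
    while len(paths) >= width and depth <= d_max:
        prefixes = {p[:depth + 1] for p in paths if p[0] in entities}
        if len(prefixes) > width:
            prefixes = set(precise_select(list(prefixes), width, score))
        paths = [p for p in paths if p[:depth + 1] in prefixes]
        depth += 1
    if len(paths) > width:  # final safeguard, as in Algorithm 2
        paths = precise_select(paths, width, score)
    return paths

# Toy usage: the scorer prefers branches that pass through "x", so the two
# "x" branches survive and the "y"/"z" branches are cut at depth 1.
candidates = [("A", "x", "1"), ("A", "x", "2"),
              ("A", "y", "3"), ("A", "y", "4"), ("A", "z", "5")]
kept = branch_reduce_select(candidates, 2, 2, {"A"}, lambda p: ("x" in p, p))
```

Pruning whole branches by prefix is what lets this step cut many candidate paths with a single selection call, which is where the token savings reported below come from.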
## Appendix B Experiment
### B.1. Additional Ablation Study
Comparing the effect of different beam search strategies. As introduced in Section 4.3, PoG supports various beam search strategies, ranging from non-reliant to fully reliant on LLMs, selectable in a user-friendly manner. To evaluate computational cost and performance, we test the four cases outlined in Algorithm 2. In the 3-Step Beam Search case, we set $W_{2}=20$ for internal narrowing. The Fuzzy Selection approach, as described in Section 4.3, utilizes all candidate paths and an LLM-generated indicator for encoding and comparison. For each beam search strategy applied to PoG and PoG-E, we report accuracy, the average total number of LLM calls, and the average token input during path pruning in Table 4 and Table 5. These results indicate that PoG/PoG-E with Fuzzy and Precise Path Selection achieves the highest accuracy. Additionally, the BranchReduced Selection method, which leverages the graph structure, not only delivers strong results but also reduces token usage by over 50% (65%) with only a ±2% (±4.3%) difference in accuracy on PoG (PoG-E) compared to the best-performing strategy. Furthermore, the Fuzzy Selection method, which employs lightweight models instead of relying solely on LLMs, also performs well. These results validate the effectiveness of our beam search strategies and underscore the importance of structure-based faithful path reasoning.
Table 4. Performance comparison of PoG with different beam search methods on CWQ and WebQSP.
| PoG | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 57.1 | 86.4 |
| | Token Input | - | - |
| | LLM Calls | 6.8 | 6.5 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 79.3 | 93.0 |
| | Token Input | 101,455 | 328,742 |
| | LLM Calls | 9.7 | 9.3 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 617,448 |
| | LLM Calls | 9.1 | 7.5 |
| w/ 3-Step Beam Search | Accuracy | 79.8 | 91.9 |
| | Token Input | 102,036 | 369,175 |
| | LLM Calls | 8.8 | 9.0 |
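The Fuzzy Selection step described above can be sketched as follows. The paper encodes paths and an LLM-generated indicator with a pre-trained encoder such as SBERT; here a toy bag-of-words embedding stands in so the example stays self-contained, and `fuzzy_select` is a hypothetical name for the step.

```python
# Sketch of Fuzzy Selection: rank candidate paths by embedding similarity
# to an LLM-generated indicator, with no further LLM calls. A toy
# bag-of-words embedding replaces the pre-trained encoder (e.g. SBERT).
import math
from collections import Counter

def embed(text):
    """Toy embedding: term-frequency vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) \
         * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def fuzzy_select(paths, indicator, width):
    """Keep the `width` paths most similar to the indicator."""
    ind = embed(indicator)
    return sorted(paths, key=lambda p: cosine(embed(p), ind),
                  reverse=True)[:width]

# Toy usage: the indicator pulls out the path about France's capital.
paths = ["France capital Paris", "France population 67 million",
         "Germany capital Berlin"]
top = fuzzy_select(paths, "capital of France", 1)
```

Because the ranking runs entirely in the lightweight encoder, this step costs no LLM tokens, which is why the Fuzzy-only rows in Tables 4 and 5 report no token input.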
How do summary prompts affect performance? Inspired by GoT (Besta et al., 2024), we utilize summary prompts to reduce LLM hallucinations and decrease computational costs. To evaluate their impact, we conduct an ablation study comparing PoG and PoG-E with and without path summarization, measuring both accuracy and the average token input to the LLM API during the path pruning phase. The results, presented in Table 6, show that using graph summaries increases accuracy by up to 10% on the CWQ dataset with PoG-E, while reducing token input by up to 36% on WebQSP. These results indicate that path summarization effectively minimizes LLM hallucinations, enhances the LLM’s understanding of the explored paths, facilitates answer retrieval, enables earlier termination of the reasoning process, and reduces costs.
Table 5. Performance comparison of PoG-E with different beam search methods among CWQ and WebQSP datasets.
| PoG-E | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 62.31 | 82.3 |
| | Token Input | - | - |
| | Avg. LLM Calls | 6.0 | 6.3 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 71.9 | 88.4 |
| | Token Input | 128,407 | 371,083 |
| | Avg. LLM Calls | 9.4 | 9.1 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 80.4 | 91.4 |
| | Token Input | 344,747 | 603,261 |
| | Avg. LLM Calls | 8.3 | 7.4 |
| w/ 3-Step Beam Search | Accuracy | 73.87 | 89.4 |
| | Token Input | 120,159 | 411,283 |
| | Avg. LLM Calls | 8.3 | 9.1 |
Table 6. Performance comparison of PoG and PoG-E with and without path summarizing on CWQ and WebQSP datasets.
| Method | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| PoG | | | |
| w/ Path Summarizing | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 297,359 |
| w/o Path Summarizing | Accuracy | 74.7 | 91.9 |
| | Token Input | 273,447 | 458,545 |
| PoG-E | | | |
| w/ Path Summarizing | Accuracy | 80.4 | 91.4 |
| | Token Input | 314,747 | 273,407 |
| w/o Path Summarizing | Accuracy | 70.4 | 90.4 |
| | Token Input | 419,679 | 428,545 |
### B.2. Reasoning Faithfulness Analysis
Overlap ratio between explored paths and ground-truth paths. We analyze correctly answered samples from three datasets to investigate the overlap ratio between the paths $P$ explored by PoG and the ground-truth paths $S$ in the SPARQL queries. The overlap ratio is defined as the proportion of overlapping relations to the total number of relations in the ground-truth SPARQL path:
$$
Ratio(P)=\frac{|Relation(P)\cap Relation(S)|}{|Relation(S)|},
$$
where $Relation(P)$ denotes the set of relations in path $P$. Figure 7 illustrates the distribution of questions across different overlap ratios. On the WebQSP dataset, PoG achieves the highest proportion of paths that fully overlap with the ground truth, at approximately 60% of questions. In contrast, PoG-E on the GrailQA dataset shows the highest proportion of fully non-overlapping paths (roughly 70%), indicating that PoG-E explores novel paths to derive the answers. The difference between PoG and PoG-E stems from PoG-E’s strategy of randomly selecting one related edge from each clustered edge. These results highlight the effectiveness of our structure-based path exploration method in generating diverse and accurate reasoning paths.
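The overlap ratio defined above can be computed directly from the two relation sets; the small helper below (hypothetical name, with made-up Freebase-style relation names) makes the definition concrete.

```python
# Direct computation of Ratio(P) = |Relation(P) ∩ Relation(S)| / |Relation(S)|.

def overlap_ratio(explored_relations, ground_truth_relations):
    """Fraction of ground-truth relations recovered by the explored path."""
    p, s = set(explored_relations), set(ground_truth_relations)
    return len(p & s) / len(s)

# Toy usage: two of the four ground-truth relations are recovered.
ratio = overlap_ratio(["born_in", "capital_of", "spouse"],
                      ["capital_of", "born_in", "located_in", "currency"])
```

Note that the ratio is normalized by the ground-truth path only, so a fully overlapping explored path scores 1.0 even if it also contains extra relations.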

[Overlap-ratio distribution for PoG on CWQ, approximate percentage of questions per bin: 0: 11%; (0,25]: 6%; (25,50]: 23%; (50,75]: 13%; (75,100]: 12%; 100: 17%.]
(a) CWQ (PoG)

[Overlap-ratio distribution for PoG-E on CWQ, approximate percentage of questions per bin: 0: 12%; (0,25]: 5%; (25,50]: 32%; (50,75]: 12%; (75,100]: 12%; 100: 11%.]
(b) CWQ (PoG-E)

[Overlap-ratio distribution for PoG on WebQSP, approximate percentage of questions per bin: 0: 14%; (0,25]: 4%; (25,50]: 6%; (50,75]: 8%; (75,100]: 10%; 100: 62%.]
(c) WebQSP (PoG)

[Overlap-ratio distribution for PoG-E on WebQSP, approximate percentage of questions per bin: 0: 23%; (0,25]: 11%; (25,50]: 7%; (50,75]: 3%; (75,100]: 2%; 100: 44%.]
(d) WebQSP (PoG-E)

[Overlap-ratio distribution for PoG on GrailQA, approximate percentage of questions per bin: 0: 18%; (0,25]: 8%; (25,50]: 10%; (50,75]: 0%; (75,100]: 0%; 100: 68%.]
(e) GrailQA (PoG)

[Overlap-ratio distribution for PoG-E on GrailQA, approximate percentage of questions per bin: 0: 72%; (0,25]: 8%; (25,50]: 10%; (50,75]: 2%; (75,100]: 0%; 100: 17%.]
(f) GrailQA (PoG-E)
Figure 7. The path overlap ratio of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
Evidence of answer exploration sources. We conduct an analysis of correctly answered samples from three datasets to investigate the sources of evidence used by the LLM in generating answers, as illustrated in Figure 8. Specifically, we categorize all generated answers into three cases: KG only, LLM-inspired KG, and KG-inspired LLM. In the KG only scenario, answers are generated solely based on KG paths. The LLM-inspired KG case involves the LLM predicting an answer using its inherent knowledge and subsequently using the KG to verify its correctness. Conversely, in the KG-inspired LLM case, the paths generated by the KG are insufficient to reach the answer, and the LLM supplements the reasoning using its inherent knowledge. As shown in the figure, up to 14% of answers are generated through the KG-inspired LLM approach, and up to 9% involve LLM-inspired KG path supplementation. Compared to previous work that integrates LLM inherent knowledge with KG data (Sun et al., 2024), PoG more effectively leverages the strengths of both sources. These results demonstrate that PoG is a faithful reasoning method that primarily relies on KG-based reasoning while being supplemented by the LLM, ensuring both accuracy and interpretability in answer generation.
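The three evidence cases can be read as a simple decision rule. The function below (hypothetical name and flags, illustrative bookkeeping only, not part of PoG itself) captures how each correctly answered sample is categorized.

```python
# Illustrative decision rule for the three evidence cases in Figure 8.

def classify_evidence_source(found_on_kg_path, llm_proposed_first):
    """KG only: the answer comes straight from an explored KG path.
    LLM-inspired KG: the LLM proposed the answer, then a KG path verified it.
    KG-inspired LLM: KG paths were insufficient, so the LLM's inherent
    knowledge supplied the answer."""
    if found_on_kg_path:
        return "LLM-inspired KG" if llm_proposed_first else "KG only"
    return "KG-inspired LLM"

# Toy usage: one sample per case.
cases = [classify_evidence_source(True, False),   # path alone suffices
         classify_evidence_source(True, True),    # LLM guess, KG-verified
         classify_evidence_source(False, False)]  # KG falls short, LLM fills in
```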

[Evidence-source breakdown for PoG. CWQ: KG only 76%, LLM-inspired KG 9%, KG-inspired LLM 12%. WebQSP: 86%, 4%, 9%. GrailQA: 95%, 1%, 3%.]
(a) PoG

[Evidence-source breakdown for PoG-E. CWQ: KG only 77%, LLM-inspired KG 14%, KG-inspired LLM 7%. WebQSP: 87%, 10%, 2%. GrailQA: 92%, 6%, 1%.]
(b) PoG-E
Figure 8. The proportions of answer evidence of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
### B.3. Error Analysis
To further analyze the integration of LLMs and KGs, we conduct an error analysis on the CWQ, WebQSP, and GrailQA datasets. We categorize errors into four types: (1) answer generation error, (2) refuse error, (3) format error, and (4) other hallucination errors. Note that an answer generation error occurs when PoG provides an accurate reasoning path, but the LLM fails to extract the correct answer from it. The distribution of these error types is illustrated in Figure 9. The results indicate that using more powerful LLMs reduces the number of "other hallucination errors," "refuse errors," and "answer generation errors," as the model offers enhanced reasoning capabilities over the retrieved data. In particular, the reduction in "answer generation errors" shows that the reasoning paths provided by PoG are effectively utilized by more advanced LLMs. However, we observe an increase in "format errors" with more powerful LLMs, which may be attributed to their greater creative flexibility.
<details>
<summary>x19.png Details</summary>

Grouped bar chart of error samples (y-axis, 0 to 250) for PoG and PoG-E with GPT-3.5 and GPT-4 on the CWQ, WebQSP, and GrailQA datasets, broken down by error type: others hallucination error, answer generation error, refuse answer, and format error. GPT-4 yields fewer total errors than GPT-3.5 in every configuration, answer generation errors are the dominant type, and format and refuse-answer errors remain comparatively rare.
</details>
Figure 9. The error instances and categories of PoG and PoG-E in the CWQ, WebQSP, and GrailQA datasets.
### B.4. Efficiency Analysis
LLM calls cost analysis. To evaluate the cost and efficiency of utilizing LLMs, we conducted an analysis of LLM calls on the CWQ, WebQSP, and GrailQA datasets. Initially, we examined the proportion of questions answered with varying numbers of LLM calls, as depicted in Figure 10. The results indicate that the majority of questions are answered within nine LLM calls across all datasets, with approximately 80% and 50% of questions being resolved within six calls on CWQ and WebQSP, respectively. These findings demonstrate PoG’s efficiency in minimizing LLM usage costs. Furthermore, we compared the average number of LLM calls required by PoG with the current SOTA method, ToG (Sun et al., 2024), as shown in Table 7. Since we utilized identical datasets for WebQSP, GrailQA, Simple Questions, and WebQuestions, we report the ToG results from their paper. The comparison reveals that PoG achieves comparable or superior accuracy while reducing the number of LLM calls by up to 40% on the GrailQA dataset compared to ToG. This improvement is attributed to PoG’s dynamic exploration strategy, which avoids starting from scratch, and its effective use of graph structures to prune irrelevant information, thereby significantly decreasing computational costs.
<details>
<summary>x20.png Details</summary>

Three bar charts showing, for CWQ, WebQSP, and GrailQA respectively, the percentage of questions (y-axis) answered within each number-of-LLM-calls bucket (x-axis: (0, 3], (3, 6], (6, 9], (9, 12], (12, 15], (15, 18], (18, 21], 21+). In every dataset, only a very small fraction of questions requires more than 21 LLM calls.
</details>
Figure 10. The proportion of questions answered by PoG and PoG-E with different numbers of LLM calls on the CWQ, WebQSP, and GrailQA datasets.
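The bucketing behind Figure 10 can be sketched as follows; this is an illustrative reconstruction with made-up call counts, not the paper's measurement code.

```python
from collections import Counter

# Minimal sketch of the binning in Figure 10: group per-question LLM-call
# counts into the (0, 3], (3, 6], ... buckets and report proportions.
# The call counts below are made-up illustrative data.
def call_bins(calls_per_question: list[int]) -> dict[str, float]:
    def label(c: int) -> str:
        if c > 21:
            return "21+"
        lo = ((c - 1) // 3) * 3          # (0,3] -> lo=0, (3,6] -> lo=3, ...
        return f"({lo}, {lo + 3}]"
    counts = Counter(label(c) for c in calls_per_question)
    n = len(calls_per_question)
    return {bucket: counts[bucket] / n for bucket in counts}

print(call_bins([2, 4, 5, 6, 7, 10, 25]))
```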
Table 7. Average LLM calls per question of PoG and ToG among all datasets.
| Method | CWQ | WebQSP | GrailQA | Simple Questions | WebQuestions |
| --- | --- | --- | --- | --- | --- |
| PoG | 10.7 | 8.3 | 6.5 | 6.1 | 9.3 |
| ToG | - | 11.2 | 10.6 | 8.7 | 10.5 |
Running time analysis. To further evaluate the processing efficiency of PoG, we conducted extensive evaluations focusing on average accuracy, LLM call frequency, and running time on the multi-hop question datasets GrailQA and CWQ. The results, presented in Table 8, compare the performance of ToG and PoG across various beam search strategies. The data indicate that while higher accuracy slightly increases runtime, PoG effectively balances accuracy with reduced LLM call costs. Specifically, PoG reduces LLM call costs by up to 53.5% while increasing accuracy by up to 33.4%. This allows users to tailor the PoG framework to their specific needs regarding accuracy, cost, and runtime. All PoG settings incur significantly lower costs. For instance, on the GrailQA dataset, PoG with 3-step beam search reduces LLM call costs by 39.6% and improves accuracy by 33.1%, with only a 1.14% increase in runtime. A similar trend is also observed on the CWQ dataset.
Table 8. Running time evaluation of ToG and PoG with different beam search methods on CWQ and GrailQA.
| Model | Evaluation | CWQ | GrailQA |
| --- | --- | --- | --- |
| ToG | Accuracy | 53.1 | 59.3 |
| | Time (s) | 78.7 | 14.8 |
| | LLM Calls | 21.3 | 10.1 |
| PoG | Accuracy | 81.4 | 92.7 |
| | Time (s) | 118.9 | 21.4 |
| | LLM Calls | 9.1 | 6.5 |
| PoG w/ 3-Step Beam Search | Accuracy | 79.8 | 92.4 |
| | Time (s) | 87.5 | 15.0 |
| | LLM Calls | 8.8 | 6.1 |
| PoG w/ Fuzzy Selection | Accuracy | 57.1 | 83.0 |
| | Time (s) | 109.7 | 15.7 |
| | LLM Calls | 6.8 | 4.7 |
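The percentage savings quoted in the running time analysis follow directly from the GrailQA column of Table 8; as an illustrative sanity check:

```python
# Reproducing the savings quoted above from the Table 8 numbers (GrailQA
# column). "Reduction" is relative, (old - new) / old; the accuracy
# improvement is the absolute difference in accuracy points, as in the text.
def reduction(old: float, new: float) -> float:
    return (old - new) / old * 100

# ToG: 10.1 LLM calls, 59.3 accuracy on GrailQA.
print(round(reduction(10.1, 4.7), 1))   # PoG w/ Fuzzy Selection: ~53.5% fewer calls
print(round(reduction(10.1, 6.1), 1))   # PoG w/ 3-Step Beam Search: ~39.6% fewer calls
print(round(92.7 - 59.3, 1))            # PoG: ~33.4 accuracy points gained
```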
### B.5. Case Study
Case study: graph reduction and path pruning. We conducted a case study using the example question presented in Figure 2 to illustrate the effects of graph pruning and path pruning on the graph structure. Figure 11 (a) shows the result of graph pruning, where vertices in blue are selected as part of the question subgraph and vertices in black are pruned. In this sample, the number of entities is reduced from 16,740 to 1,245, a 92% reduction in vertices. Figures 11 (b)-(d) show the question subgraph induced by the blue vertices in Figure 11 (a) and the results after applying fuzzy and precise path selection, respectively. In these figures, vertices in blue represent the entities selected after each pruning step, vertices in yellow represent the topic entities, and the vertex in red denotes the final answer entity. From these graphs, we observe that exploiting the graph structure allows irrelevant vertices to be pruned rapidly. Because every vertex in the question subgraph remains connected to all topic entities, the resulting reasoning paths stay faithful and highly relevant to the question, preserving the integrity and relevance of the reasoning process.
<details>
<summary>x21.png Details</summary>

Network visualization of graph pruning on the full exploration graph: vertices in blue are retained as part of the question subgraph, while vertices in black are pruned.
</details>
(a) Graph pruning
<details>
<summary>x22.png Details</summary>

Network visualization of the question subgraph induced by the retained vertices, with the two topic-entity vertices highlighted and connected to the dense cluster of candidate entities.
</details>
(b) Question subgraph
<details>
<summary>x23.png Details</summary>

Network visualization of the question subgraph after fuzzy path selection: the selected entities are highlighted in blue, the two topic-entity vertices connect into the cluster, and unselected candidates remain in gray.
</details>
(c) Fuzzy selection
<details>
<summary>x24.png Details</summary>

Network visualization of the question subgraph after precise path selection: the remaining selected entities, the two topic-entity vertices, and the final answer entity (in red) are highlighted, while pruned candidates remain in gray.
</details>
(d) Precise selection
Figure 11. Visualization of graph reduction and path selection.
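The pruning idea in the case study, retaining only vertices that stay connected to all topic entities, can be sketched as follows. This is a simplified illustration on a toy undirected graph with an assumed hop limit, not the paper's actual Freebase implementation.

```python
from collections import deque

# Illustrative sketch of the graph-pruning idea: keep only vertices that lie
# within `max_hops` of *every* topic entity, so each vertex of the question
# subgraph stays connected to all topic entities. The toy adjacency list and
# hop limit below are assumptions for illustration.
def bfs_within(adj: dict[str, list[str]], source: str, max_hops: int) -> set[str]:
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

def prune(adj: dict[str, list[str]], topics: list[str], max_hops: int) -> set[str]:
    keep = bfs_within(adj, topics[0], max_hops)
    for t in topics[1:]:
        keep &= bfs_within(adj, t, max_hops)   # intersect the reachable sets
    return keep

adj = {
    "Lejre Municipality": ["Denmark"],
    "Denmark": ["Lejre Municipality", "Europe"],
    "Europe": ["Denmark", "Germany"],
    "Germany": ["Europe"],
    "Mars": [],  # disconnected noise vertex: pruned away
}
print(sorted(prune(adj, ["Lejre Municipality", "Germany"], 3)))
```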
Case study: interpretable reasoning. In this section, we present Table 9, which illustrates PoG’s interpretability through case studies involving questions with one, two, and three entities. These examples demonstrate PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing clear and understandable reasoning paths that lead to accurate answers.
Table 9. Examples of faithful and interpretable reasoning of PoG.
| Field | Content |
| --- | --- |
| Question | Lou Seal is the mascot for the team that last won the World Series when? |
| Answer | 2014 World Series |
| Topic Entity | Lou Seal |
| LLM indicator | ” Lou Seal ” - mascot for - team - last won - World Series - when |
| Split Question | split_question 1: What team is ”Lou Seal” the mascot for? |
| split_question 2: When did the last World Series win occur for the team with Lou Seal as the mascot? | |
| Reasoning Path | {Lou Seal} $\rightarrow$ sports.mascot.team $\rightarrow$ {San Francisco Giants} $\rightarrow$ sports.sports_team.championships $\rightarrow$ {2014 World Series, 2012 World Series, 2010 World Series} |
| PoG answer | answer: {2014 World Series} |
| reason: From the given path {Lou Seal} $\rightarrow$ sports.mascot.team $\rightarrow$ {San Francisco Giants}, {San Francisco Giants} is the team that ”Lou Seal” serves as the mascot for. From {San Francisco Giants} $\rightarrow$ sports.sports_team.championships $\rightarrow$ {2014 World Series, 2012 World Series, 2010 World Series}, the last World Series won by the {San Francisco Giants} is in {2014}. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {2014 World Series}. | |
| Question | What European Union country sharing borders with Germany contains the Lejre Municipality? |
| Answer | Denmark |
| Topic Entity | {Germany, Lejre Municipality} |
| LLM indicator | "Lejre Municipality" - contained by - European Union country (answer) - shares borders with - "Germany" |
| Split Question | split_question 1: What European Union country shares borders with ”Germany”? |
| split_question 2: What country contains the Lejre Municipality? | |
| Reasoning Path | {Lejre Municipality} $\rightarrow$ location.administrative_division.country $\rightarrow$ {Denmark} $\rightarrow$ location.location.containedby $\rightarrow$ {Eurasia} $\rightarrow$ location.location.contains $\rightarrow$ {Germany}, |
| {Lejre Municipality} $\rightarrow$ location.administrative_division.country $\rightarrow$ {Denmark} $\rightarrow$ base.locations.countries. continent $\rightarrow$ {Europe} $\rightarrow$ location.location.contains $\rightarrow$ {Western Europe} $\rightarrow$ location.location.contains $\rightarrow$ {Germany} | |
| PoG answer | answer: {Denmark} |
| reason: From {Denmark} $\rightarrow$ location.location.containedby $\rightarrow$ {Eurasia} $\rightarrow$ location.location.contains $\rightarrow$ {Germany}, Denmark is the European Union country that shares borders with "Germany". From {Lejre Municipality} $\rightarrow$ location.administrative_division.country $\rightarrow$ {Denmark}, Denmark is the country that contains the Lejre Municipality. Both paths lead to Denmark, which is the European Union country sharing borders with Germany and containing the Lejre Municipality. Therefore, the provided knowledge graph path is sufficient to answer the main question, and the answer is {Denmark}. | |
| Question | What Central Time Zone continent do the Falkland Islands belong to? |
| Answer | Americas |
| Topic Entity | {Falkland Islands, Continent, Central Time Zone} |
| LLM indicator | answer("Central Time Zone") - owns - answer("Continent") - located contain - "Falkland Islands" |
| Split Question | split_question 1: What continent do the Falkland Islands belong to? |
| split_question 2: What Central Time Zone does the continent belong to? | |
| Reasoning Path | {Central Time Zone} $\rightarrow$ time.time_zone.locations_in_this_time_zone $\rightarrow$ {Americas} $\rightarrow$ common.topic.notable_types $\rightarrow$ {Continent} $\rightarrow$ common.topic.notable_types $\rightarrow$ {Americas} $\rightarrow$ location.location.contains $\rightarrow$ {Falkland Islands} |
| PoG answer | answer: {Americas} |
| reason: From {Continent} $\rightarrow$ common.topic.notable_types $\rightarrow$ {Americas} $\rightarrow$ location.location.contains $\rightarrow$ {Falkland Islands}, the Americas is the continent the Falkland Islands belong to. From {Central Time Zone} $\rightarrow$ time.time_zone.locations_in_this_time_zone $\rightarrow$ {Americas} $\rightarrow$ common.topic.notable_types $\rightarrow$ {Continent}, the Americas is in the Central Time Zone. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {Americas}. | |
## Appendix C Experiment Details
Experiment datasets. To evaluate PoG's capability in multi-hop knowledge-intensive reasoning tasks, we assess it on four KBQA datasets: three multi-hop datasets (CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021)) and one single-hop dataset (SimpleQuestions (Petrochuk and Zettlemoyer, 2018)). Additionally, to examine PoG on more general tasks, we include an open-domain QA dataset, WebQuestions. For the evaluation of large datasets such as CWQ, GrailQA, and SimpleQuestions, we utilize a random sample of 1,000 test cases from CWQ and employ the 1,000 samples previously reported by ToG (Sun et al., 2024) to facilitate a comparison with the SOTA while also minimizing computational costs. Freebase, which encompasses approximately 88 million entities, 20,000 relations, and 126 million triples (Luo et al., 2024; Bollacker et al., 2008), serves as the background knowledge graph for all datasets. The statistics of the datasets utilized in this study are detailed in Table 10. The source code is publicly available at https://github.com/SteveTANTAN/PoG.
Baselines. Inspired by ToG (Sun et al., 2024), we compare our method with standard prompting (IO), Chain-of-Thought (CoT), and Self-Consistency (SC) prompting with six in-context exemplars and "step-by-step" reasoning chains. For each dataset, we also include previous SOTA works for comparison. For a fair comparison, we compare with the previous SOTA among all prompting-based methods and the previous SOTA among all methods, respectively. Since ToG is the current SOTA prompting-based method, we directly refer to their results and those of other baselines reported in their paper for comparisons.
Experimental implementation. Leveraging the plug-and-play convenience of our framework, we experiment with two LLMs: GPT-3.5 and GPT-4. We use the OpenAI API to access GPT-3.5 (GPT-3.5-turbo) and GPT-4. Aligning with ToG, we set the temperature parameter to 0.4 during the exploration process (to increase diversity) and to 0 during the reasoning process (to ensure reproducibility). The maximum token length for generation is set to 256. In all experiments, we set both $W_{\max}$ and $D_{\max}$ to 3 for beam search. All experiments are conducted on a machine with an Intel Xeon Gold 6248R CPU, an Nvidia A5000 GPU, and 512GB of memory.
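To make the role of $W_{\max}$ and $D_{\max}$ concrete, the following is a minimal, self-contained sketch of a width- and depth-bounded path search. It is an illustration only, not the actual PoG implementation: the graph layout, the `score` callback (standing in for PoG's three-step pruning), and all names are our own assumptions.

```python
from typing import Callable, Dict, List, Tuple

def bounded_beam_search(
    graph: Dict[str, List[Tuple[str, str]]],  # entity -> [(relation, neighbor), ...]
    start: str,
    score: Callable[[List[str]], float],      # path scorer (stand-in for pruning)
    w_max: int = 3,                           # beam width (W_max)
    d_max: int = 3,                           # maximum hops (D_max)
) -> List[List[str]]:
    """Explore paths from `start`, keeping at most w_max paths per depth level."""
    beam = [[start]]
    all_paths = list(beam)
    for _ in range(d_max):
        candidates = []
        for path in beam:
            for rel, nxt in graph.get(path[-1], []):
                if nxt not in path:           # avoid revisiting entities (cycles)
                    candidates.append(path + [rel, nxt])
        if not candidates:
            break
        # keep only the w_max highest-scoring extensions for the next hop
        beam = sorted(candidates, key=score, reverse=True)[:w_max]
        all_paths.extend(beam)
    return all_paths
```

With `w_max = d_max = 3`, at most three candidate paths survive each of up to three hops, which is what bounds the number of LLM calls during exploration.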
Table 10. Statistics of the datasets used in this paper. † denotes we randomly select 1,000 samples from the CWQ test set to create the experiment testing set due to the abundance of test samples. ∗ denotes that we utilize the 1,000 samples reported by ToG (Sun et al., 2024) to compare with the state-of-the-art.
| Dataset | Answer Format | Test | Train |
| --- | --- | --- | --- |
| ComplexWebQuestions (CWQ) † | Entity | 1,000 | 27,734 |
| WebQSP | Entity/Number | 1,639 | 3,098 |
| GrailQA ∗ | Entity/Number | 1,000 | 44,337 |
| SimpleQuestions ∗ | Entity/Number | 1,000 | 14,894 |
| WebQuestions | Entity/Number | 2,032 | 3,778 |
## Appendix D SPARQL
This section outlines the pre-defined SPARQL queries used for interacting with the knowledge graph and constructing graphs for our experiments.
### D.1. 1-hop Entity and Relation Search
To facilitate the retrieval of 1-hop neighbors of entities within the Freebase Knowledge Graph, we have predefined a SPARQL query. This query is designed to be executed by simply substituting the appropriate ID for the query entity ID. It returns the connected entities’ IDs and their associated relations’ IDs, indicating whether the connected entity is at the tail or the head of the relation.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    BIND("head" AS ?direction)
  }
}
```
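In practice, "substituting the appropriate ID" amounts to string templating before the query is sent to the SPARQL endpoint. A minimal sketch follows; the template constant and function name are our own, and the literal SPARQL braces are doubled so that `str.format` leaves them intact:

```python
# 1-hop query template; {{ }} render as literal SPARQL braces under str.format.
ONE_HOP_QUERY = """PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {{
  {{ ns:{mid} ?relation ?connectedEntity . BIND("tail" AS ?direction) }}
  UNION
  {{ ?connectedEntity ?relation ns:{mid} . BIND("head" AS ?direction) }}
}}"""

def build_one_hop_query(mid: str) -> str:
    """Substitute a Freebase machine ID (e.g. 'm.01yjl') for the ID placeholder."""
    return ONE_HOP_QUERY.format(mid=mid)
```

The resulting string can then be posted to the endpoint with any SPARQL client (e.g. the SPARQLWrapper library).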
### D.2. Short Textual Description
The following predefined function implements the retrieval of short textual descriptions, $\mathcal{D}(.)$, for converting the identifiers (IDs) of entities or relations into natural language descriptions.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?tailEntity
WHERE {
  {
    ?entity ns:type.object.name ?tailEntity .
    FILTER (?entity = ns:ID)
  }
  UNION
  {
    ?entity <http://www.w3.org/2002/07/owl#sameAs> ?tailEntity .
    FILTER (?entity = ns:ID)
  }
}
```
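The endpoint answers this query in the standard SPARQL 1.1 JSON results format. A small helper (hypothetical name, sketched under that assumption) can then realize $\mathcal{D}(.)$ by taking the first returned name and falling back to the raw ID when no label exists:

```python
def describe(entity_id: str, results: dict) -> str:
    """D(.): map an entity/relation ID to a natural language name.

    `results` is the SPARQL JSON result set of the description query above;
    if it contains no name binding, the raw ID is returned unchanged.
    """
    bindings = results.get("results", {}).get("bindings", [])
    for b in bindings:
        value = b.get("tailEntity", {}).get("value")
        if value:
            return value
    return entity_id
```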
### D.3. 1-hop Subgraph Search
To facilitate subgraph detection in Section 4.1, we implement the 1-hop subgraph detection feature by integrating SPARQL functions described in Appendix D.1 and D.2. The purpose of this function is to retrieve, in a single SPARQL query, the 1-hop neighbors of a given query with their IDs, natural language names, and connected relationships, specifying whether the connected entity is at the tail or the head of the relationship.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?connectedEntityName ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = "en")
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = "en")
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("head" AS ?direction)
  }
}
```
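The bindings returned by this query can be grouped into a small in-memory subgraph before path exploration. The following is a sketch under the same SPARQL 1.1 JSON results assumption; the function and key layout are our own, not PoG's actual data structure:

```python
from collections import defaultdict

def parse_one_hop(results: dict) -> dict:
    """Group 1-hop bindings as {(relation, direction): [(entity_id, name), ...]}.

    `direction` is "tail" or "head" as emitted by the subgraph query above;
    entities without an English label carry the name "Unnamed Entity".
    """
    subgraph = defaultdict(list)
    for b in results.get("results", {}).get("bindings", []):
        key = (b["relation"]["value"], b["direction"]["value"])
        subgraph[key].append(
            (b["connectedEntity"]["value"], b["connectedEntityName"]["value"])
        )
    return dict(subgraph)
```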
## Appendix E Prompts
In this section, we detail the prompts required for our main experimental procedures.
Question Analysis Prompt Template
You will receive a multi-hop question, which is composed of several interconnected queries, along with a list of topic entities that serve as the main keywords for the question. Your task is to break the question into simpler parts, using each topic entity once and provide a Chain of Thought (CoT) that shows how the topic entities are related. Note: Each simpler question should explore how one topic entity connects to others or the answer. The goal is to systematically address each entity to derive the final answer. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} A:
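Programmatically, each template in this appendix is instantiated by filling its {Query}, {Topic Entity}, etc. slots. A minimal sketch (the instruction text is abbreviated here, and the constant and function names are our own):

```python
# Abbreviated stand-in for the Question Analysis template above.
QUESTION_ANALYSIS = (
    "You will receive a multi-hop question ... "  # full instructions elided
    "Q: {query}\n"
    "Topic Entity: {topic_entity}\n"
    "A:"
)

def fill_prompt(template: str, **slots: str) -> str:
    """Fill the named placeholders of a prompt template."""
    return template.format(**slots)
```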
LLM Supplement Prompt Template
Using the main question, a possibly uncertain chain of thought generated by a language model, some related split questions, paths from the "Related_paths" section, and the main topic entities: please first provide three predicted results, and second offer three possible chains of thought that could lead to these results, using the provided knowledge paths and your own knowledge. If any answers are unclear, suggest alternative answers to fill in the gaps in the chains of thought, following the same format as the provided examples. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} Think Indicator:{Think Indicator} Split Question:{Split Question} A:
where {Think Indicator} and {Split Question} are obtained in Section 4.1. An indicator example is shown in Figure 2.
Precise Path Select Prompt Template
Given a main question, an LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: please score the candidate paths and give the top three from the candidate path set that are most likely to lead to the answer of the question. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Candidate Paths:{Candidate Paths} A:
{Candidate Paths} denotes the retrieved reasoning paths $Final\_paths$ to be selected in this request, which are formatted as a series of structural sentences:
$\{e_{0x},...,e_{0z}\}\to r_{1_{i}}\to\{e_{1x},...,e_{1z}\}\to\dots$ $\to r_{l_{j}}\to\{e_{lx},...,e_{lz}\}$
$\dots$
$\{e_{0x},...,e_{0z}\}\to r_{1_{i}}\to\{e_{1x},...,e_{1z}\}\to\dots$ $\to r_{l_{j}}\to\{e_{lx},...,e_{lz}\}$ ,
where $i$ and $j$ in $r_{1_{i}}$ and $r_{l_{j}}$ represent the $i$-th and $j$-th relation from each relation edge in the clustered question subgraph, and each entity $e$ is represented by its ID together with its natural language name $\mathcal{D}(ID)$.
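The structural-sentence format above can be produced by a short serializer over an alternating list of entity sets and relations. The following sketch uses hypothetical names and a plain-text `->` arrow in place of $\rightarrow$:

```python
from typing import List, Union

def format_path(path: List[Union[List[str], str]]) -> str:
    """Serialize [entity_set, relation, entity_set, ...] into the
    '{e0x, ...} -> r -> {e1x, ...}' structural-sentence form."""
    parts = []
    for i, step in enumerate(path):
        if i % 2 == 0:                     # even positions: entity sets
            parts.append("{" + ", ".join(step) + "}")
        else:                              # odd positions: relation names
            parts.append(step)
    return " -> ".join(parts)
```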
Path Summarizing Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, the associated accurately retrieved knowledge paths from the Related paths section, and the main topic entities: your task is to summarize the provided knowledge triples in the Related paths section and generate a chain of thought from the knowledge triples related to the main topic entities of the question, which will be used for generating the answer to the main question and the split questions further. You must make sure you summarize correctly using the provided knowledge triples; you can only use entities with IDs from the given paths, and you cannot skip steps. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A:
{Related_Paths} has the same format as {Candidate_Paths} above.
Question Answering Evaluation Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: your task is to determine whether this knowledge graph path is sufficient to answer the given split question first and then the main question. If it is sufficient, you need to respond {Yes} and provide the answer to the main question. If the answer is obtained from the given knowledge path, it should be the entity name from the path. Otherwise, you need to respond {No} and then explain the reason. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A:
Question Answering Generation Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: your task is to generate the answer based on the given knowledge graph path and your own knowledge. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A: