# Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning
**Authors**: Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang
> Xingyu Tan (0009-0000-7232-7051), University of New South Wales; Data61, CSIRO, Sydney, Australia. xingyu.tan@unsw.edu.au
> Xiaoyang Wang (0000-0003-3554-3219), University of New South Wales, Sydney, Australia. xiaoyang.wang1@unsw.edu.au
> Qing Liu (0000-0001-7895-9551), Data61, CSIRO, Sydney, Australia. q.liu@data61.csiro.au
> Xiwei Xu (0000-0002-2273-1862), Data61, CSIRO, Sydney, Australia. xiwei.xu@data61.csiro.au
> Xin Yuan (0000-0002-9167-1613), Data61, CSIRO, Sydney, Australia. xin.yuan@data61.csiro.au
> Wenjie Zhang (0000-0001-6572-2600), University of New South Wales, Sydney, Australia. wenjie.zhang@unsw.edu.au
(2025)
Abstract.
Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and a lack of relevant knowledge, especially in deep, complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges such as handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and introduces an efficient three-step pruning technique that incorporates graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate that PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
Keywords: Large Language Models; Knowledge Graph; Knowledge Graph Question Answering; Retrieval-Augmented Generation
Conference: Proceedings of the ACM Web Conference 2025 (WWW ’25), April 28-May 2, 2025, Sydney, NSW, Australia
DOI: 10.1145/3696410.3714892; ISBN: 979-8-4007-1274-6/25/04
CCS: Information systems, Question answering
1. Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Flowchart: Method Comparison for Airport-Country Reasoning
### Overview
The diagram compares four methods for answering the question: "What country bordering France contains an airport that serves Nijmegen?" Each section (a-d) demonstrates a different approach, highlighting reasoning paths, errors, and outcomes. The correct answer is Germany, but some methods incorrectly identify Belgium or the Netherlands.
---
### Components/Axes
1. **Sections (a-d)**:
- (a) GPT-3.5/GPT-4 with Chain of Thought (CoT) prompt
- (b) LLM empowered KG exploration search
- (c) LLM empowered KG subgraph answering
- (d) PoG (Proposed method)
2. **Visual Elements**:
- **Question Mark (?)**: Represents the input query.
- **Triple Arrows (→)**: Denote reasoning steps or relationships.
- **Red X**: Incorrect answers.
- **Green Check**: Correct answer.
- **Triple Entities**: Structured as [Entity, Relation, Entity] (e.g., [France, location.location.containedby, Europe]).
---
### Detailed Analysis
#### Section (a): GPT-3.5/GPT-4 (CoT Prompt)
- **Question**: "What country bordering France contains an airport that serves Nijmegen?"
- **Reasoning**:
- Nijmegen is served by airports in neighboring countries.
- Brussels Airport (BRU) in Belgium is the closest major airport.
- **Answer**: Belgium (❌ Incorrect).
- **Error**: Misidentifies Nijmegen's location (Nijmegen is in the Netherlands, not Belgium).
#### Section (b): LLM Empowered KG Exploration Search
- **Question**: Same as above.
- **Reasoning**:
- Exploited triples:
- [France, location.location.containedby, Europe]
- [France, location.location.geolocation, Unnamed Entity]
- [Nijmegen, second_level_division, Netherlands]
- Concludes: Netherlands (❌ Incorrect).
- **Error**: Confuses Nijmegen's country (Netherlands) with the country bordering France.
#### Section (c): LLM Empowered KG Subgraph Answering
- **Question**: Same as above.
- **Reasoning**:
- Attempts subgraph analysis but fails due to "extremely large and dense" subgraphs.
- Refuses to answer (❌).
- **Error**: Inability to process complex subgraphs.
#### Section (d): PoG (Proposed Method)
- **Question**: Same as above.
- **Reasoning**:
- **Subgraph Detection**: Identifies relevant subgraphs (e.g., airports near Nijmegen).
- **Reasoning Path Exploration**:
- Nijmegen → Weeze Airport (Germany) → Germany → Europe → France.
- Nijmegen → Weeze Airport (Germany) → Germany → Unnamed Entity → France.
- **Answer**: Germany (✅ Correct).
- **Key Insight**: Uses spatial reasoning (e.g., "nearby" airports) and containment hierarchies to isolate Germany as the correct answer.
---
### Key Observations
1. **Incorrect Answers**:
- Belgium (a) borders France but Brussels Airport is not the airport that serves Nijmegen in the KG; the Netherlands (b) contains Nijmegen but does not border France.
- LLM methods (a-c) fail due to flawed reasoning or subgraph complexity.
2. **Correct Answer**:
- Germany borders France and contains Weeze Airport, which serves Nijmegen.
3. **PoG Advantage**:
- Combines subgraph detection with reasoning path pruning to avoid irrelevant entities (e.g., Europe, Unnamed Entity).
---
### Interpretation
The diagram illustrates how structured reasoning and knowledge graph (KG) analysis improve accuracy in geospatial reasoning tasks. Traditional methods (a-c) struggle with:
- **Geographical Ambiguity**: Confusing nearby countries (Belgium/Netherlands) with bordering ones.
- **Subgraph Complexity**: Overwhelmed by dense KG structures.
PoG's success lies in:
- **Spatial Reasoning**: Leveraging "nearby" relationships to prioritize relevant airports.
- **Containment Hierarchies**: Using "containedby" and "adjoin_s" relations to isolate Germany as the bordering country.
This approach mirrors human cognitive processes, where contextual clues (e.g., airport proximity) and hierarchical knowledge (e.g., country borders) are integrated to resolve ambiguity.
</details>
Figure 1. Representative workflow of four LLM reasoning paradigms.
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks (Brown, 2020; Chowdhery et al., 2023; Touvron et al., 2023; Besta et al., 2024; Huang et al., 2025). These models leverage pre-training techniques by scaling to billions of parameters and training on extensive, diverse, and unlabelled data (Touvron et al., 2023; Rawte et al., 2023). Despite these impressive capabilities, LLMs face two well-known challenges. First, they struggle with deep and responsible reasoning when tackling complex tasks (Petroni et al., 2020; Talmor et al., 2018; Khot et al., 2022). Second, the substantial cost of training makes it difficult to keep models updated with the latest knowledge (Sun et al., 2024; Wen et al., 2024), leading to errors when answering questions that require specialized information not included in their training data. For example, in Figure 1 (a), although models like GPT can generate plausible answers to knowledge-specific questions, these answers may be incorrect due to outdated information or hallucinated reasoning over the LLM's inherent knowledge base (KB).
To address the problems of erroneous reasoning and knowledge gaps, plan-retrieval-answering methods have been proposed (Luo et al., 2024; Zhao et al., 2023; Li et al., 2023b). In this approach, LLMs are prompted to decompose complex reasoning tasks into a series of sub-tasks, forming a plan. Simultaneously, external KBs are queried to answer each step of the plan. However, this approach still relies heavily on the reasoning abilities of LLMs rather than on the faithfulness of the retrieved knowledge. The generated reasoning steps guide information selection, but answers are chosen based on the LLM’s interpretation of the retrieved knowledge rather than on whether the selection leads to a correct and faithful answer.
To address these challenges, incorporating external knowledge sources like Knowledge Graphs (KGs) is a promising solution to enhance LLM reasoning (Sun et al., 2024; Luo et al., 2024; Pan et al., 2024; Luo et al., 2023). KGs offer abundant factual knowledge in a structured format, serving as a reliable source to improve LLM capabilities. Knowledge Graph Question Answering (KGQA) serves as an approach for evaluating the integration of KGs with LLMs, which requires machines to answer natural language questions by retrieving relevant facts from KGs. These approaches typically involve: (1) identifying the initial entities from the question, and (2) iteratively retrieving and refining inference paths until sufficient evidence has been obtained. Despite their success, they still face challenges such as handling multi-hop reasoning problems, addressing questions with multiple topic entities, and effectively utilizing the structural information of graphs.
Challenge 1: Multi-hop reasoning problem. Current methods (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024), such as the ToG model presented in Figure 1 (b), begin by exploring from each topic entity, with LLMs selecting connected knowledge triples like (France, contained_by, Europe). This process relies on the LLM’s inherent understanding of these triples. However, focusing on one-hop neighbors can result in plausible but incorrect answers and prematurely exclude correct ones, especially when multi-hop reasoning is required. Additionally, multi-hop reasoning introduces significant computational overhead, making efficient pruning essential, especially in dense and large KGs.
Challenge 2: Multi-entity question. As shown in Figure 1 (b), existing work (Guo et al., 2024; Ye et al., 2021; Sun et al., 2024; Ma et al., 2024) typically explores KG for each topic entity independently. When a question involves multiple entities, these entities are examined in separate steps without considering their interconnections. This approach can result in a large amount of irrelevant information in the candidate set that does not connect to the other entities in the question, leading to suboptimal results.
Challenge 3: Utilizing graph structure. Existing methods (Wen et al., 2024; Guo et al., 2023; Chen et al., 2024) often overlook the inherent graph structures when processing retrieved subgraphs. For example, the MindMap model in Figure 1 (c) utilizes LLMs to generate text-formatted subgraphs from KG triples, converting them into graph descriptions that are fed back into the LLM to produce answers. This textual approach overlooks the inherent structural information of graphs and can overwhelm the LLM when dealing with large graphs. Additionally, during KG information selection, most methods use in-context learning by feeding triples into the LLM, ignoring the overall graph structure.
Contributions. In this paper, we introduce a novel method, Paths-over-Graph (PoG). Unlike previous studies that utilize knowledge triples for retrieval (Sun et al., 2024; Ma et al., 2024), PoG employs knowledge reasoning paths, which contain all the topic entities within a single long reasoning chain, as retrieval-augmented input for LLMs. The paths in KGs serve as logical reasoning chains, providing KG-supported, interpretable reasoning logic that addresses issues related to the lack of specific knowledge background and unfaithful reasoning paths.
To address the multi-hop reasoning problem, as shown in Figure 1 (d), PoG first performs question analysis to extract topic entities from the question. Utilizing these topic entities, it decomposes the complex question into sub-questions and generates an LLM thinking indicator termed "Planning". This planning not only serves as an answering strategy but also predicts the implied relationship depths between the answer and each topic entity. Multi-hop paths are then explored starting from the predicted depth, enabling a dynamic search process. Previous planning-based approaches usually retrieve information from scratch, which often confuses LLMs with semantic information drawn from the source entity's neighborhood. In contrast, our method ensures that LLMs follow accurate reasoning paths that lead directly to the answer.
To address multi-entity questions, PoG employs a three-phase exploration process to traverse reasoning paths from the retrieved question subgraph. Every path must contain all topic entities in the same order as they occur in the LLM thinking indicator. Because these reasoning paths are grounded in the KG, they are inherently logical and faithful. Each path potentially contains one possible answer and serves as interpretable reasoning logic. The exploration leverages the inherent knowledge of both the LLM and the KG.
To effectively utilize graph structure, PoG captures the question subgraph by expanding topic entities to their maximal-depth neighbors, applying graph clustering and reduction to reduce graph search costs. In the path pruning phase, we select possible correct answers from numerous candidates. All explored paths undergo a three-step beam search pruning, integrating graph structures, LLM prompting, and a pre-trained language understanding model (e.g., BERT) to ensure effectiveness and efficiency. Additionally, inspired by the Graph of Thought (GoT) (Besta et al., 2024), to reduce LLM hallucination, PoG prompts LLMs to summarize the obtained Top- $W_{\max}$ paths before evaluating the answer, where $W_{\max}$ is a user-defined maximum width in the path pruning phase. In summary, the advantages of PoG can be summarized as follows:
- Dynamic deep search: Guided by LLMs, PoG dynamically extracts multi-hop reasoning paths from KGs, enhancing LLM capabilities in complex knowledge-intensive tasks.
- Interpretable and faithful reasoning: By utilizing highly question-relevant knowledge paths, PoG improves the interpretability of LLM reasoning, enhancing the faithfulness and question-relatedness of LLM-generated content.
- Efficient pruning with graph structure integration: PoG incorporates efficient pruning techniques in both the KG and reasoning paths to reduce computational costs, mitigate LLM hallucinations caused by irrelevant noise, and effectively narrow down candidate answers.
- Flexibility and effectiveness: a) PoG is a plug-and-play framework that can be seamlessly applied to various LLMs and KGs. b) PoG allows frequent knowledge updates via the KG, avoiding the expensive and slow updates required for LLMs. c) PoG reduces the LLM's token usage by over 50% with only a ±2% difference in accuracy compared to the best-performing strategy. d) PoG achieves state-of-the-art results on all tested KGQA datasets, outperforming the strong baseline ToG by an average of 18.9% in accuracy with both GPT-3.5 and GPT-4. Notably, PoG with GPT-3.5 can outperform ToG with GPT-4 by up to 23.9%.
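The top- $W_{\max}$ path pruning mentioned above can be sketched in miniature. This is only an illustrative stand-in: PoG's actual pruning combines graph structure, LLM prompting, and SBERT embeddings, whereas here a toy lexical-overlap score replaces the learned similarity model, and `score` and `prune_paths` are hypothetical helper names.

```python
# Minimal stand-in for top-W_max path pruning (the beam-search step).
# A toy lexical-overlap score replaces the SBERT similarity model.
def score(question, path):
    """Fraction of question words that appear somewhere in the path."""
    q_words = set(question.lower().split())
    path_words = set(" ".join(" ".join(t) for t in path).lower().split())
    return len(q_words & path_words) / len(q_words)

def prune_paths(question, paths, w_max):
    """Keep the top-w_max candidate paths by relevance score."""
    ranked = sorted(paths, key=lambda p: score(question, p), reverse=True)
    return ranked[:w_max]
```

With the running example, a path through Nijmegen's airport scores higher than an unrelated France-to-Europe path and survives the cut.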
2. Related Work
KG-based LLM reasoning. KGs provide structured knowledge valuable for integration with LLMs (Pan et al., 2024). Early studies (Peters et al., 2019; Luo et al., 2024; Zhang et al., 2021; Li et al., 2023b) embed KG knowledge into neural networks during pre-training or fine-tuning, but this can reduce explainability and hinder efficient knowledge updating (Pan et al., 2024). Recent methods combine KGs with LLMs by converting relevant knowledge into textual prompts, often ignoring structural information (Pan et al., 2024; Wen et al., 2024). Advanced works (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) involve LLMs directly exploring KGs, starting from an initial entity and iteratively retrieving and refining reasoning paths until the LLM decides the augmented knowledge is sufficient. However, by starting from a single vertex and ignoring the question’s position within the KG’s structure, these methods overlook multiple topic entities and the explainability provided by multi-entity paths.
Reasoning with LLM prompting. LLMs have shown significant potential in solving complex tasks through effective prompting strategies. Chain of Thought (CoT) prompting (Wei et al., 2022) enhances reasoning by following logical steps in few-shot learning. Extensions like Auto-CoT (Zhang et al., 2023), Complex-CoT (Fu et al., 2022), CoT-SC (Wang et al., 2022), Zero-Shot CoT (Kojima et al., 2022), ToT (Yao et al., 2024), and GoT (Besta et al., 2024) build upon this approach. However, these methods often rely solely on knowledge present in training data, limiting their ability to handle knowledge-intensive or deep reasoning tasks. To solve this, some studies integrate external KBs using plan-and-retrieval methods such as CoK (Li et al., 2023b), RoG (Luo et al., 2024), and ReAct (Yao et al., 2022), decomposing complex questions into subtasks to reduce hallucinations. However, they may focus on the initial steps of sub-problems and overlook the later steps leading to the final answer, yielding locally optimal solutions instead of globally optimal ones. To address these deep reasoning challenges, we introduce dynamic multi-hop question reasoning. By adaptively determining reasoning depths for different questions, we enable the model to handle varying complexities effectively.
KG information pruning. Graphs are widely used to model complex relationships among different entities (Tan et al., 2023; Sima et al., 2024; Li et al., 2024b, a). KGs contain vast amounts of facts (Guo et al., 2019; Wu et al., 2024; Wang et al., 2024a), making it impractical to involve all relevant triples in the context of the LLM due to high costs and potential noise (Wang et al., 2024b). Existing methods (Sun et al., 2024; Jiang et al., 2023; Ma et al., 2024) typically identify initial entities and iteratively retrieve reasoning paths until an answer is reached, often treating the LLM as a function executor and relying on in-context learning or fine-tuning, which is expensive. Some works attempt to reduce pruning costs. KAPING (Baek et al., 2023a) projects questions and triples into the same semantic space to retrieve relevant knowledge via similarity measures. KG-GPT (Kim et al., 2023) decomposes complex questions, matches, and selects the relevant relations with sub-questions to form evidence triples. However, these methods often overlook the overall graph structure and the interrelations among multiple topic entities, leading to suboptimal pruning and reasoning performance.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Flowchart: Knowledge Graph Question Answering System
### Overview
The image depicts a multi-stage flowchart for a knowledge graph-based question answering system. It illustrates the process of extracting answers from structured data through entity recognition, path exploration, and logical inference. The diagram uses color-coded nodes and directional arrows to represent relationships between entities and processes.
### Components/Axes
1. **Legend** (Top-right corner):
- Red: Topic Entity
- Blue: Airport
- Yellow: France
- Gray: Unnamed Entity
- Purple: Time Zone
- Green: Answer
- Orange: Path Summarizing
2. **Main Flowchart Sections**:
- **Initialization** (Left column):
- Topic Entity Recognition
- Question Subgraph Detection
- Split Questions
- LLM Indicator
- Ordered Entities
- **Exploration** (Center):
- Topic Entity Path Exploration
- LLM Supplement Path Exploration
- Node Expand Exploration
- **Question Answering** (Right column):
- Path Summarizing
- Answer (Green node)
3. **Subprocesses**:
- Fuzzy Selection
- Precise Path Selection
- Branch Reduced Selection
- Path Pruning
### Detailed Analysis
1. **Initialization Phase**:
- Starts with a question: "What country bordering France contains an airport that serves Nijmegen?"
- Uses Topic Entity Recognition to identify key entities (France, Nijmegen, airports).
- Detects question subgraphs and splits complex queries into ordered entities.
2. **Exploration Phase**:
- Explores multiple paths through the knowledge graph:
- **Topic Entity Path**: Traverses relationships like "olympic athletes" → "2000 Olympics" → "Netherlands".
- **LLM Supplement Path**: Uses language models to expand node connections (e.g., "Germany" → "Europe").
- **Node Expand Exploration**: Adds contextual relationships (e.g., "Central European Time Zone").
3. **Question Answering Phase**:
- Combines paths through fuzzy selection (broad indicators) and precise path selection (specific routes).
- Prunes irrelevant branches and summarizes valid paths to derive answers.
### Key Observations
1. **Color-Coded Relationships**:
- Red nodes (Topic Entities) form the core of the knowledge graph.
- Blue nodes (Airports) connect to geographic entities (Netherlands, Germany).
- Yellow nodes (France) act as central reference points.
2. **Temporal and Spatial Context**:
- Time zones (gray nodes) and geographic divisions (e.g., "Central European Time Zone") are integrated into the graph.
3. **LLM Integration**:
- Language models are used at multiple stages to expand node relationships and validate paths.
### Interpretation
This flowchart represents a hybrid system combining structured knowledge graphs with LLM capabilities. The process begins with precise entity recognition, then explores multiple logical paths through the graph using both rule-based and AI-driven methods. The final answer emerges from synthesizing validated paths, demonstrating how structured data and AI can collaborate to answer complex geographical queries. The use of color coding and directional flow emphasizes the systematic traversal of relationships, while the inclusion of time zones and administrative divisions highlights the system's ability to handle multi-dimensional data.
</details>
Figure 2. Overview of the PoG architecture. Exploration: After initialization (detailed in Figure 3), the model retrieves entity paths from $\mathcal{G}_{q}$ through three exploration phases. Path Pruning: PoG applies a three-step beam search to prune paths after each exploration phase. Question Answering: The pruned paths are then evaluated for question answering. If these paths do not fully answer the question, the model explores deeper paths until $D_{max}$ is reached or moves on to the next exploration phase.
3. Preliminary
Consider a Knowledge Graph (KG) $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T})$ , where $\mathcal{E}$ , $\mathcal{R}$ and $\mathcal{T}$ represent the set of entities, relations, and knowledge triples, respectively. Each knowledge triple $T∈\mathcal{T}$ encapsulates the factual knowledge in $\mathcal{G}$ , and is represented as $T=(e_{h},r,e_{t})$ , where $e_{h},e_{t}∈\mathcal{E}$ and $r∈\mathcal{R}$ . Given an entity set $\mathcal{E}_{S}\subseteq\mathcal{E}$ , the induced subgraph of $\mathcal{E}_{S}$ is denoted as $\mathcal{S}=(\mathcal{E}_{S},\mathcal{R}_{S},\mathcal{T}_{S})$ , where $\mathcal{T}_{S}=\{(e,r,e^{\prime})∈\mathcal{T}\mid e,e^{\prime}∈\mathcal{E}_{S}\}$ and $\mathcal{R}_{S}=\{r∈\mathcal{R}\mid(e,r,e^{\prime})∈\mathcal{T}_{S}\}$ . Furthermore, we denote $\mathcal{D}(e)$ and $\mathcal{D}(r)$ as the sets of short textual descriptions for each entity $e∈\mathcal{E}$ and each relation $r∈\mathcal{R}$ , respectively. For example, the text description of the entity “m.0f8l9c” is $\mathcal{D}(\text{“m.0f8l9c”})=$ “France”. For simplicity, in this paper, all entities and relations are referenced through their $\mathcal{D}$ representations and transformed into natural language.
**Definition 1 (Reasoning Path).**
*Given a KG $\mathcal{G}$ , a reasoning path within $\mathcal{G}$ is defined as a connected sequence of knowledge triples, represented as: $path_{\mathcal{G}}(e_{1},e_{l+1})=\{T_{1},T_{2},\dots,T_{l}\}=\{(e_{1},r_{1},e_{2}),(e_{2},r_{2},e_{3}),\dots,(e_{l},r_{l},e_{l+1})\}$ , where $T_{i}∈\mathcal{T}$ denotes the $i$ -th triple in the path and $l$ denotes the length of the path, i.e., $length(path_{\mathcal{G}}(e_{1},e_{l+1}))=l$ .*
**Example 1.**
*Consider a reasoning path between “University” and “Student” in a KG: $path_{\mathcal{G}}(\text{University},\text{Student})=\{(\text{University},\text{employs},\text{Professor}),$ $(\text{Professor},\text{teaches},\text{Course}),$ $(\text{Course},\text{enrolled\_in},\text{Student})\}$ , which can be visualized as:
$$
\text{University}\xrightarrow{\text{employs}}\text{Professor}\xrightarrow{\text{teaches}}\text{Course}\xrightarrow{\text{enrolled\_in}}\text{Student}.
$$
It indicates that a “University” employs a “Professor,” who teaches a “Course,” in which a “Student” is enrolled. The length of the path is 3.*
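In code, a reasoning path is simply an ordered list of triples. The sketch below (with illustrative helper names, not from the paper) checks the connectivity condition of the definition above on the University example:

```python
# A reasoning path is an ordered list of (head, relation, tail) triples.
path = [
    ("University", "employs", "Professor"),
    ("Professor", "teaches", "Course"),
    ("Course", "enrolled_in", "Student"),
]

def is_connected(path):
    """Consecutive triples must share an entity: tail_i == head_{i+1}."""
    return all(path[i][2] == path[i + 1][0] for i in range(len(path) - 1))

def verbalize(path):
    """Render the path in the arrow notation shown above."""
    out = path[0][0]
    for _, rel, tail in path:
        out += f" --{rel}--> {tail}"
    return out
```

Here `len(path)` equals the path length $l=3$, matching the example.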
For any entities $s$ and $t$ in $\mathcal{G}$ , if there exists a reasoning path between $s$ and $t$ , we say $s$ and $t$ can reach each other, denoted as $s\leftrightarrow t$ . The distance between $s$ and $t$ in $\mathcal{G}$ , denoted as $dist_{\mathcal{G}}(s,t)$ , is the length of the shortest reasoning path between $s$ and $t$ . For non-reachable vertices, the distance is infinite. Given a positive integer $h$ , the $h$ -hop neighbors of an entity $s$ in $\mathcal{G}$ are defined as $N_{\mathcal{G}}(s,h)=\{t∈\mathcal{E}\mid dist_{\mathcal{G}}(s,t)≤ h\}$ .
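The $h$-hop neighborhood $N_{\mathcal{G}}(s,h)$ can be computed with a breadth-first search over an undirected view of the triples. A minimal sketch on toy triples (illustrative entity and relation names, not actual Freebase identifiers):

```python
from collections import deque

# Toy triples, traversed as an undirected graph.
triples = [
    ("Nijmegen", "served_by", "Weeze Airport"),
    ("Weeze Airport", "containedby", "Germany"),
    ("Germany", "adjoin_s", "France"),
]

def h_hop_neighbors(triples, s, h):
    """BFS up to depth h from entity s."""
    adj = {}
    for e1, _, e2 in triples:
        adj.setdefault(e1, set()).add(e2)
        adj.setdefault(e2, set()).add(e1)
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == h:
            continue  # do not expand past depth h
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    seen.discard(s)  # exclude s itself: a reasoning path has length >= 1
    return seen
```

On this toy graph, Germany enters the neighborhood at $h=2$ and France at $h=3$.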
**Definition 2 (Entity Path).**
*Given a KG $\mathcal{G}$ and a list of entities $list_{e}=[e_{1},e_{2},e_{3},\dots,e_{l}]$ , the entity path of $list_{e}$ is defined as a connected sequence of reasoning paths, denoted as $path_{\mathcal{G}}(list_{e})=\{path_{\mathcal{G}}(e_{1},e_{2}),path_{\mathcal{G}}(e_{2},e_{3}),\dots,path_{\mathcal{G}}(e_{l-1},e_{l})\}=\{(e_{s},r,e_{t})\mid(e_{s},r,e_{t})∈ path_{\mathcal{G}}(e_{i},e_{i+1})\land 1≤ i<l\}$ .*
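An entity path can be built by concatenating a reasoning path between each consecutive pair of entities in the list. The sketch below uses BFS shortest paths on toy triples; helper names are illustrative, and PoG itself explores paths of varying depths rather than only shortest ones.

```python
from collections import deque

def shortest_path(triples, s, t):
    """BFS shortest reasoning path from s to t (undirected traversal)."""
    adj = {}
    for e1, r, e2 in triples:
        adj.setdefault(e1, []).append((r, e2))
        adj.setdefault(e2, []).append((r, e1))
    prev = {s: None}
    q = deque([s])
    while q:
        node = q.popleft()
        if node == t:
            break
        for r, nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = (node, r)
                q.append(nxt)
    if t not in prev:
        return None  # not reachable: distance is infinite
    path, node = [], t
    while prev[node] is not None:
        par, r = prev[node]
        path.append((par, r, node))
        node = par
    return path[::-1]

def entity_path(triples, entities):
    """Concatenate reasoning paths between consecutive entities."""
    full = []
    for a, b in zip(entities, entities[1:]):
        seg = shortest_path(triples, a, b)
        if seg is None:
            return None
        full.extend(seg)
    return full
```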
Knowledge Graph Question Answering (KGQA) is a fundamental reasoning task based on KGs. Given a natural language question $q$ and a KG $\mathcal{G}$ , the objective is to devise a function $f$ that predicts answers $a∈ Answer(q)$ utilizing knowledge encapsulated in $\mathcal{G}$ , i.e., $a=f(q,\mathcal{G})$ . Consistent with previous research (Sun et al., 2019; Luo et al., 2024; Sun et al., 2024; Ma et al., 2024), we assume the topic entities $Topic(q)$ mentioned in $q$ and the answer entities $Answer(q)$ in the ground truth are linked to the corresponding entities in $\mathcal{G}$ , i.e., $Topic(q)\subseteq\mathcal{E}$ and $Answer(q)\subseteq\mathcal{E}$ .
4. Method
PoG implements “KG-based LLM reasoning” by first exploring all possible faithful reasoning paths and then collaborating with the LLM to perform a three-step beam search over the retrieved paths. Compared to previous approaches (Sun et al., 2024; Ma et al., 2024), our model focuses on providing more accurate and question-relevant retrieval-augmented graph information. The framework of PoG is outlined in Figure 2, comprising four main components.
- Initialization. The process begins by identifying the set of topic entities from the input question, and then queries the source KG $\mathcal{G}$ by exploring up to $D_{\max}$ hops from each topic entity to construct the evidence subgraph $\mathcal{G}_{q}$ , where $D_{\max}$ is the user-defined maximum exploration depth. Subsequently, we prompt the LLM to analyze the question and generate an indicator that serves as a strategy for answer formulation and predicts the exploration depth $D_{\text{predict}}$ .
- Exploration. After initialization, the model retrieves topic entity paths from $\mathcal{G}_{q}$ through three exploration phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. All reasoning paths are constrained within the depth range $D∈[D_{\text{predict}},D_{\max}]$ .
- Path Pruning. Following each exploration phase, PoG employs a pre-trained LM, LLM prompting, and graph structural analysis to perform a three-step beam search. The pruned paths are then evaluated in the question answering.
- Question Answering. Finally, the LLM is prompted to assess whether the pruned reasoning paths sufficiently answer the question. If not, PoG incrementally explores deeper paths until $D_{\max}$ is exceeded, then proceeds to the next exploration phase.
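The four components interact in a simple control loop, sketched below with stub phase functions. This is a structural sketch only: in the real system, each step calls an LLM and a KG backend, and the function names are illustrative.

```python
# Control-flow sketch of the explore / prune / answer loop.
def pog_answer(question, d_predict, d_max, phases, prune, can_answer):
    for explore in phases:  # topic-entity, LLM-supplement, node-expand
        for depth in range(d_predict, d_max + 1):
            paths = explore(question, depth)   # retrieve candidate paths
            paths = prune(question, paths)     # three-step beam search
            ans = can_answer(question, paths)  # LLM judges sufficiency
            if ans is not None:
                return ans
    return None  # unanswered within D_max across all phases
```

Swapping in real exploration and pruning functions recovers the pipeline of Figure 2; the stubs below only exercise the depth and phase ordering.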
<details>
<summary>x3.png Details</summary>

### Visual Description
## Flowchart: Knowledge Graph Question Answering System
### Overview
The image depicts a multi-stage knowledge graph processing pipeline for answering complex questions. It combines natural language processing with graph analysis to extract answers from a knowledge graph (G). The system handles spatial reasoning ("borders", "contains") and entity relationships through graph operations.
### Components/Axes
1. **Input Section** (Top-left)
- Input variables: G (Knowledge Graph), q (Question), D_max (Maximum distance)
- Example question: "What country borders France contains an airport that serves Nijmegen?"
- Visual elements: Purple box with question text, question mark icon
2. **Question Analysis** (Center-left)
- LLM Indicator: Splits question into sub-questions
- Output1: "Split_question1: What country contains an airport that serves Nijmegen?"
- Output2: "Split_question2: What country borders France?"
- Visual elements: Green box with split questions, arrows connecting components
3. **Knowledge Graph** (Top-right)
- Contains two topic entities: France (yellow) and Nijmegen (red)
- Relationships: "borders", "contains", "serves"
- Visual elements: Concentric circles with entity labels, knowledge graph network
4. **Graph Processing** (Bottom)
   - Graph Reduction: simplifies the knowledge graph to G_q
   - Node & Relation Clustering: groups nodes by color-coded relation type (orange "borders", red "contains", green "serves")
   - Graph Detection: identifies the subgraph connecting the topic entities France and Nijmegen within D_max hops
</details>
Figure 3. Overview of the initialization phase. Output 1: from the input question, the model identifies topic entities and prompts the LLM to decompose questions into split questions $q_{split}$ and generate an indicator $I_{LLM}$ . The indicator outlines a strategy for formulating the answer and predicts the exploration depth $D_{predict}$ . Output 2: the model queries the source KG up to $D_{max}$ -hop from identified topic entities, constructing and pruning the evidence subgraph $\mathcal{G}_{q}$ .
4.1. Initialization
The initialization has two main stages, i.e., question subgraph detection and question analysis. The framework is shown in Figure 3.
Question subgraph detection. Given a question $q$ , PoG initially identifies the question subgraph, which includes all the topic entities of $q$ and their $D_{\max}$ -hop neighbors.
Topic entity recognition. To identify the relevant subgraph, PoG first employs LLMs to extract potential topic entities from the question. Following identification, the process applies BERT-based similarity matching to align these candidate entities with entities in the KG. Specifically, as shown in Figure 3, we encode both the keywords and all KG entities into dense vector embeddings, denoted as $H_{T}$ and $H_{\mathcal{G}}$ . We then compute a cosine similarity matrix between these embeddings to determine the matches. For each keyword, the entities with the highest similarity scores are selected to form the set $Topic(q)$ . This set serves as the foundation for constructing the question subgraph in subsequent steps.
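The keyword-to-entity alignment above can be sketched as follows. The character-bigram `embed` is only a runnable stand-in for the BERT-based sentence encoder PoG actually uses, and `match_topic_entities` is a hypothetical helper name, not the paper's API:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in encoder: character-bigram counts. PoG encodes keywords and
    # KG entities with a BERT-based model; this toy version only keeps the
    # example self-contained and runnable.
    return Counter(text.lower()[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_topic_entities(keywords, kg_entities):
    # For each LLM-extracted keyword, keep the KG entity with the highest
    # cosine similarity, forming the set Topic(q).
    return {kw: max(kg_entities, key=lambda e: cosine(embed(kw), embed(e)))
            for kw in keywords}
```

In practice the similarity matrix would be computed in one batched matrix product over the dense embeddings rather than pairwise as here.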
Subgraph detection. Upon identifying the topic entities, PoG captures the induced subgraph $\mathcal{G}_{q}\subseteq\mathcal{G}$ by expanding around each entity $e$ in $Topic(q)$ . For each entity, we retrieve the knowledge triples associated with its $D_{\max}$ -hop neighbors, thereby incorporating query-relevant and faithful KG information into $\mathcal{G}_{q}$ . Through this process, we update $\mathcal{E}_{q}$ with newly added intermediate nodes that serve as bridging pathways between the topic entities. The resulting subgraph $\mathcal{G}_{q}$ is defined as $(\mathcal{E}_{q},\mathcal{R}_{q},\mathcal{T}_{q})$ , where $\mathcal{E}_{q}$ encompasses $Topic(q)$ together with the set $\{N_{\mathcal{G}}(e,D_{\max})\mid e\in Topic(q)\}$ , effectively linking all relevant entities and their connective paths within the defined hop distance. To interact with the KG, we utilize pre-defined SPARQL queries as detailed in Appendix D.
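A minimal sketch of the $D_{\max}$-hop expansion, assuming the KG is given as a list of $(h, r, t)$ triples (in PoG this retrieval is done through SPARQL against Freebase, not in memory; `k_hop_subgraph` is an illustrative name):

```python
from collections import deque

def k_hop_subgraph(triples, topic_entities, d_max):
    # Collect all triples within d_max hops of any topic entity, mimicking
    # the construction of the evidence subgraph G_q.
    adj = {}
    for h, r, t in triples:
        # Undirected view: traverse edges in both directions.
        adj.setdefault(h, []).append((r, t))
        adj.setdefault(t, []).append((r, h))
    kept, visited = set(), set(topic_entities)
    frontier = deque((e, 0) for e in topic_entities)
    while frontier:
        node, depth = frontier.popleft()
        if depth == d_max:
            continue
        for r, nbr in adj.get(node, []):
            # Note: reverse-direction edges are stored tail-first here;
            # a full implementation would preserve triple orientation.
            kept.add((node, r, nbr))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return kept, visited
```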
Graph pruning. To efficiently manage information overhead and reduce computational cost, we implement graph pruning on the question subgraph $\mathcal{G}_{q}$ using node and relation clustering alongside graph reduction techniques. As illustrated in Figure 3, node and relation clustering is achieved by compressing multiple nodes and their relations into supernodes, which aggregate information from the original entities and connections. For graph reduction, we employ bidirectional BFS to identify all paths connecting the topic entities. Based on these paths, we regenerate induced subgraphs that involve only the relevant connections, effectively excluding nodes and relations that lack strong relevance to the topic entities.
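The bidirectional-BFS graph reduction can be approximated as below for the two-topic-entity case: run a depth-bounded BFS from each endpoint and keep only nodes that lie on some connecting path of length at most $D_{\max}$. This is a sketch under that simplifying assumption, not the paper's exact algorithm:

```python
from collections import deque

def bfs_dist(adj, src, limit):
    # Hop distances from src, bounded by `limit`.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] == limit:
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def reduce_graph(adj, a, b, d_max):
    # Keep only nodes on some a-b path of length <= d_max; regenerate the
    # induced subgraph over those nodes (PoG's graph reduction idea).
    da, db = bfs_dist(adj, a, d_max), bfs_dist(adj, b, d_max)
    keep = {u for u in adj if u in da and u in db and da[u] + db[u] <= d_max}
    return {u: [v for v in adj[u] if v in keep] for u in keep}
```

Nodes such as dead-end branches off a topic entity (Spain below) are discarded, which is exactly the "lack strong relevance" pruning described above.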
Question analysis. To reduce hallucinations in LLMs, the question analysis phase is divided into two parts and executed within a single LLM call using an example-based prompt (shown in Appendix E). First, the complex question $q$ is decomposed into simpler questions based on the identified topic entities, each addressing their relationship to the potential answer. Addressing these simpler questions collectively guides the LLM to better answer the original query, thereby reducing hallucinations. Second, an LLM indicator is generated, encapsulating all topic entities and predicting the answer's position within a single chain of thought derived from the original question. This indicator highlights the relationships and ordering among the entities and the answer. Based on this, a predicted depth $D_{\text{predict}}$ is calculated, defined as the maximum distance between the predicted answer and each topic entity. An example of question analysis with predicted depth 2 is shown in Figure 3.
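As a toy illustration of how $D_{\text{predict}}$ follows from the indicator, suppose the chain of thought is represented as an ordered list with an `"ANSWER"` placeholder (both the list representation and the placeholder name are illustrative assumptions, not the paper's format):

```python
def predicted_depth(chain, topic_entities, answer="ANSWER"):
    # D_predict = the maximum hop distance between the predicted answer
    # slot and each topic entity along the indicator's chain of thought.
    pos = {e: i for i, e in enumerate(chain)}
    return max(abs(pos[answer] - pos[e]) for e in topic_entities)
```

For the Figure 3 example (Nijmegen → airport → answer country ← France), the answer sits two hops from Nijmegen and one from France, giving $D_{\text{predict}}=2$.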
4.2. Exploration
As discussed in Section 1, identifying reasoning paths that encompass all topic entities is essential to derive accurate answers. These paths serve as interpretable chains of thought, providing both the answer and the inference steps leading to it, a feature we refer to as interpretability. To discover such paths efficiently and accurately, the exploration process is divided into three phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. After each phase, we perform path pruning and question answering. If a sufficient path is found, the process terminates; otherwise, it advances to the next phase to explore additional paths. Due to space limitations, the pseudo-code for the exploration phase is shown in Appendix A.1.
Topic entity path exploration. To reduce LLM usage and search space, PoG begins exploration from a predicted depth $D_{\text{predict}}$ rather than the maximum depth. Using the question subgraph $\mathcal{G}_{q}$ , topic entities $Topic(q)$ , LLM indicator $I_{\text{LLM}}$ , and $D_{\text{predict}}$ , PoG identifies reasoning paths containing all topic entities by iteratively adjusting the exploration depth $D$ . Entities in $Topic(q)$ are ordered according to $I_{\text{LLM}}$ to facilitate reasoning effectively. Starting from the predicted depth $D=min(D_{\text{predict}},D_{\text{max}})$ , we employ a bidirectional BFS to derive all potential entity paths, which is defined as:
$$
Paths_{t}=\{p\mid|Topic(q)|\times(D-1)<length(p)\leq|Topic(q)|\times D\},
$$
where $p=Path_{\mathcal{G}_{q}}(Topic(q))$ . To reduce the complexity, a pruning strategy is employed to select the top- $W_{\max}$ paths based on $Paths_{t}$ , $I_{\text{LLM}}$ , and the split questions from Section 4.1. These paths are then evaluated for sufficiency. If they are inadequate, $D$ is incremented until $D_{\max}$ is reached, after which the next phase commences.
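The iterative deepening loop above can be sketched as follows, treating each path as a list of hops so that `len(p)` plays the role of $length(p)$; `sufficient` stands in for the LLM's sufficiency verification and `explore` is a hypothetical name:

```python
def explore(candidate_paths, num_topics, d_predict, d_max, sufficient):
    # Start from the predicted depth rather than depth 1 to cut LLM calls.
    d = min(d_predict, d_max)
    while d <= d_max:
        # Paths_t window: |Topic(q)|*(d-1) < length(p) <= |Topic(q)|*d
        lo, hi = num_topics * (d - 1), num_topics * d
        batch = [p for p in candidate_paths if lo < len(p) <= hi]
        if batch and sufficient(batch):
            return batch
        d += 1  # inadequate: deepen the search
    return []  # fall through to the next exploration phase
```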
LLM supplement path exploration. Traditional KG-based LLM reasoning often rephrases KG facts without utilizing the LLM’s inherent knowledge. To overcome this, PoG prompts LLMs to generate predictions based on path understanding and its implicit knowledge, providing additional relevant insights. It involves generating new LLM thinking indicators $I_{\text{Sup}}$ for predicted entities $e\in Predict(q)$ , and then using text similarity to verify and align them with entities in $\mathcal{E}_{q}$ . The supplementary entity list $List_{S}(e)=Topic(q)+e$ is built and ranked by $I_{\text{Sup}}$ to facilitate effective reasoning. Next, supplementary paths $Paths_{s}$ are derived from $List_{S}(e)$ in the evidence KG $\mathcal{G}_{q}$ with a fixed depth $D_{\max}$ :
$$
Paths_{s}=\{p\mid\text{length}(p)\leq|Topic(q)|\times D_{\max}\},
$$
where $p=Path_{\mathcal{G}_{q}}(List_{S}(e))$ . These paths, together with the new indicators, are evaluated in the same way as in the topic entity path exploration phase. The prompting template is shown in Appendix E.
Node expand exploration. If previous phases cannot yield sufficient paths, PoG proceeds to node expansion. Unlike previous methods (Sun et al., 2024; Ma et al., 2024) that separately explore relations and entities, PoG explores both simultaneously, leveraging clearer semantic information for easier integration with existing paths. During the exploration, PoG expands unvisited entities by 1-hop neighbors in $\mathcal{G}$ . New triples are merged into existing paths to form the new paths, followed by pruning and evaluation.
4.3. Path Pruning
As introduced in Section 2, KGs contain vast amounts of facts, making it impractical to involve all relevant triples in the LLM’s context due to high costs. To address this complexity and reduce LLM overhead, we utilize a three-step beam search for path pruning. The corresponding pseudo-code can be found in Appendix A.2.
Fuzzy selection. Considering that only a small subset of the generated paths is relevant, the initial step of our beam search performs fuzzy selection with a pre-trained language model (e.g., SentenceBERT (Reimers and Gurevych, 2019)) to quickly filter out irrelevant paths. As shown in Figure 2, we encode the LLM indicator $I_{\text{LLM}}$ (or $I_{\text{Sup}}$ ) and all reasoning paths into vector embeddings, denoted as $H_{I}$ and $H_{Paths}$ , and calculate the cosine similarities between them. The top- $W_{1}$ paths with the highest similarity scores are selected for further evaluation.
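Given precomputed embeddings (here plain lists standing in for SBERT vectors), the top-$W_{1}$ ranking reduces to a cosine sort; `fuzzy_select` is a hypothetical helper name:

```python
import math

def cos(u, v):
    # Cosine similarity of two dense vectors (assumed non-zero).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def fuzzy_select(h_indicator, h_paths, w1):
    # Rank candidate paths by similarity to the indicator embedding H_I
    # and keep the indices of the top-W1 (the cheap first pruning step).
    order = sorted(range(len(h_paths)),
                   key=lambda i: cos(h_indicator, h_paths[i]),
                   reverse=True)
    return order[:w1]
```

Because this step never calls the LLM, it can cheaply shrink thousands of candidates down to the $W_{1}=80$ used in the experiments before any prompting happens.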
Precise path selection. Following the initial fuzzy selection, the number of candidate paths is reduced to $W_{1}$ . At this stage, we prompt the LLM to select the top- $W_{\max}$ reasoning paths most likely to contain the correct answer. The specific prompt used to guide the LLM in this selection phase can be found in Appendix E.
Branch reduced selection. Since paths are represented in natural language and can be lengthy, processing them in full is costly for LLMs. We therefore implement a branch reduced selection method that exploits the graph structure, effectively balancing efficiency and accuracy. Starting with $D=1$ , for each entity $e$ in the entity list, we extract the initial $D$ -step prefix of every path in the candidate set $Paths_{c}$ into a new set $Paths_{e}$ . If the size of $Paths_{e}$ exceeds the maximum designated width $W_{\max}$ , these prefixes are pruned using precise path selection. The process iterates, incrementing $D$ , until $D$ reaches $D_{\max}$ . For example, as illustrated in Figure 2, with $W_{\max}=1$ , only the initial-step paths (depicted in green) are extracted for further examination, while paths represented by dashed lines are pruned. By limiting the number of tokens per LLM call, this selection method ensures the relevance and conciseness of the retained reasoning paths.
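The prefix-pruning loop can be sketched as follows, with `precise_select` standing in for the LLM-prompted precise path selection (any callable that keeps `w_max` of the given prefixes); the function name and the list-of-hops path encoding are illustrative assumptions:

```python
def branch_reduced_select(paths_c, w_max, d_max, precise_select):
    # At each depth d, group candidates by their first d steps; whenever
    # more than w_max distinct prefixes survive, ask the (stand-in) LLM
    # to keep the best w_max and drop every path outside those branches.
    for d in range(1, d_max + 1):
        prefixes = {tuple(p[:d]) for p in paths_c}
        if len(prefixes) > w_max:
            kept = precise_select(sorted(prefixes), w_max)
            paths_c = [p for p in paths_c if tuple(p[:d]) in kept]
    return paths_c
```

Since the LLM only ever sees short prefixes rather than full natural-language paths, each call stays small even when the candidate paths are long.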
Beam search strategy. Based on the three path pruning methods above, PoG can support various beam search strategies, ranging from non-reliant to fully reliant on LLMs. These strategies are selectable in a user-friendly manner, allowing flexibility based on the specific requirements of the task. We have defined four such strategies in Algorithm 2 of Appendix A.2.
4.4. Question Answering
Based on the pruned paths in Section 4.3, we introduce a two-step question-answering method.
Path summarizing. To address hallucinations caused by paths with excessive or incorrect text, we develop a summarization strategy that prompts the LLM to review and extract the relevant triples from the provided paths, creating a concise and focused path. Details of the prompts used are in Appendix E.
Question answering. Based on the current reasoning paths derived from path pruning and summarizing, we prompt the LLM to first evaluate whether the paths are sufficient for answering the split questions and then the main question. If the evaluation is positive, the LLM is prompted to generate the answer using these paths, together with the question and the question analysis results, as shown in Figure 2. The prompts for evaluation and generation are detailed in Appendix E. If the evaluation is negative, the exploration process is repeated until completion. If node expand exploration reaches its depth limit without yielding a satisfactory answer, the LLM leverages both the provided and its inherent knowledge to formulate a response.
5. Experiments
Table 1. Results of PoG across various datasets, compared with the state-of-the-art (SOTA) in Supervised Learning (SL) and In-Context Learning (ICL) methods. The highest scores for ICL methods are highlighted in bold, while the second-best results are underlined. The Prior FT (Fine-tuned) SOTA includes the best-known results achieved through supervised learning.
| Method | Class | LLM | Multi-Hop KGQA | | | Single-Hop KGQA | Open-Domain QA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | | CWQ | WebQSP | GrailQA | Simple Questions | WebQuestions |
| Without external knowledge | | | | | | | |
| IO prompt (Sun et al., 2024) | - | GPT-3.5-Turbo | 37.6 | 63.3 | 29.4 | 20.0 | 48.7 |
| CoT (Sun et al., 2024) | - | GPT-3.5-Turbo | 38.8 | 62.2 | 28.1 | 20.3 | 48.5 |
| SC (Sun et al., 2024) | - | GPT-3.5-Turbo | 45.4 | 61.1 | 29.6 | 18.9 | 50.3 |
| With external knowledge | | | | | | | |
| Prior FT SOTA | SL | - | 70.4 (Das et al., 2021) | 85.7 (Luo et al., 2024) | 75.4 (Gu et al., 2023) | 85.8 (Baek et al., 2023b) | 56.3 (Kedia et al., 2022) |
| KB-BINDER (Li et al., 2023a) | ICL | Codex | - | 74.4 | 58.5 | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-3.5-Turbo | 58.9 | 76.2 | 68.7 | 53.6 | 54.5 |
| ToG-2.0 (Ma et al., 2024) | ICL | GPT-3.5-Turbo | - | 81.1 | - | - | - |
| ToG/ToG-R (Sun et al., 2024) | ICL | GPT-4 | 69.5 | 82.6 | 81.4 | 66.7 | 57.9 |
| PoG-E | ICL | GPT-3.5-Turbo | 71.9 | 90.9 | 87.6 | 78.3 | 76.9 |
| PoG | ICL | GPT-3.5-Turbo | 74.7 | 93.9 | 91.6 | 80.8 | 81.8 |
| PoG-E | ICL | GPT-4 | 78.5 | 95.4 | 91.4 | 81.2 | 82.0 |
| PoG | ICL | GPT-4 | 81.4 | 96.7 | 94.4 | 84.0 | 84.6 |
Experimental settings. We evaluate PoG on five KGQA datasets, i.e., CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), SimpleQuestions (Petrochuk and Zettlemoyer, 2018), and WebQuestions (Berant et al., 2013). PoG is tested against methods without external knowledge (IO, CoT (Wei et al., 2022), SC (Wang et al., 2022)) and the state-of-the-art (SOTA) approaches with external knowledge, including prompting-based and fine-tuning-based methods. Freebase (Bollacker et al., 2008) serves as the background knowledge graph for all datasets. Experiments are conducted using two LLMs, i.e., GPT-3.5 (GPT-3.5-Turbo) and GPT-4. Following prior studies, we use exact match accuracy (Hits@1) as the evaluation metric. Due to the space limitation, detailed experimental settings, including dataset statistics, baselines, and implementation details, are provided in Appendix C.
PoG setting. We adopt the Fuzzy + Precise Path Selection strategy in Algorithm 2 of Appendix A.2 for PoG, with $W_{1}=80$ for fuzzy selection. Additionally, we introduce PoG-E, which randomly selects one relation from each edge in the clustered question subgraph to evaluate the impact of graph structure on KG-based LLM reasoning. $W_{\max}$ and $D_{\max}$ are 3 by default for beam search.
5.1. Main Results
Since PoG leverages external knowledge to enhance LLM reasoning, we first compare it with other methods that utilize external knowledge. Although PoG is a training-free, prompting-based method and thus at a natural disadvantage against fine-tuned methods trained on the evaluation data, Table 1 shows that PoG with GPT-3.5-Turbo still achieves new SOTA performance across most datasets. Additionally, PoG with GPT-4 surpasses the fine-tuned SOTA across all the multi-hop and open-domain datasets by an average of 17.3% and up to 28.3% on the WebQuestions dataset. Among all in-context learning (ICL) methods, PoG with GPT-3.5-Turbo surpasses all previous SOTA methods. When comparing PoG with GPT-3.5-Turbo against the SOTA using GPT-4, PoG outperforms the SOTA by an average of 12.9% and up to 23.9%. When using the same LLM, PoG demonstrates substantial improvements: with GPT-3.5-Turbo, it outperforms the SOTA by an average of 21.2% and up to 27.3% on the WebQuestions dataset; with GPT-4, it outperforms the SOTA by 16.6% on average and up to 26.7% on the WebQuestions dataset. Additionally, PoG with GPT-3.5-Turbo outperforms methods without external knowledge (e.g., IO, CoT, and SC prompting) by 62% on GrailQA and 60.5% on Simple Questions. These results show that incorporating external knowledge graphs significantly enhances reasoning tasks. PoG-E also achieves excellent results: under GPT-4, it surpasses all ICL SOTA by 14.1% on average and up to 24.1% on the WebQuestions dataset. These findings demonstrate that the graph structure is crucial for reasoning tasks, particularly for complex logical reasoning. By integrating the structural information of the question within the graph, PoG enhances the deep reasoning capabilities of LLMs, leading to superior performance.
5.2. Ablation Study
We perform various ablation studies to understand the importance of different factors in PoG. These ablation studies are performed with GPT-3.5-Turbo on two subsets of the CWQ and WebQSP test sets, each containing 500 randomly sampled questions. Does search depth matter? As described, PoG’s dynamic deep search is limited by $D_{\max}$ . To assess the impact of $D_{\max}$ on performance, we conduct experiments with $D_{\max}$ ranging from 1 to 4. The results, shown in Figures 4 (a) and (c), indicate that performance improves with increased depth, but the benefits diminish beyond a depth of 3. Figures 4 (b) and (d), which show the exploration phase from which each answer is generated, reveal that higher depths reduce the effectiveness of both LLM-based path supplementation and node exploration. Excessive depth leads to LLM hallucinations and difficulties in managing long reasoning paths. Therefore, we set the maximum depth to 3 in our experiments to balance performance and computational efficiency. Additionally, even at lower depths, PoG maintains strong performance by effectively combining the LLM’s inherent knowledge with the structured information from the KG.
<details>
<summary>x4.png Details</summary>

### Visual Description
Line graph of accuracy (%) versus maximum depth $D_{\max}$ (1–4) on CWQ. PoG (blue triangles) rises from ~62.5% at $D_{\max}=1$ to ~80.5% at $D_{\max}=3$, then plateaus (~80.3% at $D_{\max}=4$). PoG-E (black diamonds) rises from ~55.5% to ~78% at $D_{\max}=3$, then drops to ~70% at $D_{\max}=4$. PoG outperforms PoG-E at every depth.
</details>
(a) CWQ (Vary $D_{\max}$ )
<details>
<summary>x5.png Details</summary>

### Visual Description
Stacked bar chart of accuracy (%) versus $D_{\max}$ (1–4) on CWQ, broken down by the exploration phase that produced the answer: topic entity path exploration (blue), LLM supplement path exploration (orange), and node expand exploration (gray); a dashed line marks total accuracy. Total accuracy rises from ~60% at $D_{\max}=1$ to ~80% at $D_{\max}=3$ and plateaus at $D_{\max}=4$. Topic entity path exploration contributes the largest share at every depth, while the other two phases' contributions shrink as $D_{\max}$ grows.
</details>
(b) CWQ(PoG)
<details>
<summary>x6.png Details</summary>

### Visual Description
Line graph of accuracy (%) versus $D_{\max}$ (1–4) on WebQSP. PoG (blue triangles) climbs from ~81% to a peak of ~94% at $D_{\max}=3$, then dips to ~92% at $D_{\max}=4$. PoG-E (black diamonds) climbs from ~82% to ~91% at $D_{\max}=3$, then declines to ~88%. Both methods peak at $D_{\max}=3$.
</details>
(c) WebQSP (Vary $D_{\max}$ )
<details>
<summary>x7.png Details</summary>

### Visual Description
Stacked bar chart of accuracy (%) versus $D_{\max}$ (1–4) on WebQSP by answering phase, with a dashed line for total accuracy. Total accuracy peaks at ~95% for $D_{\max}=3$ and drops slightly at $D_{\max}=4$. Topic entity path exploration dominates (~55–85% of the total), node expand exploration shrinks from ~25% to ~5%, and LLM supplement path exploration stays low (~5–10%) across all depths.
</details>
(d) WebQSP(PoG)
Figure 4. The accuracy of PoG and PoG-E among CWQ and WebQSP datasets by varying different $D_{\max}$ .
5.3. Effectiveness Evaluation
Effectiveness evaluation on multi-entity questions. To evaluate PoG’s performance on multi-entity questions, we report the accuracy on all test sets, categorizing questions by the number of topic entities. The results, shown in Table 2, demonstrate that, despite the increased complexity of multi-entity questions compared to single-entity ones, PoG maintains excellent accuracy, achieving up to 93.9% on the WebQSP dataset. This underscores the effectiveness of our structure-based model in handling complex multi-entity queries. Notably, the slightly lower performance on the GrailQA dataset can be attributed to some questions lacking matched topic entities, which prevents effective reasoning over the KG.
Table 2. Performance of PoG and PoG-E on multi-entity and single-entity questions of all datasets. The symbol ‘-’ indicates no multi-entity question inside.
| Question Set | CWQ | WebQSP | GrailQA | WebQuestions | Simple Questions |
| --- | --- | --- | --- | --- | --- |
| PoG with GPT-3.5-Turbo | | | | | |
| Single-entity | 70.3 | 93.9 | 92.1 | 81.7 | 78.3 |
| Multi-entity | 80.2 | 93.1 | 70.7 | 82.8 | - |
| PoG-E with GPT-3.5-Turbo | | | | | |
| Single-entity | 67.5 | 91.0 | 88.2 | 76.8 | 80.8 |
| Multi-entity | 77.5 | 82.8 | 76.0 | 82.8 | - |
Effectiveness evaluation on multi-hop reasoning. To assess PoG’s performance on multi-hop reasoning tasks, we analyze accuracy by categorizing questions based on the length of their ground-truth SPARQL queries. We randomly sample 1,000 questions from the CWQ and WebQSP datasets and determine the reasoning length of each question by counting the number of relations in its ground-truth SPARQL query. The distribution of questions with varying reasoning lengths is illustrated in Figure 5. We evaluate the performance of PoG and PoG-E across different ground-truth lengths to understand their effectiveness under varying query complexities. As shown in Figure 6, the performance of PoG and PoG-E remains consistent across different reasoning lengths. Even at the highest length levels in the WebQSP dataset, PoG achieves excellent accuracy, reaching up to 90%. Notably, although some questions have ground-truth lengths of eight or more, PoG successfully addresses them without matching the ground-truth length, demonstrating its ability to explore novel paths by effectively combining the LLM’s inherent knowledge with the structured information from the KG. These results demonstrate the effectiveness of PoG in handling complex multi-hop reasoning tasks.
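The relation-counting step can be approximated as below. This is a rough heuristic sketch, not the paper's parser: it assumes Freebase-style `ns:`-prefixed predicates of the form `ns:domain.type.property` and simply counts their occurrences in the WHERE clause:

```python
import re

def reasoning_length(sparql):
    # Approximate ground-truth reasoning length: count relation
    # (predicate) occurrences like ns:location.country.capital in the
    # WHERE clause of a Freebase SPARQL query.
    body = sparql.split("WHERE", 1)[-1]
    return len(re.findall(r"ns:\w+\.\w+\.\w+", body))
```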
Table 3. The illustration of graph size reduction.
| | CWQ | WebQSP | GrailQA | WebQuestions |
| --- | --- | --- | --- | --- |
| Avg. Entity Number | 3,540,267 | 243,826 | 62,524 | 240,863 |
| Avg. Entity Number After Pruning | 1,621,055 | 182,673 | 30,267 | 177,822 |
| Avg. Entity Reduction Proportion (%) | 54% | 25% | 52% | 26% |
Figure 5. The lengths of the ground-truth SPARQL queries within the CWQ and WebQSP datasets.
(a) CWQ
(b) WebQSP
Figure 6. The accuracy of PoG and PoG-E on the CWQ and WebQSP datasets, categorized by the ground-truth SPARQL query length of each question.
Graph structure pruning. To evaluate the effectiveness of the graph pruning method proposed in Section 4.1, we conduct experiments using 200 random samples from each dataset. We report the average number of entities per question before and after graph reduction, as well as the proportion of entities reduced, in Table 3. The results indicate that up to 54% of entities in the CWQ dataset can be pruned before path exploration. This demonstrates the effectiveness of eliminating irrelevant data from the outset.
Case study: interpretable reasoning. To demonstrate the interpretability of PoG, we conduct a case study with three reasoning examples, presented in Table 9 of Appendix B.5. These examples feature questions with one, two, and three entities, respectively. The case study showcases PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing faithful and interpretable reasoning paths that lead to accurate answers.
To further evaluate the effectiveness and efficiency of PoG, we perform additional experiments, including pruning beam search strategy ablation and prompt setting ablation (Appendix B.1), reasoning faithfulness analysis (Appendix B.2), error analysis (Appendix B.3), LLM calls cost and running time analysis (Appendix B.4), and graph reduction and path pruning case study (Appendix B.5).
6. Conclusion
In this paper, we introduce Paths-over-Graph (PoG), a novel method that integrates LLMs with KGs to enable faithful and interpretable reasoning. PoG addresses complex reasoning tasks through a three-phase dynamic multi-hop path exploration, combining the inherent knowledge of LLMs with factual information from KGs. Efficiency is enhanced by graph-structured pruning and a three-step pruning process that effectively narrows down candidate paths. Extensive experiments on five public datasets demonstrate that PoG outperforms existing baselines, showcasing its superior reasoning capabilities and interpretability.
7. Acknowledgment
Xiaoyang Wang is supported by the Australian Research Council DP230101445 and DP240101322. Wenjie Zhang is supported by the Australian Research Council DP230101445 and FT210100303.
References
- Baek et al. (2023b) Jinheon Baek, Alham Fikri Aji, Jens Lehmann, and Sung Ju Hwang. 2023b. Direct Fact Retrieval from Knowledge Graphs without Entity Linking. In ACL.
- Baek et al. (2023a) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023a. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136 (2023).
- Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, Vol. 38. 17682–17690.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
- Brown (2020) Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
- Chen et al. (2024) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter 25, 2 (2024), 42–61.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
- Das et al. (2021) Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In EMNLP.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. In ICLR.
- Gu et al. (2023) Yu Gu, Xiang Deng, and Yu Su. 2023. Don’t Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments. In ACL.
- Gu et al. (2021) Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases. In WWW. 3477–3488.
- Guo et al. (2023) Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066 (2023).
- Guo et al. (2019) Lingbing Guo, Zequn Sun, and Wei Hu. 2019. Learning to exploit long-term relational dependencies in knowledge graphs. In International conference on machine learning. PMLR, 2505–2514.
- Guo et al. (2024) Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu, Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. 2024. Knowledgenavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex & Intelligent Systems 10, 5 (2024), 7063–7076.
- Huang et al. (2025) Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao. 2025. Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models. In COLING.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645 (2023).
- Kedia et al. (2022) Akhil Kedia, Mohd Abbas Zaidi, and Haejun Lee. 2022. FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering. In EMNLP.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).
- Kim et al. (2023) Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. 2023. Kg-gpt: A general framework for reasoning on knowledge graphs using large language models. arXiv preprint arXiv:2310.11220 (2023).
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS (2022).
- Li et al. (2024a) Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024a. Hypergraph Self-supervised Learning with Sampling-efficient Signals. arXiv preprint arXiv:2404.11825 (2024).
- Li et al. (2024b) Fan Li, Zhiyu Xu, Dawei Cheng, and Xiaoyang Wang. 2024b. AdaRisk: Risk-adaptive Deep Reinforcement Learning for Vulnerable Nodes Detection. IEEE TKDE (2024).
- Li et al. (2023a) Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023a. Few-shot In-context Learning on Knowledge Base Question Answering. In ACL.
- Li et al. (2023b) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2023b. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. arXiv preprint arXiv:2305.13269 (2023).
- Luo et al. (2023) Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538 (2023).
- Luo et al. (2024) Linhao Luo, Yuanfang Li, Reza Haf, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In ICLR.
- Ma et al. (2024) Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. 2024. Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval. arXiv preprint arXiv:2407.10805 (2024).
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. TKDE (2024).
- Peters et al. (2019) Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
- Petrochuk and Zettlemoyer (2018) Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In EMNLP. 554–558.
- Petroni et al. (2020) Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).
- Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.
- Sima et al. (2024) Qing Sima, Jianke Yu, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024. Deep Overlapping Community Search via Subspace Embedding. arXiv preprint arXiv:2404.14692 (2024).
- Sun et al. (2019) Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019).
- Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In ICLR.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In NAACL.
- Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937 (2018).
- Tan et al. (2023) Xingyu Tan, Jingya Qian, Chen Chen, Sima Qing, Yanping Wu, Xiaoyang Wang, and Wenjie Zhang. 2023. Higher-Order Peak Decomposition. In CIKM.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Wang et al. (2024a) Jinghao Wang, Yanping Wu, Xiaoyang Wang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. 2024a. Efficient Influence Minimization via Node Blocking. Proc. VLDB Endow. (2024).
- Wang et al. (2024b) Kai Wang, Yuwei Xu, Zhiyong Wu, and Siqiang Luo. 2024b. LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs. In ACL.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS (2022).
- Wen et al. (2024) Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. In ACL.
- Wu et al. (2024) Yanping Wu, Renjie Sun, Xiaoyang Wang, Dong Wen, Ying Zhang, Lu Qin, and Xuemin Lin. 2024. Efficient Maximal Frequent Group Enumeration in Temporal Bipartite Graphs. Proc. VLDB Endow. (2024).
- Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS (2024).
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
- Ye et al. (2021) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 2021. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. arXiv preprint arXiv:2109.08678 (2021).
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In ACL. 201–206.
- Zhang et al. (2021) Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In ICML. 12437–12446.
- Zhang et al. (2023) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.
- Zhao et al. (2023) Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework. In ACL.
Appendix A Algorithm
A.1. Exploration
We summarize the comprehensive algorithmic procedure for exploration detailed in Section 4.2 as presented in Algorithm 1.
Input: Question subgraph ($\mathcal{G}_{q}$), source KG ($\mathcal{G}$), question and split question ($Q=q+q_{split}$), topic entities ($Topic(q)$), LLM indicator ($I_{\text{LLM}}$), predicted depth ($D_{\text{predict}}$), maximum depth ($D_{\max}$), maximum width ($W_{\max}$), and path pruning case ($case$)

Output: PoG answers ($a(q)$), final reasoning path ($Paths_{f}(q)$)

/* Topic entity path exploration */
$List_{T}←\text{Reorder}(Topic(q),I_{\text{LLM}})$; $D←\min(D_{\text{predict}},D_{\max})$;
while $D≤ D_{\max}$ do
  $Paths_{t}←\text{EntityPathFind}(List_{T},D,\mathcal{G}_{q})$;
  $\text{PathPruning}(Paths_{t},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  $Answer,Paths_{T}←\text{QuestionAnswering}(Paths_{t},Q,I_{\text{LLM}})$;
  if "{Yes}" in $Answer$ then return $Answer,Paths_{T}$; else $D← D+1$;

/* LLM supplement path exploration */
$Paths_{s}←[]$; $Predict(q)←\text{SupplementPrediction}(Paths_{T},Q,I_{\text{LLM}})$;
for each $e,I_{sup(e)}∈ Predict(q)$ do
  $List_{S}←\text{Reorder}(List_{T}+e,I_{sup(e)})$;
  $Paths_{s}^{\prime}←\text{EntityPathFind}(List_{S},D_{\max},\mathcal{G}_{q})$;
  $Paths_{s}← Paths_{s}+\text{FuzzySelect}(Paths_{s}^{\prime},I_{sup(e)},W_{\max})$;
$\text{PathPruning}(Paths_{s},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{S},case)$;
$Answer,Paths_{S}←\text{QuestionAnswering}(Paths_{s},Q,I_{\text{LLM}})$;
if "{Yes}" in $Answer$ then return $Answer,Paths_{S}$;

/* Node expansion exploration */
$Visited←\emptyset$; $D← 1$; $Paths_{e}← Paths_{T}+Paths_{S}$;
$\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
while $D≤ D_{\max}$ do
  for each $e∈\text{ExtractEntity}(Paths_{e})$ with $e∉ Visited$ do
    $Related\_edges←\text{Find\_1\_hop\_Edges}(\mathcal{G},e)$;
    $Paths_{e}←\text{MergeTogether}(Paths_{e},Related\_edges)$;
  $\text{PathPruning}(Paths_{e},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
  $Answer,Paths_{e}←\text{QuestionAnswering}(Paths_{e},Q,I_{\text{LLM}})$;
  if "{Yes}" in $Answer$ then return $Answer,Paths_{e}$; else $Visited← Visited∪\{e\}$; $D← D+1$;

/* Final answering over all explored paths */
$Paths_{l}← Paths_{T}+Paths_{S}+Paths_{E}$;
$\text{PathPruning}(Paths_{l},Q,I_{\text{LLM}},W_{\max},D_{\max},List_{T},case)$;
$Answer,Paths_{L}←\text{QuestionAnsweringFinal}(Paths_{l},Q,I_{\text{LLM}})$;
return $Answer,Paths_{L}$;

Algorithm 1 Exploration
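The iterative-deepening control flow of Algorithm 1's first phase can be sketched in Python. This is a simplified illustration, not the implementation: the callbacks `find_paths`, `prune`, and `answer` are hypothetical stand-ins for EntityPathFind, PathPruning, and QuestionAnswering.

```python
def explore_topic_paths(question, topic_entities, d_predict, d_max,
                        find_paths, prune, answer):
    """Phase 1 of Algorithm 1: iterative deepening over topic-entity
    paths. `answer` returns None when the LLM judges the current
    paths insufficient (i.e., no "{Yes}" in its reply)."""
    depth = min(d_predict, d_max)    # start at the LLM-predicted depth
    while depth <= d_max:
        paths = prune(find_paths(topic_entities, depth), question)
        result = answer(paths, question)
        if result is not None:       # evidence was sufficient: stop here
            return result, paths
        depth += 1                   # otherwise deepen and retry
    return None, []                  # fall through to the later phases
```

The supplement-exploration and node-expansion phases follow the same pattern, differing only in how candidate paths are generated.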
A.2. Path Pruning
We summarize the comprehensive algorithmic procedure of path pruning detailed in Section 4.3 as presented in Algorithm 2.
Input: Candidate paths ($Paths_{c}$), question and split question ($Q=q+q_{split}$), indicator ($I$), maximum width ($W_{\max}$), maximum depth ($D_{\max}$), entity list ($list$)

Output: Pruned candidate paths ($Paths_{c}$)

if $case$ = Fuzzy Selection Only then $\text{FuzzySelect}(Paths_{c},Q,I,W_{\max})$;
else if $case$ = Fuzzy + Precise Path Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$; $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$;
else if $case$ = Fuzzy + Branch Reduced Selection then
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$; $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{\max},D_{\max},list)$;
else if $case$ = Fuzzy + Branch Reduced + Precise Path then /* case = 3-Step Beam Search */
  $\text{FuzzySelect}(Paths_{c},Q,I,W_{1})$; $\text{BranchReduceSelect}(Paths_{c},Q,I,W_{2},D_{\max},list)$; $\text{PrecisePathSelect}(Paths_{c},Q,I,W_{\max})$;

Procedure BranchReduceSelect($Paths_{c},Q,I,W,D_{\max},list$)
  $D← 1$; $Paths_{e}←\emptyset$;
  while $|Paths_{c}|≥ W\land D≤ D_{\max}$ do
    for each $e∈ list$ do
      $Paths_{e}← Paths_{e}\cup\text{ExtractHeadSteps}(Paths_{c},e,D)$;
      if $|Paths_{e}|>W$ then $\text{PrecisePathSelect}(Paths_{e},Q,I,W)$;
    $Paths_{c}←\text{IntersectMatchUpdate}(Paths_{e},Paths_{c})$;
    $Paths_{e}←\emptyset$; $D← D+1$;
  if $|Paths_{c}|>W$ then $\text{PrecisePathSelect}(Paths_{c},Q,I,W)$;

Algorithm 2 PathPruning
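The core of BranchReduceSelect can be sketched as follows, assuming paths are represented as sequences of steps and `select` is a hypothetical stand-in for PrecisePathSelect that keeps at most `width` candidates.

```python
def branch_reduce_select(paths, width, d_max, select):
    """Sketch of BranchReduceSelect: score only the first D steps
    (path "heads") and retain the full paths whose head survives,
    so the selector sees short prefixes instead of whole paths.
    `select(heads, width)` returns at most `width` surviving heads."""
    depth = 1
    while len(paths) >= width and depth <= d_max:
        heads = {tuple(p[:depth]) for p in paths}   # ExtractHeadSteps
        if len(heads) > width:
            heads = set(select(heads, width))       # PrecisePathSelect
        # IntersectMatchUpdate: keep paths whose prefix was kept
        paths = [p for p in paths if tuple(p[:depth]) in heads]
        depth += 1
    return paths
```

Because heads grow one step at a time, each selection round compares far fewer and far shorter candidates than scoring every full path directly.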
Appendix B Experiment
B.1. Additional Ablation Study
Compare the effect of different beam search strategies. As introduced in Section 4.3, PoG supports various beam search strategies, ranging from non-reliant to fully reliant on LLMs, selectable in a user-friendly manner. To evaluate the computational cost and performance, we test the four cases outlined in Algorithm 2. In the 3-Step Beam Search case, we set $W_{2}=20$ for internal narrowing. The Fuzzy Selection approach, as described in Section 4.3, utilizes all candidate paths and an LLM-generated indicator for encoding and comparison. We report accuracy, the average total number of LLM calls, and the average token input during path pruning for each beam search strategy applied to PoG and PoG-E in Table 4 and Table 5. These results indicate that PoG/PoG-E with Fuzzy and Precise Path Selection achieves the highest accuracy. Additionally, the BranchReduced Selection method, which leverages the graph structure, not only delivers excellent results but also reduces token usage by over 50% (65%) with only a ±2% (±4.3%) difference in accuracy on PoG (PoG-E) compared to the best-performing strategy. Furthermore, the Fuzzy Selection method, which employs lightweight models instead of relying solely on LLMs, also demonstrates strong performance. These results validate the effectiveness of our beam search strategies and underscore the importance of structure-based faithful path reasoning.
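The Fuzzy Selection step can be sketched as an embed-and-rank routine. To keep the example dependency-free, a bag-of-words cosine similarity stands in for the SBERT encoder the paper uses; the function names and the verbalized path format are illustrative assumptions.

```python
import math
from collections import Counter

def fuzzy_select(paths, indicator, width):
    """Sketch of Fuzzy Selection: rank verbalized candidate paths by
    similarity to an LLM-generated indicator string and keep the top
    `width`. A bag-of-words cosine replaces the SBERT encoder here,
    purely so the example runs without external dependencies."""
    def embed(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    ind = embed(indicator)
    ranked = sorted(paths, key=lambda p: cosine(embed(p), ind), reverse=True)
    return ranked[:width]
```

Swapping the `embed` helper for a sentence-encoder model recovers the lightweight-model variant discussed above, with the LLM only consulted in the later, more precise selection steps.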
Table 4. Performance comparison of PoG with different beam search methods on CWQ and WebQSP.
| PoG | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 57.1 | 86.4 |
| | Token Input | - | - |
| | LLM Calls | 6.8 | 6.5 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 79.3 | 93.0 |
| | Token Input | 101,455 | 328,742 |
| | LLM Calls | 9.7 | 9.3 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 617,448 |
| | LLM Calls | 9.1 | 7.5 |
| w/ 3-Step Beam Search | Accuracy | 79.8 | 91.9 |
| | Token Input | 102,036 | 369,175 |
| | LLM Calls | 8.8 | 9.0 |
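As a hedged illustration of the Fuzzy Selection step, the sketch below ranks candidate paths against an LLM-generated indicator sentence and keeps the top-$W$ by cosine similarity. A toy bag-of-words encoder stands in for SBERT, and `encode`, `fuzzy_select`, and the sample paths are illustrative names and data, not our implementation.

```python
import re
from collections import Counter
from math import sqrt

def encode(text):
    # Toy bag-of-words vector; a real run would use a pre-trained
    # sentence encoder such as SBERT instead (assumption for this sketch).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fuzzy_select(candidate_paths, indicator, top_w):
    # Rank candidate paths by similarity to the LLM-generated indicator
    # and keep the top-W for the next exploration step.
    ind = encode(indicator)
    ranked = sorted(candidate_paths,
                    key=lambda p: cosine(encode(p), ind), reverse=True)
    return ranked[:top_w]

paths = [
    "Canberra -> capital_of -> Australia",
    "Canberra -> located_in -> Australian Capital Territory",
    "Canberra -> time_zone -> AEST",
]
best = fuzzy_select(paths, "which country is Canberra the capital of", top_w=1)
print(best)  # the capital_of path ranks first
```

Because only the lightweight encoder scores the candidates, this step incurs no LLM calls, which is why Fuzzy Selection shows the lowest call counts in Tables 4 and 5.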
How do summary prompts affect performance? Inspired by GoT (Besta et al., 2024), we utilize summary prompts to reduce LLM hallucinations and decrease computational costs. To evaluate their impact, we conduct an ablation study comparing PoG and PoG-E with and without path summarization. We measure both accuracy and the average token input to the LLM API during the path-pruning phase to assess effectiveness and efficiency. The results, presented in Table 6, show that using graph summaries increases accuracy by up to 10% on the CWQ dataset with PoG-E, while reducing token input by up to 36% on WebQSP. These results indicate that path summarization effectively minimizes LLM hallucinations, enhances the LLM’s understanding of the explored paths, facilitates answer retrieval, enables earlier termination of the reasoning process, and reduces costs.
Table 5. Performance comparison of PoG-E with different beam search methods among CWQ and WebQSP datasets.
| PoG-E | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| w/ Fuzzy Selection | Accuracy | 62.31 | 82.3 |
| | Token Input | - | - |
| | Avg. LLM Calls | 6 | 6.3 |
| w/ Fuzzy and BranchReduced Selection | Accuracy | 71.9 | 88.4 |
| | Token Input | 128,407 | 371,083 |
| | Avg. LLM Calls | 9.4 | 9.1 |
| w/ Fuzzy and Precise Path Selection | Accuracy | 80.4 | 91.4 |
| | Token Input | 344,747 | 603,261 |
| | Avg. LLM Calls | 8.3 | 7.4 |
| w/ 3-Step Beam Search | Accuracy | 73.87 | 89.4 |
| | Token Input | 120,159 | 411,283 |
| | Avg. LLM Calls | 8.3 | 9.1 |
Table 6. Performance comparison of PoG and PoG-E with and without path summarizing on CWQ and WebQSP datasets.
| Method | Evaluation | CWQ | WebQSP |
| --- | --- | --- | --- |
| PoG w/ Path Summarizing | Accuracy | 81.4 | 93.9 |
| | Token Input | 216,884 | 297,359 |
| PoG w/o Path Summarizing | Accuracy | 74.7 | 91.9 |
| | Token Input | 273,447 | 458,545 |
| PoG-E w/ Path Summarizing | Accuracy | 80.4 | 91.4 |
| | Token Input | 314,747 | 273,407 |
| PoG-E w/o Path Summarizing | Accuracy | 70.4 | 90.4 |
| | Token Input | 419,679 | 428,545 |
B.2. Reasoning Faithfulness Analysis
Overlap ratio between explored paths and ground-truth paths. We analyzed correctly answered samples from three datasets to investigate the overlap ratio between the paths $P$ explored by PoG and the ground-truth paths $S$ in SPARQL queries. The overlap ratio is defined as the proportion of overlapping relations to the total number of relations in the ground-truth SPARQL path:
$$
Ratio(P)=\frac{|Relation(P)\cap Relation(S)|}{|Relation(S)|},
$$
where $Relation(P)$ denotes the set of relations in path $P$. Figure 7 illustrates the distribution of questions across different overlap ratios. For the WebQSP dataset, PoG achieves the highest proportion of paths fully overlapping with the ground truth, accounting for approximately 60% of questions. In contrast, PoG-E applied to the GrailQA dataset shows the highest proportion of paths with up to 70% non-overlapping relations, indicating that PoG-E explores novel paths to derive the answers. The difference between PoG and PoG-E stems from PoG-E’s strategy of randomly selecting one related edge from each clustered edge. These results highlight the effectiveness of our structure-based path exploration method in generating diverse and accurate reasoning paths.
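Given the two relation sets, the ratio can be computed directly; a minimal sketch (the Freebase-style relation names are illustrative, not taken from the evaluated queries):

```python
def overlap_ratio(explored_relations, ground_truth_relations):
    # Ratio(P) = |Relation(P) ∩ Relation(S)| / |Relation(S)|
    p, s = set(explored_relations), set(ground_truth_relations)
    return len(p & s) / len(s) if s else 0.0

explored = {"people.person.nationality", "location.country.capital"}
ground_truth = {"people.person.nationality", "location.country.capital",
                "location.location.containedby"}
print(overlap_ratio(explored, ground_truth))  # 2 of 3 ground-truth relations covered
```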
(a) CWQ (PoG)
(b) CWQ (PoG-E)
(c) WebQSP (PoG)
(d) WebQSP (PoG-E)
(e) GrailQA (PoG)
(f) GrailQA (PoG-E)
Figure 7. The path overlap ratio of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
Evidence of answer exploration sources. We conduct an analysis of correctly answered samples from three datasets to investigate the sources of evidence used by the LLM in generating answers, as illustrated in Figure 8. Specifically, we categorize all generated answers into three cases: KG only, LLM-inspired KG, and KG-inspired LLM. In the KG only scenario, answers are generated solely based on KG paths. The LLM-inspired KG case involves the LLM predicting an answer using its inherent knowledge and subsequently using the KG to verify its correctness. Conversely, in the KG-inspired LLM case, the paths generated by the KG are insufficient to reach the answer, and the LLM supplements the reasoning using its inherent knowledge. As shown in the figure, up to 14% of answers are generated through the KG-inspired LLM approach, and up to 9% involve LLM-inspired KG path supplementation. Compared to previous work that integrates LLM inherent knowledge with KG data (Sun et al., 2024), PoG more effectively leverages the strengths of both sources. These results demonstrate that PoG is a faithful reasoning method that primarily relies on KG-based reasoning while being supplemented by the LLM, ensuring both accuracy and interpretability in answer generation.
(a) PoG
(b) PoG-E
Figure 8. The proportions of answer evidence of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.
B.3. Error Analysis
To further analyze the integration of LLMs and KGs, we conduct an error analysis on the CWQ, WebQSP, and GrailQA datasets. We categorize errors into four types: (1) answer generation error, (2) refuse error, (3) format error, and (4) other hallucination errors. Note that an answer generation error occurs when PoG provides an accurate reasoning path, but the LLM fails to extract the correct answer from it. The distribution of these error types is illustrated in Figure 9. The results indicate that using more powerful LLMs reduces the number of “other hallucination errors,” “refuse errors,” and “answer generation errors,” as the model offers enhanced reasoning capabilities based on the retrieved data. Specifically, the reduction in “answer generation errors” shows that the reasoning paths provided by PoG are effectively utilized by more advanced LLMs. However, we observe an increase in “format errors” with more powerful LLMs, which may be attributed to their greater creative flexibility.
Figure 9. The error instances and categories of PoG and PoG-E in the CWQ, WebQSP, and GrailQA datasets.
B.4. Efficiency Analysis
LLM calls cost analysis. To evaluate the cost and efficiency of utilizing LLMs, we conducted an analysis of LLM calls on the CWQ, WebQSP, and GrailQA datasets. Initially, we examined the proportion of questions answered with varying numbers of LLM calls, as depicted in Figure 10. The results indicate that the majority of questions are answered within nine LLM calls across all datasets, with approximately 80% and 50% of questions being resolved within six calls on CWQ and WebQSP, respectively. These findings demonstrate PoG’s efficiency in minimizing LLM usage costs. Furthermore, we compared the average number of LLM calls required by PoG with the current SOTA method, ToG (Sun et al., 2024), as shown in Table 7. Since we utilized identical datasets for WebQSP, GrailQA, Simple Questions, and WebQuestions, we report the ToG results from their paper. The comparison reveals that PoG achieves comparable or superior accuracy while reducing the number of LLM calls by up to 40% on the GrailQA dataset compared to ToG. This improvement is attributed to PoG’s dynamic exploration strategy, which avoids starting from scratch, and its effective use of graph structures to prune irrelevant information, thereby significantly decreasing computational costs.
Figure 10. The proportion of questions answered by PoG and PoG-E with different numbers of LLM calls on the CWQ, WebQSP, and GrailQA datasets.
Table 7. Average LLM calls per question of PoG and ToG among all datasets.
| Method | CWQ | WebQSP | GrailQA | Simple Questions | WebQuestions |
| --- | --- | --- | --- | --- | --- |
| PoG | 10.7 | 8.3 | 6.5 | 6.1 | 9.3 |
| ToG | - | 11.2 | 10.6 | 8.7 | 10.5 |
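The per-question distribution behind Figure 10 can be reproduced by bucketing call counts into the figure's ranges; a minimal sketch with made-up call counts (`bucket_llm_calls` is an illustrative helper, not part of PoG's code):

```python
def bucket_llm_calls(call_counts):
    # Group per-question LLM call counts into the ranges used in
    # Figure 10 and return each bucket's share as a percentage.
    edges = [(1, 3), (4, 6), (7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]
    labels = ["(0,3]", "(3,6]", "(6,9]", "(9,12]",
              "(12,15]", "(15,18]", "(18,21]", "21+"]
    counts = [0] * len(labels)
    for c in call_counts:
        for i, (lo, hi) in enumerate(edges):
            if lo <= c <= hi:
                counts[i] += 1
                break
        else:  # no range matched: more than 21 calls
            counts[-1] += 1
    total = len(call_counts)
    return {lab: 100 * n / total for lab, n in zip(labels, counts)}

sample = [4, 5, 5, 6, 7, 8, 8, 9, 11, 23]  # illustrative per-question counts
dist = bucket_llm_calls(sample)
print(dist["(3,6]"], dist["(6,9]"], dist["21+"])
```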
Running time analysis. To further evaluate the processing efficiency of PoG, we conduct extensive evaluations of average accuracy, LLM call frequency, and running time on the multi-hop question datasets GrailQA and CWQ. The results, presented in Table 8, compare the performance of ToG and PoG across various beam search strategies. The data indicate that while higher accuracy slightly increases runtime, PoG effectively balances accuracy with reduced LLM call costs. Specifically, PoG reduces LLM call costs by up to 53.5% while increasing accuracy by up to 33.4%. This allows users to tailor the PoG framework to their specific needs regarding accuracy, cost, and runtime. All PoG settings incur significantly lower costs than ToG. For instance, on the GrailQA dataset, PoG with 3-step beam search reduces LLM call costs by 39.6% and improves accuracy by 33.1%, with only a 1.14% increase in runtime. A similar trend is observed on the CWQ dataset.
Table 8. Running time evaluation of ToG and PoG with different beam search methods on CWQ and GrailQA.
| Model | Evaluation | CWQ | GrailQA |
| --- | --- | --- | --- |
| ToG | Accuracy | 53.1 | 59.3 |
|  | Time (s) | 78.7 | 14.8 |
|  | LLM Calls | 21.3 | 10.1 |
| PoG | Accuracy | 81.4 | 92.7 |
|  | Time (s) | 118.9 | 21.4 |
|  | LLM Calls | 9.1 | 6.5 |
| PoG w/ 3-Step Beam Search | Accuracy | 79.8 | 92.4 |
|  | Time (s) | 87.5 | 15.0 |
|  | LLM Calls | 8.8 | 6.1 |
| PoG w/ Fuzzy Selection | Accuracy | 57.1 | 83.0 |
|  | Time (s) | 109.7 | 15.7 |
|  | LLM Calls | 6.8 | 4.7 |
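The percentages quoted in the running time analysis follow directly from Table 8; as a sanity check, the arithmetic (with the GrailQA column hard-coded from the table) is:

```python
# Deltas between ToG and PoG variants on GrailQA, taken from Table 8.
tog_calls, tog_acc = 10.1, 59.3      # ToG: avg LLM calls, accuracy
pog_acc = 92.7                       # full PoG accuracy
beam_calls, beam_acc = 6.1, 92.4     # PoG w/ 3-step beam search
fuzzy_calls = 4.7                    # PoG w/ fuzzy selection

beam_call_cut = (tog_calls - beam_calls) / tog_calls * 100   # 39.6% fewer calls
max_call_cut = (tog_calls - fuzzy_calls) / tog_calls * 100   # 53.5% fewer calls
beam_acc_gain = beam_acc - tog_acc                           # +33.1 points
max_acc_gain = pog_acc - tog_acc                             # +33.4 points

print(round(beam_call_cut, 1), round(max_call_cut, 1),
      round(beam_acc_gain, 1), round(max_acc_gain, 1))
```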
B.5. Case Study
Case study: graph reduction and path pruning. We conducted a case study using the example question presented in Figure 2 to illustrate the effects of graph pruning and path pruning on the graph structure. Figure 11 (a) shows the results of graph pruning, where vertices in blue are selected as part of the question subgraph and vertices in black are pruned. In this sample, the number of entities is reduced from 16,740 to 1,245, a 92% reduction in vertices. Figures 11 (b) and 11 (c) show the question subgraph induced by the blue vertices in Figure 11 (a) and the results after applying fuzzy and precise path selection. In these figures, vertices in blue represent the entities selected after each pruning step, vertices in yellow represent the topic entities, and the vertex in red denotes the final answer entity. From these graphs, we observe that utilizing the graph structure allows for the rapid pruning of irrelevant vertices. Because every vertex within the question subgraph is connected to all topic entities, the reasoning paths remain faithful and highly relevant to the question, maintaining the integrity and relevance of the reasoning process.
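The pruning rule described above — retain only vertices that connect to every topic entity — can be sketched with a plain breadth-first search. This is a minimal illustration under the assumption that a vertex survives if it lies within the hop limit $D_{\max}$ of all topic entities; the function names are ours, not the authors' implementation:

```python
from collections import deque

def reachable_within(graph, source, d_max):
    """Return all vertices reachable from `source` within `d_max` hops
    (graph is an undirected adjacency dict: vertex -> list of neighbors)."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d == d_max:
            continue
        for u in graph.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append((u, d + 1))
    return seen

def question_subgraph_vertices(graph, topic_entities, d_max=3):
    """Keep only vertices reachable from *every* topic entity within d_max hops."""
    keep = None
    for t in topic_entities:
        r = reachable_within(graph, t, d_max)
        keep = r if keep is None else keep & r
    return keep or set()

# Toy example: vertex "X" is disconnected from the topic entities and is pruned.
g = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "X": []}
print(sorted(question_subgraph_vertices(g, ["A", "D"])))  # ['A', 'B', 'C', 'D']
```

Tightening `d_max` prunes more aggressively: with `d_max=2` only the vertices between the two topic entities survive.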
(a) Graph pruning
(b) Question subgraph
(c) Fuzzy selection
(d) Precise selection
Figure 11. Visualization of graph reduction and path selection.
Case study: interpretable reasoning. In this section, we present Table 9, which illustrates PoG’s interpretability through case studies involving questions with one, two, and three entities. These examples demonstrate PoG’s effectiveness in handling multi-entity and multi-hop tasks by providing clear and understandable reasoning paths that lead to accurate answers.
Table 9. Examples of faithful and interpretable reasoning of PoG.
| Field | Content |
| --- | --- |
| Question | Lou Seal is the mascot for the team that last won the World Series when? |
| Answer | 2014 World Series |
| Topic Entity | Lou Seal |
| LLM indicator | "Lou Seal" - mascot for - team - last won - World Series - when |
| Split Question | split_question 1: What team is "Lou Seal" the mascot for? |
|  | split_question 2: When did the last World Series win occur for the team with Lou Seal as the mascot? |
| Reasoning Path | {Lou Seal} $→$ sports.mascot.team $→$ {San Francisco Giants} $→$ sports.sports_team.championships $→$ {2014 World Series, 2012 World Series, 2010 World Series} |
| PoG answer | answer: {2014 World Series} |
|  | reason: From the given path {Lou Seal} $→$ sports.mascot.team $→$ {San Francisco Giants}, {San Francisco Giants} is the team that "Lou Seal" serves as the mascot for. From {San Francisco Giants} $→$ sports.sports_team.championships $→$ {2014 World Series, 2012 World Series, 2010 World Series}, the last World Series won by the {San Francisco Giants} was in {2014}. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {2014 World Series}. |
| Question | What European Union country sharing borders with Germany contains the Lejre Municipality? |
| Answer | Denmark |
| Topic Entity | {Germany, Lejre Municipality} |
| LLM indicator | "Lejre Municipality" - contained by - European Union country (answer) - shares borders with - "Germany" |
| Split Question | split_question 1: What European Union country shares borders with "Germany"? |
|  | split_question 2: What country contains the Lejre Municipality? |
| Reasoning Path | {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark} $→$ location.location.containedby $→$ {Eurasia} $→$ location.location.contains $→$ {Germany}, |
|  | {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark} $→$ base.locations.countries.continent $→$ {Europe} $→$ location.location.contains $→$ {Western Europe} $→$ location.location.contains $→$ {Germany} |
| PoG answer | answer: {Denmark} |
|  | reason: From {Denmark} $→$ location.location.containedby $→$ {Eurasia} $→$ location.location.contains $→$ {Germany}, Denmark is the European Union country that shares borders with "Germany". From {Lejre Municipality} $→$ location.administrative_division.country $→$ {Denmark}, Denmark is the country that contains the Lejre Municipality. Both paths lead to Denmark, which is the European Union country sharing borders with Germany and containing the Lejre Municipality. Therefore, the provided knowledge graph path is sufficient to answer the main question, and the answer is {Denmark}. |
| Question | What Central Time Zone continent do the Falkland Islands belong to? |
| Answer | Americas |
| Topic Entity | {Falkland Islands, Continent, Central Time Zone} |
| LLM indicator | answer("Central Time Zone") - owns - answer("Continent") - located contain - "Falkland Islands" |
| Split Question | split_question 1: What continent do the Falkland Islands belong to? |
|  | split_question 2: What Central Time Zone does the continent belong to? |
| Reasoning Path | {Central Time Zone} $→$ time.time_zone.locations_in_this_time_zone $→$ {Americas} $→$ common.topic.notable_types $→$ {Continent} $→$ common.topic.notable_types $→$ {Americas} $→$ location.location.contains $→$ {Falkland Islands} |
| PoG answer | answer: {Americas} |
|  | reason: From {Continent} $→$ common.topic.notable_types $→$ {Americas} $→$ location.location.contains $→$ {Falkland Islands}, Americas is the continent the Falkland Islands belong to. From {Central Time Zone} $→$ time.time_zone.locations_in_this_time_zone $→$ {Americas} $→$ common.topic.notable_types $→$ {Continent}, Americas is the continent in the Central Time Zone. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {Americas}. |
Appendix C Experiment Details
Experiment datasets. To evaluate PoG's capability in multi-hop knowledge-intensive reasoning tasks, we assess it on four KBQA datasets: three multi-hop datasets (CWQ (Talmor and Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021)) and one single-hop dataset (SimpleQuestions (Petrochuk and Zettlemoyer, 2018)). Additionally, to examine PoG on more general tasks, we include an open-domain QA dataset, WebQuestions. For the evaluation of large datasets such as CWQ, GrailQA, and SimpleQuestions, we utilize a random sample of 1,000 test cases from CWQ and employ the 1,000 samples previously reported by ToG (Sun et al., 2024), which enables comparison with the SOTA while minimizing computational costs. Freebase serves as the background knowledge graph for all datasets; it encompasses approximately 88 million entities, 20,000 relations, and 126 million triples (Luo et al., 2024; Bollacker et al., 2008). The statistics of the datasets utilized in this study are detailed in Table 10. The source code is publicly available at https://github.com/SteveTANTAN/PoG.
Baselines. Following ToG (Sun et al., 2024), we compare our method with standard prompting (IO), Chain-of-Thought (CoT), and Self-Consistency (SC) prompting with six in-context exemplars and "step-by-step" reasoning chains. For each dataset, we also include previous SOTA works for comparison. For a fair comparison, we compare with the previous SOTA among all prompting-based methods and the previous SOTA among all methods, respectively. Since ToG is the current SOTA prompting-based method, we directly refer to their results and those of other baselines reported in their paper.
Experimental implementation. Leveraging the plug-and-play convenience of our framework, we experiment with two LLMs: GPT-3.5 and GPT-4. We use the OpenAI API to access GPT-3.5 (GPT-3.5-turbo) and GPT-4. Aligning with ToG, we set the temperature parameter to 0.4 during the exploration process (to increase diversity) and to 0 during the reasoning process (to ensure reproducibility). The maximum token length for generation is set to 256. In all experiments, we set both $W_{\max}$ and $D_{\max}$ to 3 for beam search. All the experiments are conducted on a machine with Intel Xeon Gold 6248R CPU, Nvidia A5000 GPU and 512GB memory.
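The generation settings above can be collected in a small helper. This is a sketch of the configuration only (the helper name is ours); the resulting dictionary would be passed to the OpenAI chat completion API:

```python
def llm_call_params(phase, model="gpt-3.5-turbo"):
    """Request parameters mirroring the settings reported in the paper:
    temperature 0.4 during exploration (for diversity), 0 during reasoning
    (for reproducibility), and a 256-token cap on generation."""
    if phase not in ("exploration", "reasoning"):
        raise ValueError(f"unknown phase: {phase!r}")
    return {
        "model": model,
        "temperature": 0.4 if phase == "exploration" else 0.0,
        "max_tokens": 256,
    }

print(llm_call_params("reasoning")["temperature"])  # 0.0
```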
Table 10. Statistics of the datasets used in this paper. † denotes we randomly select 1,000 samples from the CWQ test set to create the experiment testing set due to the abundance of test samples. ∗ denotes that we utilize the 1,000 samples reported by ToG (Sun et al., 2024) to compare with the state-of-the-art.
| Dataset | Answer Format | Test | Train |
| --- | --- | --- | --- |
| ComplexWebQuestions (CWQ) † | Entity | 1,000 | 27,734 |
| WebQSP | Entity/Number | 1,639 | 3,098 |
| GrailQA ∗ | Entity/Number | 1,000 | 44,337 |
| SimpleQuestions ∗ | Entity/Number | 1,000 | 14,894 |
| WebQuestions | Entity/Number | 2,032 | 3,778 |
Appendix D SPARQL
This section outlines the pre-defined SPARQL queries used for interacting with the knowledge graph and constructing graphs for our experiments.
D.1. 1-hop Entity and Relation Search
To facilitate the retrieval of 1-hop neighbors of entities within the Freebase Knowledge Graph, we have predefined a SPARQL query. The query is executed by simply substituting the query entity's ID for the ID placeholder. It returns the connected entities' IDs and their associated relations' IDs, indicating whether the connected entity is at the tail or the head of the relation.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    BIND("head" AS ?direction)
  }
}
```
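To run this query for a concrete entity, the `ns:ID` placeholder is replaced with the entity's Freebase MID; a small sketch (the template constant and function name are ours) produces the query string to hand to a SPARQL client such as SPARQLWrapper:

```python
# Query template mirroring the 1-hop search above; {mid} stands for the ns:ID slot.
ONE_HOP_TEMPLATE = """PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?direction
WHERE {{
  {{ ns:{mid} ?relation ?connectedEntity . BIND("tail" AS ?direction) }}
  UNION
  {{ ?connectedEntity ?relation ns:{mid} . BIND("head" AS ?direction) }}
}}"""

def one_hop_query(mid):
    """Fill the ns:ID placeholder with a concrete Freebase MID, e.g. 'm.0d05w3'."""
    return ONE_HOP_TEMPLATE.format(mid=mid)
```

The returned string can then be submitted with SPARQLWrapper's `setQuery` and `query().convert()` against a Freebase endpoint.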
D.2. Short Textual Description
The following predefined function implements the retrieval of short textual descriptions, $\mathcal{D}(.)$ , for converting the identifiers (IDs) of entities or relations into natural language descriptions.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?tailEntity
WHERE {
  {
    ?entity ns:type.object.name ?tailEntity .
    FILTER (?entity = ns:ID)
  }
  UNION
  {
    ?entity <http://www.w3.org/2002/07/owl#sameAs> ?tailEntity .
    FILTER (?entity = ns:ID)
  }
}
```
D.3. 1-hop Subgraph Search
To facilitate subgraph detection in Section 4.1, we implement the 1-hop subgraph detection feature by integrating the SPARQL functions described in Appendices D.1 and D.2. This function retrieves, in a single SPARQL query, the 1-hop neighbors of a given query entity, along with their IDs, natural language names, and connecting relations, specifying whether the connected entity is at the tail or the head of the relation.
```sparql
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?relation ?connectedEntity ?connectedEntityName ?direction
WHERE {
  {
    ns:ID ?relation ?connectedEntity .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = 'en')
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("tail" AS ?direction)
  }
  UNION
  {
    ?connectedEntity ?relation ns:ID .
    OPTIONAL {
      ?connectedEntity ns:type.object.name ?name .
      FILTER (lang(?name) = 'en')
    }
    BIND(COALESCE(?name, "Unnamed Entity") AS ?connectedEntityName)
    BIND("head" AS ?direction)
  }
}
```
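The bindings returned by this query (in the standard SPARQL JSON results format) can be folded into directed triples; a minimal sketch, where the function name and the orientation convention reflect our reading of the `?direction` flag:

```python
def parse_one_hop(center, bindings):
    """Convert SPARQL JSON bindings into (head, relation, tail, tail_name)
    triples. direction == 'tail' means the connected entity is the tail of
    the relation (the center entity is the head); 'head' means the opposite."""
    triples = []
    for b in bindings:
        rel = b["relation"]["value"]
        other = b["connectedEntity"]["value"]
        # COALESCE in the query supplies "Unnamed Entity"; default again here
        # in case the binding is absent.
        name = b.get("connectedEntityName", {}).get("value", "Unnamed Entity")
        if b["direction"]["value"] == "tail":
            triples.append((center, rel, other, name))
        else:
            triples.append((other, rel, center, name))
    return triples
```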
Appendix E Prompts
In this section, we detail the prompts required for our main experimental procedures.
Question Analysis Prompt Template
You will receive a multi-hop question, which is composed of several interconnected queries, along with a list of topic entities that serve as the main keywords for the question. Your task is to break the question into simpler parts, using each topic entity once, and to provide a Chain of Thought (CoT) that shows how the topic entities are related. Note: each simpler question should explore how one topic entity connects to the others or to the answer. The goal is to systematically address each entity to derive the final answer. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} A:
LLM Supplement Prompt Template
Using the main question, a possibly uncertain chain of thought generated by a language model, some related split questions, paths from the "Related_paths" section, and the main topic entities: please first provide three predicted results, and second, offer three possible chains of thought that could lead to these results, using the provided knowledge paths and your own knowledge. If any answers are unclear, suggest alternative answers to fill the gaps in the chains of thought, following the same format as the provided examples. In-Context Few-shot Q: {Query} Topic Entity: {Topic Entity} Think Indicator:{Think Indicator} Split Question:{Split Question} A:
where {Think Indicator} and {Split Question} are obtained in Section 4.1. An indicator example is shown in Figure 2.
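Instantiating any of these templates is a plain string substitution over the braced placeholders; a minimal sketch (the trimmed template and helper name are illustrative, not the exact prompt text):

```python
def fill_prompt(template, **slots):
    """Substitute {Placeholder}-style slots in a prompt template."""
    return template.format(**slots)

# Trimmed stand-in for the Question Analysis template above.
template = "Q: {Query}\nTopic Entity: {Topic_Entity}\nA:"
prompt = fill_prompt(
    template,
    Query="Lou Seal is the mascot for the team that last won the World Series when?",
    Topic_Entity="Lou Seal",
)
print(prompt.splitlines()[1])  # Topic Entity: Lou Seal
```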
Precise Path Select Prompt Template
Given a main question, an LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: please score the candidate paths and return the top three lists from the candidate path set that are most likely to answer the question. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Candidate Paths:{Candidate Paths} A:
{Candidate Paths} denotes the retrieved reasoning paths $Final\_Paths$ to be selected in this request, which are formatted as a series of structural sentences:

$\{e_{0x},...,e_{0z}\}→ r_{1_{i}}→\{e_{1x},...,e_{1z}\}→...→ r_{l_{j}}→\{e_{lx},...,e_{lz}\}$

$\vdots$

$\{e_{0x},...,e_{0z}\}→ r_{1_{i}}→\{e_{1x},...,e_{1z}\}→...→ r_{l_{j}}→\{e_{lx},...,e_{lz}\},$

where $i$ and $j$ in $r_{1_{i}}$, $r_{l_{j}}$ denote the $i$-th and $j$-th relation of each relation edge in the clustered question subgraph, and each $e$ is represented by its ID together with its natural language name $\mathcal{D}(ID)$.
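Serializing a candidate path into this structural-sentence format is straightforward; a simplified sketch (plain `->` in place of arrows, entity names in place of ID/description pairs):

```python
def format_path(hops):
    """Serialize a reasoning path as alternating entity sets and relations.
    `hops` is a list of (entity_set, relation) pairs followed by a final
    entity set, e.g. [([...], r1), ([...], r2), [...final entities...]]."""
    *steps, last = hops
    parts = []
    for entities, relation in steps:
        parts.append("{" + ", ".join(entities) + "}")
        parts.append(relation)
    parts.append("{" + ", ".join(last) + "}")
    return " -> ".join(parts)

path = [(["Lou Seal"], "sports.mascot.team"),
        (["San Francisco Giants"], "sports.sports_team.championships"),
        ["2014 World Series", "2012 World Series"]]
print(format_path(path))
```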
Path Summarizing Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, the associated accurately retrieved knowledge paths from the Related paths section, and the main topic entities: your task is to summarize the provided knowledge triples in the Related paths section and generate a chain of thought from the knowledge triples related to the main topic entities of the question, which will be used for generating the answer to the main question and the split questions. You have to make sure you summarize correctly using the provided knowledge triples; you can only use entities with ids from the given paths, and you cannot skip steps. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A:
{Related_Paths} has the same format as the {Candidate_Paths} above.
Question Answering Evaluation Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: your task is to determine whether this knowledge graph path is sufficient to answer the given split question first, and then the main question. If it is sufficient, respond {Yes} and provide the answer to the main question; if the answer is obtained from the given knowledge path, it should be the entity name from the path. Otherwise, respond {No} and explain the reason. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A:
Question Answering Generation Prompt Template
Given a main question, an uncertain LLM-generated thinking CoT that considers all the entities, a few split questions that you can use one by one to finally obtain the final answer, and the associated retrieved knowledge graph paths, {set of entities (with id starting with "m.")} -> {set of relationships} -> {set of entities (with id starting with "m.")}: your task is to generate the answer based on the given knowledge graph paths and your own knowledge. In-Context Few-shot Q: {Query} Think Indicator:{Think Indicator} Split Question:{Split Question} Related Paths:{Related Paths} A: