2602.05143
# HugRAG: Hierarchical Causal Knowledge Graph Design for RAG
**Authors**: Nengbo Wang, Tuo Liang, Vikash Singh, Chaoda Song, Van Yang, Yu Yin, Jing Ma, Jagdip Singh, Vipin Chaudhary
Abstract
Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graph-based RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.
Machine Learning, ICML
<details>
<summary>x1.png Details</summary>

### Visual Description
## RAG System Comparison Diagram
### Overview
The image presents a comparison of three different Retrieval-Augmented Generation (RAG) systems: Standard RAG, Graph-based RAG, and HugRAG. It illustrates how each system handles the query "Why did citywide commute delays surge right after the blackout?" by showing the flow of information and the connections between different concepts. The diagram highlights the strengths and weaknesses of each approach in terms of context retrieval and causal reasoning.
### Components/Axes
* **Title:** Query: Why did citywide commute delays surge right after the blackout?
Answer: Blackout knocked out signal controllers, intersections went flashing, gridlock spread.
* **RAG Systems:**
* Standard RAG
* Graph-based RAG
* HugRAG
* **Nodes:** Represent concepts or entities.
* **Edges:** Represent relationships between concepts.
* **Modules:** M1: Power Outage, M2: Signal Control, M3: Road Outcomes
* **Legend:** Located at the bottom of the image.
* Knowledge Graph (gray dotted line)
* Seed Node (blue circle)
* N-hop Nodes / Spurious Nodes (yellow circle)
* Graph Modules (light gray shaded area)
* Causal Gate (blue icon of a gate)
* Causal Path (blue arrow)
### Detailed Analysis
**1. Standard RAG (Left)**
* **Description:** This system retrieves text snippets based on keyword matching.
* **Text Snippets:**
* "Substation fault caused a citywide blackout" (highlighted in green)
* "Stop and go backups and gridlock across major corridors"
* "Signal controller network lost power. Many junctions went flashing." (marked as "Missed (No keyword match)")
* **Analysis:** The system successfully retrieves information about the blackout but misses the connection between signal controller failure and traffic delays.
* **Semantic Search:** "X Semantic search misses key context"
**2. Graph-based RAG (Center)**
* **Description:** This system uses a knowledge graph to represent relationships between concepts.
* **Modules:**
* M1: Power Outage (Power restored, Substation fault, Blackout)
* M2: Signal Control (Controllers down, Flashing mode)
* M3: Road Outcomes (Traffic Delays, Unmanaged junctions, Gridlock)
* **Nodes:**
* Blackout (marked with an "X")
* Controllers down
* Flashing mode
* Gridlock
* Unmanaged junctions
* Traffic Delays
* Power restored
* Substation fault
* **Analysis:** The system represents the relationships between power outage, signal control, and road outcomes. However, it struggles to identify the causal path.
* **Community Breaking:** "? Hard to break communities / intrinsic modularity"
**3. HugRAG (Right)**
* **Description:** This system combines knowledge graph representation with causal reasoning.
* **Modules:**
* M1: Power Outage (Power restored, Substation fault, Blackout)
* M2: Signal Control (Controllers down, Flashing mode)
* M3: Road Outcomes (Traffic Delays, Unmanaged junctions, Gridlock)
* **Nodes:**
* Blackout
* Controllers down
* Flashing mode
* Gridlock
* Unmanaged junctions
* Traffic Delays
* Power restored
* Substation fault
* **Causal Path:** A blue arrow indicates the causal path from "Blackout" to "Controllers down" to "Flashing mode" to "Gridlock" to "Traffic Delays".
* **Causal Gate:** A blue gate icon is present between "Blackout" and "Controllers down".
* **Analysis:** The system successfully identifies the causal path from the blackout to traffic delays through signal controller failure.
* **Information Isolation:** "✓ Break information isolation & Identify causal path"
### Key Observations
* Standard RAG relies on keyword matching and misses contextual information.
* Graph-based RAG represents relationships but struggles with causal reasoning.
* HugRAG combines graph representation with causal reasoning to identify the causal path.
### Interpretation
The diagram demonstrates the evolution of RAG systems from simple keyword-based retrieval to more sophisticated approaches that incorporate knowledge graphs and causal reasoning. HugRAG appears to be the most effective system, as it successfully identifies the causal path between the blackout and traffic delays. This suggests that incorporating causal reasoning into RAG systems can improve their ability to answer complex questions and provide more informative responses. The "Causal Gate" likely represents a point where causal inference is applied to determine the most likely cause-and-effect relationship. The diagram highlights the importance of context and causal reasoning in information retrieval and generation.
</details>
Figure 1: Comparison of three retrieval paradigms, Standard RAG, Graph-based RAG, and HugRAG, on a citywide blackout query. Standard RAG misses key evidence under semantic retrieval. Graph-based RAG can be trapped by intrinsic modularity or grouping structure. HugRAG leverages hierarchical causal gates to bridge modular boundaries, effectively breaking information isolation and explicitly identifying the underlying causal path.
1 Introduction
While Retrieval-Augmented Generation (RAG) effectively extends Large Language Models (LLMs) with external knowledge (Lewis et al., 2021), traditional pipelines predominantly rely on text chunking and semantic embedding search. This paradigm implicitly frames knowledge access as a flat similarity matching problem, overlooking the structured and interdependent nature of real-world concepts. Consequently, as knowledge bases scale in complexity, these methods struggle to maintain retrieval efficiency and reasoning fidelity.
Graph-based RAG has emerged as a promising solution to address these gaps, led by frameworks like GraphRAG (Edge et al., 2024) and extended through agentic search (Ravuru et al., 2024), GNN-guided refinement (Liu et al., 2025b), and hypergraph representations (Luo et al., ). However, three limitations persist. First, current research prioritizes retrieval policies while overlooking knowledge graph organization. As graphs scale, intrinsic modularity (Fortunato and Barthélemy, 2007) often restricts exploration within dense modules, triggering information isolation. Common grouping strategies, ranging from communities (Edge et al., 2024), passage nodes (Gutiérrez et al., 2025), and node-edge sets (Guo et al., 2024) to semantic grouping (Zhang et al., 2025), often inadvertently reinforce these boundaries, severely limiting global recall. Second, most formulations rely on semantic proximity and superficial graph traversal without causal awareness, leading to a locality issue where spurious nodes and irrelevant noise degrade precision (see Figure 1). Despite the inherent causal discovery potential of LLMs, this capability remains largely untapped for filtering noise within RAG pipelines. Finally, these systemic flaws are often masked by evaluation on popular QA datasets, which rewards entity-level "hits" over holistic comprehension. Consequently, there is a pressing need for a retrieval framework that reconciles global knowledge accessibility with local reasoning precision to support robust, causally grounded generation.
To address these challenges, we propose HugRAG, a framework that rethinks knowledge graph organization through hierarchical causal gate structures. HugRAG formulates the knowledge graph as a multi-layered representation where fine-grained facts are organized into higher-level schemas, enabling multi-granular reasoning. This hierarchical architecture, integrated with causal gates, establishes logical bridges across modules, thereby naturally breaking information isolation and enhancing global recall. During retrieval, HugRAG transcends pointwise semantic matching to explicit reasoning over causal graphs. By actively distinguishing genuine causal dependencies from spurious associations, HugRAG mitigates the locality issue and filters retrieval noise to ensure precise, grounded, and interpretable generation.
To validate the effectiveness of HugRAG, we conduct extensive evaluations across datasets in multiple domains, comparing it against a diverse suite of competitive RAG baselines. To address the previously identified limitations of existing QA datasets, we introduce a large-scale cross-domain dataset HolisQA focused on holistic comprehension, designed to evaluate reasoning capabilities in complex, real-world scenarios. Our results consistently demonstrate that causal gating and causal reasoning effectively reconcile the trade-off between recall and precision, significantly enhancing retrieval quality and answer reliability.
| Method | Knowledge Graph Organization | Retrieval and Generation Process |
| --- | --- | --- |
| Standard RAG (Lewis et al., 2021) | Flat text chunks, unstructured. $\mathcal{G}_{\text{idx}}=\{d_{i}\}_{i=1}^{N}$ | Semantic vector search over chunks. $S=\mathrm{TopK}(\text{sim}(q,d_{i}));\;\;y=\mathsf{G}(q,S)$ |
| Graph RAG (Edge et al., 2024) | Partitioned communities with summaries. $\mathcal{G}_{\text{idx}}=\{\text{Sum}(c)\mid c\in\mathcal{C}\}$ | Map-Reduce over community summaries. $A_{\text{part}}=\{\mathsf{G}(q,\text{Sum}(m))\};\;\;y=\mathsf{G}(A_{\text{part}})$ |
| Light RAG (Guo et al., 2024) | Dual-level indexing (Entities + Relations). $\mathcal{G}_{\text{idx}}=(V_{\text{ent}}\cup V_{\text{rel}},E)$ | Keyword-based vector retrieval + neighbor. $K_{q}=\mathsf{Key}(q);\;\;S=\mathrm{Vec}(K_{q},\mathcal{G}_{\text{idx}})\cup\mathcal{N}_{1}$ |
| HippoRAG 2 (Gutiérrez et al., 2025) | Dense-sparse integration (Phrase + Passage). $\mathcal{G}_{\text{idx}}=(V_{\text{phrase}}\cup V_{\text{doc}},E)$ | PPR diffusion from LLM-filtered seeds. $U_{\text{seed}}=\mathsf{Filter}(q,V);\;\;S=\mathsf{PPR}(U_{\text{seed}},\mathcal{G}_{\text{idx}})$ |
| LeanRAG (Zhang et al., 2025) | Hierarchical semantic clusters (GMM). $\mathcal{G}_{\text{idx}}=\text{Tree}(\text{Semantic Aggregation})$ | Bottom-up traversal to LCA (Ancestor). $U=\mathrm{TopK}(q,V);\;\;S=\mathsf{LCA}(U,\mathcal{G}_{\text{idx}})$ |
| CausalRAG (Wang et al., 2025a) | Flat graph structure. $\mathcal{G}_{\text{idx}}=(V,E)$ | Top-K retrieval + Implicit causal reasoning. $S=\mathsf{Expand}(\mathrm{TopK}(q,V));\;\;y=\mathsf{G}(q,S)$ |
| HugRAG (Ours) | Hierarchical Causal Gates across modules. $\mathcal{G}_{\text{idx}}=\mathcal{H}=\{H_{0},...,H_{L}\}$ | Causal Gating + Causal Path Filtering. $S=\underbrace{\mathsf{Traverse}(q,\mathcal{H})}_{\text{Break Isolation}}\cap\underbrace{\mathsf{Filter}_{\text{causal}}(S)}_{\text{Reduce Noise}}$ |
Table 1: Comparison of RAG frameworks based on knowledge organization and retrieval mechanisms. Notation: $\mathcal{M}$ modules, $\text{Sum}(·)$ summary, $\mathsf{PPR}$ Personalized PageRank, $\mathcal{H}$ hierarchy, $\mathcal{N}_{1}$ 1-hop neighborhood.
2 Related Work
2.1 RAG
Retrieval augmented generation grounds LLMs in external knowledge, but chunk level semantic search can be brittle and inefficient for large, heterogeneous, or structured corpora (Lewis et al., 2021). Graph-based RAG has therefore emerged to introduce structure for more informed retrieval.
Graph-based RAG.
GraphRAG constructs a graph-structured index of external knowledge and performs query-time retrieval over the graph, improving question-focused access to large-scale corpora (Edge et al., 2024). Building on this paradigm, later work studies richer selection mechanisms over structured graphs. Agent-driven retrieval explores the search space iteratively (Ravuru et al., 2024). Critic-guided or winnowing-style methods prune weak contexts after retrieval (Dong et al., ; Wang et al., 2025b). Others learn relevance scores for nodes, subgraphs, or reasoning paths, often with graph neural networks (Liu et al., 2025b). Representation extensions include hypergraphs for higher-order relations (Luo et al., ) and graph foundation models for retrieval and reranking (Wang et al., ).
Knowledge Graph Organization.
Despite these advances, limitations related to graph organization remain underexamined. Most work emphasizes retrieval policies, while the organization of the underlying knowledge graph, which strongly influences downstream retrieval behavior, is largely overlooked. As graphs scale, intrinsic modularity can emerge (Fortunato and Barthélemy, 2007; Newman, 2018), making retrieval prone to staying within dense modules rather than crossing them, sharply limiting the retrieved information. Moreover, many works group knowledge for efficiency at scale, using communities (Edge et al., 2024), phrases and passages (Gutiérrez et al., 2025), node-edge sets (Guo et al., 2024), or semantic aggregation (Zhang et al., 2025) (see Table 1), which can amplify modular confinement and yield information isolation. This global issue primarily manifests as reduced recall. Some hierarchical approaches like LeanRAG attempt to bridge these gaps via semantic aggregation, but they remain constrained by semantic clustering and rely on tree-structured traversals (Zhang et al., 2025), often failing to capture logical dependencies that span semantically distinct clusters.
Retrieval Issue.
A second limitation concerns how retrieval is formulated. Much work operates as a multi-hop search over nodes or subgraphs (Gutiérrez et al., 2025; Liu et al., 2025a), prioritizing semantic proximity to the query without explicit awareness of the reasoning in this searching process. This design can pull in topically similar yet causally irrelevant evidence, producing conflated retrieval results. Even when the correct fact node is present, the generator may respond with generic or superficial content, and the extra noise can increase the risk of hallucination. We view this as a locality issue that lowers precision.
QA Evaluation Issue.
These tendencies can be reinforced by common QA evaluation practice. First, many QA datasets emphasize short answers such as names, nationalities, or years (Kwiatkowski et al., 2019; Rajpurkar et al., 2016), so hitting the correct entity in the graph may be sufficient even without reasoning. Second, QA datasets often comprise thousands of independent question-answer-context triples. However, many approaches still rely on linear context concatenation to construct a graph, and then evaluate performance on isolated questions. This setup largely removes the incentive for holistic comprehension of the underlying material, even though such end-to-end understanding is closer to real-world use cases. Third, some datasets are stale enough that answers may be partially memorized by pretrained LLMs, confounding retrieval quality with parametric knowledge. These QA dataset issues are therefore critical for evaluating RAG, yet relatively few works explicitly address them by adopting open-ended questions and fresher materials in controlled experiments.
2.2 Causality
LLM for Identifying Causality.
LLMs have demonstrated exceptional potential in causal discovery. By leveraging vast domain knowledge, LLMs significantly improve inference accuracy compared to traditional methods (Ma, 2024). Frameworks like CARE further prove that fine-tuned LLMs can outperform state-of-the-art algorithms (Dong et al., 2025). Crucially, even in complex texts, LLMs maintain a direction reversal rate under 1.1% (Saklad et al., 2026), ensuring highly reliable results.
Causality and RAG.
While LLMs increasingly demonstrate reliable causal reasoning capabilities, explicitly integrating causal structures into RAG remains largely underexplored. Current research predominantly focuses on internal attribution graphs for model interpretability (Walker and Ewetz, 2025; Dai et al., 2025), rather than external knowledge retrieval. Recent advances like CGMT (Luo et al., 2025) and LACR (Zhang et al., 2024) have begun to bridge this gap, utilizing causal graphs for medical reasoning path alignment or constraint-based structure induction. However, these works inherently differ in scope from our objective, as they prioritize rigorous causal discovery or recovery tasks in specific domain, which limits their scalability to the noisy, open-domain environments that we address. Existing causal-enhanced RAG frameworks either utilize causal feedback implicitly in embedding (Khatibi et al., 2025) or, like CausalRAG (Wang et al., 2025a), are restricted to small-scale settings with implicit causal reasoning. Consequently, a significant gap persists in leveraging causal graphs to guide knowledge graph organization and retrieval across large-scale, heterogeneous knowledge bases. Note that in this work, we use the term causal to denote explicit logical dependencies and event sequences described in the text, rather than statistical causal discovery from observational data.
3 Problem Formulation
We aim to retrieve an optimal subgraph $S^{*}\subseteq\mathcal{G}$ for a query $q$ to generate an answer $y$. Graph-based RAG ($S=\mathcal{R}(q,\mathcal{G})$) typically faces two structural bottlenecks.
1. Global Information Isolation (Recall Gap).
Intrinsic modularity often traps retrieval in local seeds, missing relevant evidence $v^{*}$ located in topologically distant modules (i.e., $S\cap\{v^{*}\}=\emptyset$ as no path exists within $h$ hops). HugRAG introduces causal gates across $\mathcal{H}$ , to bypass modular boundaries and bridge this gap. The efficacy of causal gates is empirically verified in Appendix E and further analyzed in the ablation study (see Section 5.3).
2. Local Spurious Noise (Precision Gap).
Semantic similarity $\text{sim}(q,v)$ often retrieves topically related but causally irrelevant nodes $\mathcal{V}_{sp}$, diluting precision (where $|S\cap\mathcal{V}_{sp}|\gg|S\cap\mathcal{V}_{causal}|$). We address this by leveraging LLMs to identify explicit causal paths, filtering $\mathcal{V}_{sp}$ to ensure groundedness. While, as discussed, LLMs have demonstrated causal identification capabilities surpassing human experts (Ma, 2024; Dong et al., 2025) and proven effectiveness in RAG (Wang et al., 2025a), we further corroborate the validity of identified causal paths through expert knowledge across different domains (see Section 5.1). Consequently, HugRAG redefines retrieval as finding a mapping $\Phi:\mathcal{G}\to\mathcal{H}$ and a causal filter $\mathcal{F}_{c}$ that simultaneously minimize isolation and spurious noise.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Causal Reasoning Framework
### Overview
The image presents a diagram illustrating a causal reasoning framework, divided into two main phases: "Graph Construction (Offline)" and "Retrieve and Answer (Online)". The diagram outlines the process of building a knowledge graph from raw texts, identifying causal relationships, and then using this graph to answer queries by distinguishing between causal and spurious connections.
### Components/Axes
**Left Side: Graph Construction (Offline)**
* **Raw Texts:** Represented by an icon of stacked documents.
* **IE (Information Extraction):** A gray label with an arrow pointing from "Raw Texts" to "Knowledge Graph".
* **Knowledge Graph:** Represented by a network icon.
* **Vector Store:** Represented by a database icon.
* **Embed:** A gray label with an arrow pointing from "Knowledge Graph" to "Vector Store".
* **Partition:** A gray label with an arrow pointing from "Raw Texts" to "Hierarchical Graph".
* **Hierarchical Graph:** A multi-layered graph structure.
* **Identify Causality:** A blue label with a person icon, pointing from "Hierarchical Graph" to "Graph with Causal Gates". Labeled "LLM" in blue below.
* **Graph with Causal Gates:** A multi-layered graph structure with some highlighted (blue) connections.
* **Embed:** An arrow pointing from "Graph with Causal Gates" to "Vector Store".
* **Hn, Hn-1, H0:** Labels indicating different layers of the hierarchical graphs.
**Right Side: Retrieve and Answer (Online)**
* **Query:** Represented by a person icon.
* **Embed and Score:** A gray label with an arrow pointing from "Query" to "Top K entities".
* **Top K entities:** Represented by a magnifying glass icon.
* **Answer:** Represented by a checkmark icon.
* **N hop via gates, cross modules:** A black curved arrow pointing from "Query" to "Context Subgraph".
* **Context Subgraph:** A multi-layered graph structure with some highlighted (blue) connections.
* **Distinguish Causal vs Spurious:** A blue label with a robot icon, pointing from "Context Subgraph" to "Context". Labeled "LLM" in blue below.
* **Context:** A multi-layered graph structure with a highlighted (blue) path leading to the "Answer".
* **with Query:** An arrow pointing from "Context" to "Answer".
### Detailed Analysis
**Graph Construction (Offline):**
1. **Raw Texts** are processed using **Information Extraction (IE)** to create a **Knowledge Graph**.
2. The **Knowledge Graph** is embedded into a **Vector Store**.
3. The **Raw Texts** are also partitioned into a **Hierarchical Graph**.
4. **Causality** is identified within the **Hierarchical Graph** using an **LLM (Large Language Model)**, resulting in a **Graph with Causal Gates**.
5. The **Graph with Causal Gates** is embedded into the **Vector Store**.
**Retrieve and Answer (Online):**
1. A **Query** is embedded and scored to identify **Top K entities**.
2. A **Context Subgraph** is extracted from the **Knowledge Graph** based on the **Query**.
3. **Causal** vs. **Spurious** connections are distinguished within the **Context Subgraph** using an **LLM**, resulting in a refined **Context**.
4. The **Context** is used to generate an **Answer** to the **Query**.
### Key Observations
* The diagram highlights the use of Large Language Models (LLMs) in both the offline graph construction and online query answering phases.
* The hierarchical graph structure is used in both phases, suggesting a multi-level representation of knowledge and context.
* The distinction between causal and spurious connections is a key aspect of the framework, ensuring accurate and reliable answers.
### Interpretation
The diagram illustrates a comprehensive framework for causal reasoning, leveraging knowledge graphs and LLMs. The offline phase focuses on building a structured representation of knowledge, while the online phase focuses on retrieving and reasoning about relevant information to answer queries. The use of LLMs to identify causality and distinguish between causal and spurious connections is crucial for ensuring the accuracy and reliability of the answers. The hierarchical graph structure allows for a multi-level representation of knowledge, enabling more nuanced and context-aware reasoning.
</details>
Figure 2: Overview of the HugRAG pipeline. In the offline stage, raw texts are embedded to build a knowledge graph and a vector store, then partitioning forms a hierarchical graph and an LLM identifies causal relations to construct a graph with causal gates. In the online stage, the query is embedded and scored to retrieve top K entities, then N hop traversal uses causal gates to cross modules and assemble a context subgraph; an LLM further distinguishes causal versus spurious relations to produce the final context and answer.
Algorithm 1 HugRAG Algorithm Pipeline
0: Corpus $\mathcal{D}$ , query $q$ , hierarchy levels $L$ , seed budget $\{K_{\ell}\}_{\ell=0}^{L}$ , hop $h$ , gate threshold $\tau$
0: Answer $y$ , Support Subgraph $S^{*}$
1: // Phase 1: Offline Hierarchical Organization
2: $G_{0}=(V_{0},E_{0})\leftarrow\textsc{BuildBaseGraph}(\mathcal{D})$
3: $\mathcal{H}=\{H_{0},...,H_{L}\}\leftarrow\textsc{LeidenPartition}(G_{0},L)$ {Organize into modules $\mathcal{M}$ }
4: $\mathcal{G}_{c}\leftarrow\emptyset$
5: for all pair $(m_{i},m_{j})\in\textsc{ModulePairs}(\mathcal{M})$ do
6: $score\leftarrow\textsc{LLM-EstCausal}(m_{i},m_{j})$
7: if $score\geq\tau$ then
8: $\mathcal{G}_{c}\leftarrow\mathcal{G}_{c}\cup\{(m_{i}\to m_{j},score)\}$ {Establish causal gates}
9: end if
10: end for
11: // Phase 2: Online Retrieval & Reasoning
12: $U\leftarrow\bigcup_{\ell=0}^{L}\mathrm{TopK}(\text{sim}(q,u),K_{\ell},H_{\ell})$ {Multi-level semantic seeding}
13: $S_{raw}\leftarrow\textsc{GatedTraversal}(U,\mathcal{H},\mathcal{G}_{c},h)$ {Break isolation via gates}
14: $S^{*}\leftarrow\textsc{CausalFilter}(q,S_{raw})$ {Remove spurious nodes $\mathcal{V}_{sp}$ }
15: $y\leftarrow\textsc{LLM-Generate}(q,S^{*})$
4 Method
Overview.
As illustrated in Figure 2, HugRAG operates in two distinct phases to address the aforementioned structural bottlenecks. In the offline phase, we construct a hierarchical knowledge structure $\mathcal{H}$ partitioned into modules, which are then interconnected via causal gates $\mathcal{G}_{c}$ to enable logical traversals. In the online phase, HugRAG performs a gated expansion to break modular isolation, followed by a causal filtering step to eliminate spurious noise. The overall procedure is formalized in Algorithm 1, and we detail each component in the subsequent sections.
4.1 Hierarchical Graph with Causal Gating
To address the global information isolation challenge (Section 3), we construct a multi-scale knowledge structure that balances global retrieval recall with local precision.
Hierarchical Module Construction.
We first extract a base entity graph $G_{0}=(V_{0},E_{0})$ from the corpus $\mathcal{D}$ using an information extraction pipeline (see details in Appendix B.1), followed by entity canonicalization to resolve aliasing. To establish the hierarchical backbone $\mathcal{H}=\{H_{0},...,H_{L}\}$ , we iteratively partition the graph into modules using the Leiden algorithm (Traag et al., 2019), which optimizes modularity to identify tightly-coupled semantic regions. Formally, at each level $\ell$ , nodes are partitioned into modules $\mathcal{M}_{\ell}=\{m_{1}^{(\ell)},...,m_{k}^{(\ell)}\}$ . For each module, we generate a natural language summary to serve as a coarse-grained semantic anchor.
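The level-by-level partitioning can be sketched as follows. This is a minimal illustration, not the paper's implementation: NetworkX's Louvain communities stand in for the Leiden algorithm, each module is contracted into a super-node to form the next level, and the function name `build_hierarchy` is ours.

```python
import networkx as nx

def build_hierarchy(G0, levels=2, seed=0):
    """Sketch of the hierarchical backbone H = {H_0, ..., H_L}.
    Each level partitions the current graph into modules and contracts
    every module into a single coarse node for the level above."""
    hierarchy = [G0]  # H_0: the base entity graph
    G = G0
    for _ in range(levels):
        # Modularity-based partition (Louvain as a stand-in for Leiden).
        modules = nx.community.louvain_communities(G, seed=seed)
        # Contract each module into a super-node; inter-module edges survive.
        G = nx.quotient_graph(G, modules)
        hierarchy.append(G)
    return hierarchy
```

In the full system each super-node would additionally carry an LLM-generated module summary serving as its coarse-grained semantic anchor.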
Offline Causal Gating.
While hierarchical modularity improves efficiency, it risks trapping retrieval within local boundaries. We introduce Causal Gates to explicitly model cross-module affordances. Instead of fully connecting the graph, we construct a sparse gate set $\mathcal{G}_{c}$ . Specifically, we identify candidate module pairs $(m_{i},m_{j})$ that are topologically distant but potentially logically related. An LLM then evaluates the plausibility of a causal connection between their summaries. We formally define the gate set via an indicator function $\mathbb{I}(·)$ :
$$
\mathcal{G}_{c}=\left\{(m_{i}\to m_{j})\mid\mathbb{I}_{\text{causal}}(m_{i},m_{j})=1\right\}, \tag{1}
$$
where $\mathbb{I}_{\text{causal}}$ denotes the LLM's assessment (see Appendix B.1 for construction prompts and the Top-Down Hierarchical Pruning strategy we employ to mitigate the $O(N^{2})$ evaluation complexity). These gates act as shortcuts in the retrieval space, permitting the traversal to jump across disjoint modules only when logically warranted, thereby breaking information isolation without causing semantic drift (see Appendix C for visualizations of hierarchical modules and causal gates).
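A minimal sketch of the gate-construction loop in Eq. (1): score candidate module pairs with a pluggable judge and keep those above the threshold $\tau$. Here `judge` is a hypothetical callable returning a causal-plausibility score in [0, 1]; in the full system it would wrap the LLM prompt, and the exhaustive pair loop would be replaced by the paper's hierarchical pruning.

```python
from itertools import combinations

def build_causal_gates(module_summaries, judge, tau=0.5):
    """Construct the sparse gate set G_c over module summaries.
    `module_summaries` maps a module id to its natural-language summary;
    `judge(s_i, s_j)` is an assumed stand-in for the LLM assessment."""
    gates = set()
    for mi, mj in combinations(module_summaries, 2):
        score = judge(module_summaries[mi], module_summaries[mj])
        if score >= tau:
            gates.add((mi, mj))  # gate m_i -> m_j (direction from the judge)
    return gates
```

For testing, any deterministic scorer can play the judge's role, e.g. a crude word-overlap heuristic.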
4.2 Retrieve Subgraph via Causally Gated Expansion
Given the hierarchical structure $\mathcal{H}$ and causal gates $\mathcal{G}_{c}$ , HugRAG retrieves a support subgraph $S$ by coupling multi-granular anchoring with a topology-aware expansion. This process is designed to maximize recall (breaking isolation) while suppressing drift (controlled locality).
Multi-Granular Hybrid Seeding.
Graph-based RAG often struggles to effectively differentiate between local details and global contexts within multi-level structures (Zhang et al., 2025; Edge et al., 2024). We overcome this by identifying a seed set $U$ across multiple levels of the hierarchy. We employ a hybrid scoring function $s(q,v)$ that interpolates between semantic embedding similarity and lexical overlap (details in Appendix B.2). This function is applied simultaneously to fine-grained entities in $H_{0}$ and coarse-grained module summaries in $H_{\ell>0}$ . Crucially, to prevent the semantic redundancy problem where seeds cluster in a single redundant neighborhood, we apply a diversity-aware selection strategy (MMR) to ensure the initial seeds $U$ cover distinct semantic facets of the query. This yields a set of anchors that serve as the starting nodes for expansion.
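The diversity-aware selection step can be sketched with a standard Maximal Marginal Relevance loop over candidate embeddings. This is generic MMR, not the paper's exact scoring function: `lam` trades query relevance against redundancy with already-chosen seeds, and the cosine-similarity scoring is an assumption.

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k=3, lam=0.7):
    """Pick k diverse seeds: at each step choose the candidate maximizing
    lam * rel(query, cand) - (1 - lam) * max similarity to chosen seeds."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            rel = cos(query_vec, cand_vecs[i])
            red = max((cos(cand_vecs[i], cand_vecs[j]) for j in selected),
                      default=0.0)  # redundancy w.r.t. already-picked seeds
            score = lam * rel - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, a near-duplicate of an already-selected seed loses to a less relevant but novel candidate, which is exactly the redundancy-avoidance behavior the seeding stage needs.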
Gated Priority Expansion.
Starting from the seed set $U$ , we model retrieval as a priority-based traversal over a unified edge space $\mathcal{E}_{\text{uni}}$ . This space integrates three distinct types of connectivity: (1) Structural Edges ( $E_{\text{struc}}$ ) for local context, (2) Hierarchical Edges ( $E_{\text{hier}}$ ) for vertical drill-down, and (3) Causal Gates ( $\mathcal{G}_{c}$ ) for cross-module reasoning.
$$
\mathcal{E}_{\text{uni}}={E}_{\text{struc}}\cup E_{\text{hier}}\cup\mathcal{G}_{c}. \tag{2}
$$
The expansion follows a Best-First Search guided by a query-conditioned gain function. For a frontier node $v$ reached from a predecessor $u$ at hop $t$ , the gain is defined as:
$$
\text{Gain}(v)=s(q,v)\cdot\gamma^{t}\cdot w(\text{type}(u,v)), \tag{3}
$$
where $\gamma\in(0,1)$ is a standard decay factor to penalize long-distance traversal. The weight function $w(·)$ adjusts traversal priorities: we simply assign higher importance to causal gates and hierarchical links to encourage logic-driven jumps over random structural walks. By traversing $\mathcal{E}_{\text{uni}}$, HugRAG prioritizes paths that drill down (via $E_{\text{hier}}$), explore locally (via $E_{\text{struc}}$), or leap to a causally related domain (via $\mathcal{G}_{c}$), effectively breaking modular isolation. The expansion terminates when the gain drops below a threshold or the token budget is exhausted.
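The traversal above can be sketched as a best-first search with the gain of Eq. (3). This is a simplified illustration: `sim(v)` stands in for the query-conditioned score $s(q,v)$, the edge-type weights and the node budget are assumed hyperparameters, and the edge-type labels are ours.

```python
import heapq

def gated_expansion(seeds, edges, sim, gamma=0.8, type_weight=None,
                    gain_floor=0.05, budget=20):
    """Best-first expansion over the unified edge space.
    Gain(v) = sim(v) * gamma**t * w(edge type); `edges` maps a node to
    (neighbor, edge_type) pairs over struct / hier / gate edges."""
    w = type_weight or {"struct": 1.0, "hier": 1.2, "gate": 1.5}
    visited = set(seeds)
    frontier = [(-sim(v), v, 0) for v in seeds]  # max-heap via negation
    heapq.heapify(frontier)
    subgraph = []
    while frontier and len(visited) < budget:
        neg_gain, u, t = heapq.heappop(frontier)
        if -neg_gain < gain_floor:
            break  # terminate once the best remaining gain is too small
        subgraph.append(u)
        for v, etype in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            gain = sim(v) * (gamma ** (t + 1)) * w[etype]
            heapq.heappush(frontier, (-gain, v, t + 1))
    return subgraph
```

Note how the gate weight lets a causally linked but topologically distant node outrank a nearby yet weakly relevant structural neighbor.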
| Datasets | Nodes | Edges | Modules | Size (Char) | Domain |
| --- | --- | --- | --- | --- | --- |
| MS MARCO (Bajaj et al., 2018) | 3,403 | 3,107 | 446 | 1,557,990 | Web |
| NQ (Kwiatkowski et al., 2019) | 5,579 | 4,349 | 505 | 767,509 | Wikipedia |
| 2WikiMultiHopQA (Ho et al., 2020) | 10,995 | 8,489 | 1,088 | 1,756,619 | Wikipedia |
| QASC (Khot et al., 2020) | 77 | 39 | 4 | 58,455 | Science |
| HotpotQA (Yang et al., 2018) | 20,354 | 15,789 | 2,359 | 2,855,481 | Wikipedia |
| HolisQA-Biology | 1,714 | 1,722 | 165 | 1,707,489 | Biology |
| HolisQA-Business | 2,169 | 2,392 | 292 | 1,671,718 | Business |
| HolisQA-CompSci | 1,670 | 1,667 | 158 | 1,657,390 | Computer Science |
| HolisQA-Medicine | 1,930 | 2,124 | 226 | 1,706,211 | Medicine |
| HolisQA-Psychology | 2,019 | 1,990 | 211 | 1,751,389 | Psychology |
Table 2: Statistics of the datasets used in evaluation.
4.3 Causal Path Identification and Grounding
The raw subgraph $S_{raw}$ retrieved via gated expansion optimizes for recall but inevitably includes spurious associations (e.g., high-degree hubs or coincidental co-occurrences). To address the local spurious noise challenge (Section 3), HugRAG employs a causal path refinement stage to directly distill $S_{raw}$ into a causally grounded graph $S^{\star}$ . See Appendix D for a full example of the HugRAG pipeline.
Causal Path Refinement.
We formulate the path refinement task as a structural pruning process. We first linearize the subgraph $S_{raw}$ into a token-efficient table where each node and edge is mapped to a unique short identifier (see Appendix B.3). The LLM is then prompted to analyze the topology and output the subset of identifiers that constitute valid causal paths connecting the query to the potential answer. Leveraging the robust causal identification capabilities of LLMs (Saklad et al., 2026), this operation effectively functions as a reranker, distilling the noisy subgraph into an explicit causal structure:
$$
S^{\star}=\textsc{LLM-CausalExpert}(S_{raw},q). \tag{4}
$$
The returned subgraph $S^{\star}$ contains only model-validated nodes and edges, effectively filtering irrelevant context.
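The linearization and pruning steps can be sketched as follows, assuming (as in Appendix A.1) that the LLM replies with JSON listing the validated identifiers under a `precise` key; the helper names and the tab-separated table format are illustrative assumptions:

```python
import json

def linearize(nodes, edges):
    """Assign short IDs and render the subgraph as a token-efficient table.

    `nodes` is a list of node texts; `edges` is a list of
    (head, relation, tail) triples over those texts.
    """
    nid = {n: f"N{i}" for i, n in enumerate(nodes)}
    rows = [f"{nid[n]}\t{n}" for n in nodes]
    rows += [f"E{i}\t{nid[h]} -[{r}]-> {nid[t]}"
             for i, (h, r, t) in enumerate(edges)]
    return nid, "\n".join(rows)

def prune(nodes, edges, llm_json_reply):
    """Keep only the identifiers the LLM validated as causal ('precise')."""
    keep = set(json.loads(llm_json_reply)["precise"])
    nid = {n: f"N{i}" for i, n in enumerate(nodes)}
    kept_nodes = [n for n in nodes if nid[n] in keep]
    kept_edges = [(h, r, t) for i, (h, r, t) in enumerate(edges)
                  if f"E{i}" in keep]
    return kept_nodes, kept_edges
```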
Spurious-Aware Grounding.
To further improve the precision of this selection, we employ a spurious-aware prompting strategy (see prompts in Appendix A.1). In this configuration, the LLM is instructed to explicitly distinguish between causal supports and spurious correlations during its reasoning process. While the prompt may ask the model to identify spurious items as an auxiliary reasoning step, the primary objective remains the extraction of the valid causal subset. This explicit contrast helps the model resist hallucinated connections induced by semantic similarity, yielding a cleaner $S^{\star}$ compared to standard selection prompts and consequently improving downstream generation quality. This mechanism specifically targets the precision challenges outlined in Section 4.2. Finally, the answer $y$ is generated by conditioning the LLM solely on the text content corresponding to the pruned subgraph $S^{\star}$ (see prompts in Appendix A.2), ensuring that the generation is strictly grounded in verified evidence.
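A minimal sketch of the final grounding step, assuming the LLM is exposed as a plain `llm(prompt) -> str` callable; the prompt wording here is hypothetical (the actual prompt appears in Appendix A.2):

```python
def grounded_answer(llm, query, kept_nodes, kept_edges):
    """Generate an answer conditioned only on the pruned subgraph S*."""
    evidence = "\n".join(
        kept_nodes + [f"{h} --{r}--> {t}" for h, r, t in kept_edges])
    prompt = ("Answer the question strictly using the evidence below; "
              "do not use outside knowledge.\n"
              f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:")
    return llm(prompt)
```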
5 Experiments
Overview.
We conducted extensive experiments on diverse datasets across various domains to comprehensively evaluate and compare the performance of HugRAG against competitive baselines. Our analysis is guided by the following five research questions:
RQ1 (Overall Performance). How does HugRAG compare against state-of-the-art graph-based baselines across diverse, real-world knowledge domains?
RQ2 (QA vs. Holistic Comprehension). Do popular QA datasets implicitly favor the entity-centric retrieval paradigm, thereby inflating the scores of graph-based RAG methods that find the right node without assembling a support chain?
RQ3 (Trade-off Reconciliation). Can HugRAG simultaneously improve Context Recall (Globality) and Answer Relevancy (Precision), mitigating the classic trade-off via hierarchical causal gating?
RQ4 (Ablation Study). What are the individual contributions of different components in HugRAG?
RQ5 (Scalability Robustness). How does HugRAG's performance scale and remain robust under varying context lengths?
Table 3: Main results on HolisQA across five domains. We report F1 (answer overlap), CR (Context Recall: how much gold context is covered by retrieved evidence), and AR (Answer Relevancy: evaluator-judged relevance of the answer to the question), all scaled to $\%$ for readability. Bold indicates best per column. NaiveGeneration has CR $=0$ by definition (no retrieval).
| Baselines | Medicine F1 | Medicine CR | Medicine AR | Computer Science F1 | Computer Science CR | Computer Science AR | Business F1 | Business CR | Business AR | Biology F1 | Biology CR | Biology AR | Psychology F1 | Psychology CR | Psychology AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Naive Baselines** | | | | | | | | | | | | | | | |
| NaiveGeneration | 12.63 | 0.00 | 44.70 | 18.93 | 0.00 | 48.79 | 18.58 | 0.00 | 46.14 | 11.71 | 0.00 | 45.76 | 22.91 | 0.00 | 50.00 |
| BM25 | 17.72 | 52.04 | 50.64 | 24.00 | 39.12 | 52.40 | 28.11 | 37.06 | 55.52 | 19.61 | 43.02 | 52.32 | 30.46 | 33.44 | 56.63 |
| StandardRAG | 26.87 | 61.08 | 56.24 | 28.87 | 49.44 | 57.10 | 47.57 | 46.79 | 67.42 | 28.31 | 42.69 | 57.58 | 37.19 | 52.21 | 59.85 |
| **Graph-based RAG** | | | | | | | | | | | | | | | |
| GraphRAG Global | 17.13 | 54.56 | 48.19 | 23.75 | 37.65 | 53.17 | 23.62 | 25.01 | 48.12 | 20.67 | 40.90 | 52.41 | 31.09 | 34.26 | 54.62 |
| GraphRAG Local | 19.03 | 56.07 | 49.52 | 25.10 | 39.90 | 53.30 | 25.01 | 27.36 | 49.05 | 22.21 | 41.88 | 52.73 | 32.31 | 35.22 | 55.02 |
| LightRAG | 12.16 | 52.38 | 44.15 | 22.59 | 41.86 | 51.62 | 29.98 | 34.22 | 54.50 | 17.70 | 41.24 | 50.32 | 33.63 | 45.54 | 56.42 |
| **Structural / Causal Augmented** | | | | | | | | | | | | | | | |
| HippoRAG2 | 21.12 | 57.50 | 51.08 | 16.94 | 21.05 | 47.29 | 21.10 | 18.34 | 45.83 | 12.60 | 16.85 | 44.56 | 20.10 | 34.13 | 46.77 |
| LeanRAG | 34.25 | 60.43 | 56.60 | 30.51 | 57.61 | 55.45 | 48.30 | 59.29 | 60.35 | 33.82 | 58.43 | 56.10 | 42.85 | 57.46 | 58.65 |
| CausalRAG | 31.12 | 58.90 | 58.77 | 30.98 | 54.10 | 57.54 | 45.20 | 44.55 | 66.10 | 33.50 | 51.20 | 58.90 | 42.80 | 55.60 | 61.90 |
| HugRAG (ours) | 36.45 | 69.91 | 60.65 | 31.60 | 60.94 | 58.34 | 51.51 | 67.34 | 68.76 | 34.80 | 61.97 | 59.99 | 44.42 | 60.87 | 63.53 |
Table 4: Main results on five QA datasets. Metrics follow Section 5: F1, CR (Context Recall), and AR (Answer Relevancy), reported in $\%$ . Bold and underline denote best and second-best per column.
| Baselines | MS MARCO F1 | MS MARCO CR | MS MARCO AR | NQ F1 | NQ CR | NQ AR | 2WikiMultiHopQA F1 | 2WikiMultiHopQA CR | 2WikiMultiHopQA AR | QASC F1 | QASC CR | QASC AR | HotpotQA F1 | HotpotQA CR | HotpotQA AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Naive Baselines** | | | | | | | | | | | | | | | |
| NaiveGeneration | 5.28 | 0.00 | 15.06 | 7.17 | 0.00 | 10.94 | 9.15 | 0.00 | 11.77 | 2.69 | 0.00 | 13.74 | 14.38 | 0.00 | 15.74 |
| BM25 | 6.97 | 45.78 | 20.33 | 4.68 | 49.98 | 9.13 | 9.43 | 37.12 | 13.73 | 2.49 | 6.12 | 13.17 | 15.81 | 41.08 | 16.08 |
| StandardRAG | 14.93 | 48.55 | 31.11 | 7.57 | 45.82 | 11.14 | 10.33 | 32.28 | 13.57 | 2.01 | 5.50 | 13.16 | 6.68 | 43.17 | 14.66 |
| **Graph-based RAG** | | | | | | | | | | | | | | | |
| GraphRAG Global | 9.41 | 3.65 | 13.08 | 3.91 | 4.48 | 8.00 | 1.41 | 9.42 | 9.55 | 0.68 | 3.38 | 3.56 | 6.28 | 14.59 | 16.26 |
| GraphRAG Local | 30.87 | 25.71 | 57.76 | 23.56 | 44.56 | 44.68 | 18.85 | 32.03 | 37.29 | 8.30 | 9.54 | 46.59 | 33.14 | 44.07 | 40.82 |
| LightRAG | 37.70 | 54.22 | 63.54 | 24.97 | 60.65 | 50.53 | 14.44 | 40.98 | 36.56 | 8.20 | 20.40 | 44.35 | 28.39 | 48.17 | 43.78 |
| **Structural / Causal Augmented** | | | | | | | | | | | | | | | |
| HippoRAG2 | 23.35 | 45.45 | 55.18 | 29.64 | 57.21 | 37.50 | 18.47 | 55.53 | 17.34 | 14.73 | 4.38 | 49.94 | 38.80 | 42.06 | 24.66 |
| LeanRAG | 38.02 | 54.01 | 58.49 | 35.46 | 65.91 | 49.87 | 20.27 | 40.53 | 38.37 | 13.19 | 22.80 | 45.51 | 48.68 | 46.29 | 43.50 |
| CausalRAG | 27.66 | 39.38 | 46.03 | 29.45 | 68.04 | 17.35 | 15.93 | 28.38 | 19.76 | 7.65 | 46.86 | 35.56 | 40.00 | 27.83 | 21.32 |
| HugRAG (ours) | 38.40 | 60.48 | 66.02 | 49.50 | 70.36 | 55.09 | 31.97 | 41.95 | 42.67 | 13.35 | 70.80 | 49.40 | 64.83 | 40.30 | 45.72 |
5.1 Experimental Setup
Datasets.
We evaluate HugRAG on a diverse suite of datasets covering complementary difficulty profiles. For standard evaluation, we use five established datasets: MS MARCO (Bajaj et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) emphasize large-scale open-domain retrieval; HotpotQA (Yang et al., 2018) and 2WikiMultiHopQA (Ho et al., 2020) require evidence aggregation; and QASC (Khot et al., 2020) targets compositional scientific reasoning. However, these datasets often suffer from entity-centric biases and potential data leakage (memorization by LLMs). To rigorously test the holistic understanding capability of RAG, we introduce HolisQA, a dataset derived from high-quality academic papers sourced from OpenAlex (Priem et al., 2022). Spanning diverse domains (including Biology, Computer Science, Medicine, etc.), HolisQA features dense logical structures that naturally demand holistic comprehension (see more details in Appendix F.2). All dataset statistics are summarized in Table 2. Although LLMs have demonstrated strong capabilities both in identifying causality (Ma, 2024; Dong et al., 2025) and in RAG pipelines (Wang et al., 2025a), we additionally incorporated cross-domain expert review to validate the quality of baseline answers and confirm the legitimacy of the induced causal relations.
Baselines.
We compare HugRAG against eight baselines spanning three retrieval paradigms. First, to cover Naive and Flat approaches, we include Naive Generation (no retrieval) as a lower bound, alongside BM25 (sparse) and Standard RAG (Lewis et al., 2021) (dense embedding-based), representing mainstream unstructured retrieval. Second, we evaluate established graph-based frameworks: GraphRAG (Local and Global) (Edge et al., 2024), utilizing community summaries; and LightRAG (Guo et al., 2024), relying on dual-level keyword-based search. Third, we benchmark against RAGs with structured or causal augmentation: HippoRAG 2 (Gutiérrez et al., 2025), utilizing passage nodes and Personalized PageRank diffusion; LeanRAG (Zhang et al., 2025), employing semantic aggregation hierarchies and tree-based LCA retrieval; and CausalRAG (Wang et al., 2025a), which integrates causal graphs into retrieval but lacks hierarchical knowledge organization. This selection comprehensively covers the spectrum from unstructured search to advanced structure-aware and causally augmented graph methods.
Metrics.
For metrics, we first report the token-level answer quality metric F1 for surface robustness. To measure whether retrieval actually supports generation, we additionally compute grounding metrics, context recall and answer relevancy (Es et al., 2024), which jointly capture coverage and answer quality (see Appendix F.4).
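For reference, the token-level F1 is the harmonic mean of token precision and recall between the predicted and gold answers; the whitespace tokenization and lowercasing below are simplifying assumptions, not necessarily the exact normalization used in the evaluation:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a gold answer string."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```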
Implementation Details.
For all experiments, we utilize gpt-5-nano as the backbone LLM for both the open information extraction and generation stages, and Sentence-BERT (Reimers and Gurevych, 2019) for semantic vectorization. For HugRAG, we set the hierarchical seed budget to $K_{L}=3$ for modules and $K_{0}=3$ for entities; the causal gate is enabled by default except in the ablation study. Experiments run on a cluster using 10-way job arrays; each task uses 2 CPU cores and 16 GB RAM (20 cores and 160 GB in total). See more implementation details in Appendix F.3.
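As a concrete illustration of the hierarchical seeding, the sketch below selects the top-$K_L$ modules and top-$K_0$ entities by embedding similarity to the query; the function name, dot-product scoring, and the assumption of pre-normalized Sentence-BERT vectors are ours, not the paper's exact implementation:

```python
import numpy as np

def top_k_seeds(query_vec, module_vecs, entity_vecs, k_l=3, k_0=3):
    """Pick the K_L most similar modules and K_0 most similar entities.

    All vectors are assumed L2-normalized, so the dot product equals
    cosine similarity.
    """
    def top_k(vecs, k):
        sims = vecs @ query_vec
        # argsort on negated scores gives indices in descending order.
        return np.argsort(-sims)[:k].tolist()
    return top_k(module_vecs, k_l), top_k(entity_vecs, k_0)
```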
5.2 Main Experiments
Overall Performance (RQ1).
HugRAG consistently achieves superior performance across all HolisQA domains and standard QA datasets (Tables 3 and 4). While traditional methods (e.g., BM25, Standard RAG) struggle with structural dependencies, graph-based baselines exhibit distinct limitations. GraphRAG-Global relies heavily on high-level community summaries and performs poorly on fine-grained QA tasks, requiring its GraphRAG-Local variant to balance the granularity trade-off. LightRAG struggles to achieve competitive results, limited by its coarse-grained key-value lookup mechanism. Among structurally augmented methods, while LeanRAG (semantic aggregation) and HippoRAG2 (phrase/passage nodes) yield slight improvements in context recall, they fail to fully break information isolation compared to our causal gating mechanism. Finally, although CausalRAG occasionally attains high Answer Relevancy due to its causal reasoning capability, it struggles to scale to large datasets because it lacks efficient knowledge graph organization.
Holistic Comprehension vs. QA (RQ2).
The contrast between the results on HolisQA (Table 3) and standard QA datasets (Table 4) is revealing. On popular QA benchmarks, entity-centric methods such as LightRAG, GraphRAG-Local, and LeanRAG occasionally achieve strong scores. However, their performance degrades collectively and significantly on HolisQA. A striking counterexample is GraphRAG-Global: while its reliance on community summaries hindered performance on granular standard QA tasks, it rebounds significantly on HolisQA. This discrepancy strongly suggests that standard QA datasets, which often favor short answers, implicitly reward the entity-centric paradigm. In contrast, HolisQA, with its open-ended questions and dense logical structures, demands a comprehensive understanding of the underlying document, a scenario closer to real-world applications. Notably, HugRAG is the only framework that remains robust across this paradigm shift, demonstrating competitive performance on both entity-centric QA and holistic comprehension tasks.
Reconciling the Accuracy-Grounding Trade-off (RQ3).
HugRAG effectively reconciles the fundamental tension between Recall and Precision. While hierarchical causal gating expands traversal boundaries to secure superior Context Recall (Globality), the explicit causal path identification rigorously prunes spurious noise to maintain high F1 Score and Answer Relevancy (Locality). This dual mechanism allows HugRAG to simultaneously optimize for global coverage and local groundedness, achieving a balance often missed by prior methods.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Different Model Configurations
### Overview
The image is a bar chart comparing the performance of different model configurations across three metrics: F1, CR, and AR. The chart displays the score achieved by each configuration for each metric, allowing for a direct comparison of their effectiveness. The configurations vary based on the inclusion or exclusion of H, CG, and Causal components.
### Components/Axes
* **Y-axis:** "Score", ranging from 0 to 70, with tick marks at intervals of 10.
* **X-axis:** "Metric", with three categories: F1, CR, and AR.
* **Legend:** Located at the top of the chart, it identifies the model configurations represented by different colors:
* **Teal:** w/o H · w/o CG · w/o Causal
* **Yellow:** w/ H · w/o CG · w/o Causal
* **Blue:** w/ H · w/ CG · w/o Causal
* **Pink:** w/o H · w/o CG · w/ Causal
* **Green:** w/ H · w/ CG · w/ Causal
* **Orange:** w/ H · w/ CG · w/ SP-Causal
### Detailed Analysis
Here's a breakdown of the scores for each configuration across the three metrics:
* **F1:**
* w/o H · w/o CG · w/o Causal (Teal): 26.8
* w/ H · w/o CG · w/o Causal (Yellow): 24.0
* w/ H · w/ CG · w/o Causal (Blue): 23.3
* w/o H · w/o CG · w/ Causal (Pink): 30.1
* w/ H · w/ CG · w/ Causal (Green): 36.8
* w/ H · w/ CG · w/ SP-Causal (Orange): 38.6
* **CR:**
* w/o H · w/o CG · w/o Causal (Teal): 54.7
* w/ H · w/o CG · w/o Causal (Yellow): 58.0
* w/ H · w/ CG · w/o Causal (Blue): 60.2
* w/o H · w/o CG · w/ Causal (Pink): 55.4
* w/ H · w/ CG · w/ Causal (Green): 60.0
* w/ H · w/ CG · w/ SP-Causal (Orange): 60.4
* **AR:**
* w/o H · w/o CG · w/o Causal (Teal): 55.7
* w/ H · w/o CG · w/o Causal (Yellow): 53.6
* w/ H · w/ CG · w/o Causal (Blue): 52.6
* w/o H · w/o CG · w/ Causal (Pink): 52.6
* w/ H · w/ CG · w/ Causal (Green): 64.1
* w/ H · w/ CG · w/ SP-Causal (Orange): 67.4
### Key Observations
* For F1, the configurations including "w/ H · w/ CG" (Green and Orange) perform significantly better than the others.
* For CR, the scores are relatively close across all configurations, with "w/ H · w/ CG · w/ SP-Causal" (Orange) showing a slight edge.
* For AR, the "w/ H · w/ CG · w/ SP-Causal" (Orange) configuration achieves the highest score, followed closely by "w/ H · w/ CG · w/ Causal" (Green).
### Interpretation
The data suggests that including both H and CG components generally improves performance, especially for the F1 metric. The inclusion of "SP-Causal" (w/ H · w/ CG · w/ SP-Causal) appears to provide the best overall performance, particularly for the AR metric. The configurations that exclude H and CG tend to have lower scores across all metrics. The specific impact of "Causal" varies depending on the metric and the presence of other components. The "w/ H · w/ CG · w/ SP-Causal" configuration consistently performs well, indicating that it may be the most effective model configuration among those tested.
</details>
Figure 3: Ablation Study. H: Hierarchical Structure; CG: Causal Gates; Causal/SP-Causal: Standard vs. Spurious-Aware Causal Identification. w/o and w/ denote exclusion or inclusion.
5.3 Ablation Study
To address RQ4, we ablate hierarchy, causal gates, and causal path refinement components (see Figure 3), finding that their combination yields optimal results. Specifically, we observe a mutually reinforcing dynamic: while hierarchical gates break information isolation to boost recall, the spurious-aware causal identification is indispensable for filtering the resulting noise and achieving a significant improvement. This mutual reinforcement allows HugRAG to reconcile global coverage with local groundedness, significantly outperforming any isolated component.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: RAG Performance vs. Source Text Length
### Overview
The image is a line chart comparing the performance (Score) of different Retrieval-Augmented Generation (RAG) models against varying lengths of source text (in characters). The chart displays multiple lines, each representing a different RAG model, with the x-axis indicating source text length and the y-axis indicating the performance score.
### Components/Axes
* **Y-axis:** "Score", ranging from 0 to 60, with tick marks at intervals of 10.
* **X-axis:** "Source Text Length (chars)", with values 5K, 10K, 25K, 100K, 300K, 750K, 1M, and 1.5M.
* **Legend:** Located at the top of the chart, listing the RAG models and their corresponding line colors/styles:
* Naive (Gray with circle markers)
* BM25 (Dark Gray with square markers)
* Standard RAG (Light Gray with triangle markers)
* GraphRAG Global (Blue with square markers)
* GraphRAG Local (Dark Blue with star markers)
* LightRAG (Teal with triangle markers)
* HippoRAG2 (Light Blue with circle markers)
* LeanRAG (Blue-Gray dashed line with x markers)
* CausalRAG (Light Blue with diamond markers)
* HugRAG (Red with star markers)
### Detailed Analysis
* **Naive (Gray with circle markers):** The line starts at approximately 8, decreases to approximately 3 at 10K, increases to approximately 5 at 25K, remains relatively flat around 4-5 until 1M, and ends at approximately 6 at 1.5M.
* **BM25 (Dark Gray with square markers):** The line starts at approximately 20, decreases to approximately 17 at 10K, decreases to approximately 15 at 25K, increases to approximately 20 at 100K, decreases to approximately 17 at 300K, remains relatively flat around 16-17 until 1M, and ends at approximately 17 at 1.5M.
* **Standard RAG (Light Gray with triangle markers):** The line starts at approximately 18, increases to approximately 24 at 10K, decreases to approximately 18 at 25K, increases to approximately 20 at 100K, decreases to approximately 18 at 300K, remains relatively flat around 17-18 until 1M, and ends at approximately 17 at 1.5M.
* **GraphRAG Global (Blue with square markers):** The line starts at approximately 8, decreases to approximately 3 at 10K, increases to approximately 9 at 25K, increases to approximately 10 at 100K, decreases to approximately 6 at 300K, remains relatively flat around 5-6 until 1M, and ends at approximately 6 at 1.5M.
* **GraphRAG Local (Dark Blue with star markers):** The line starts at approximately 48, decreases to approximately 38 at 10K, decreases to approximately 28 at 25K, increases to approximately 40 at 100K, decreases to approximately 30 at 300K, increases to approximately 33 at 750K, remains relatively flat around 32-33 until 1.5M.
* **LightRAG (Teal with triangle markers):** The line starts at approximately 44, increases to approximately 47 at 10K, decreases to approximately 43 at 25K, increases to approximately 46 at 100K, decreases to approximately 45 at 300K, remains relatively flat around 45-46 until 1M, and ends at approximately 45 at 1.5M.
* **HippoRAG2 (Light Blue with circle markers):** The line starts at approximately 30, decreases to approximately 28 at 10K, decreases to approximately 24 at 25K, increases to approximately 30 at 100K, increases to approximately 31 at 300K, remains relatively flat around 31-32 until 1M, and ends at approximately 33 at 1.5M.
* **LeanRAG (Blue-Gray dashed line with x markers):** The line starts at approximately 49, increases to approximately 50 at 10K, decreases to approximately 46 at 25K, increases to approximately 52 at 100K, decreases to approximately 51 at 300K, remains relatively flat around 51 until 1M, and ends at approximately 48 at 1.5M.
* **CausalRAG (Light Blue with diamond markers):** The line starts at approximately 14, increases to approximately 24 at 10K, increases to approximately 26 at 25K, increases to approximately 29 at 100K, decreases to approximately 25 at 300K, remains relatively flat around 24-25 until 1M, and ends at approximately 24 at 1.5M.
* **HugRAG (Red with star markers):** The line starts at approximately 54, decreases to approximately 53 at 10K, decreases to approximately 49 at 25K, increases to approximately 57 at 100K, decreases to approximately 56 at 300K, increases to approximately 57 at 750K, remains relatively flat around 57 until 1M, and ends at approximately 55 at 1.5M.
### Key Observations
* HugRAG consistently achieves the highest scores across all source text lengths.
* Naive and GraphRAG Global consistently perform the worst.
* The performance of most RAG models fluctuates with source text length, but HugRAG remains relatively stable.
* There is a general trend of performance increasing from 5K to 100K characters, followed by a slight decrease or stabilization.
### Interpretation
The chart suggests that the HugRAG model is the most robust and effective across different source text lengths compared to the other RAG models tested. The performance fluctuations observed in other models indicate that their effectiveness may be more sensitive to the length of the input text. The consistently low performance of Naive and GraphRAG Global suggests that these models may require further optimization or are not well-suited for the tested task. The initial increase in performance up to 100K characters for some models could indicate an optimal range for source text length, beyond which performance plateaus or slightly declines.
</details>
Figure 4: Scalability analysis of HugRAG and other RAG baselines across varying source text lengths (5K to 1.5M characters).
5.4 Scalability Analysis
Robustness to Information Scale (RQ5).
To assess robustness against information overload, we evaluated performance across varying source text lengths (5K to 1.5M characters) sampled from HolisQA, reporting the mean of F1, Context Recall, and Answer Relevancy (see Figure 4). As illustrated, HugRAG (red line) exhibits remarkable stability across all scales, maintaining high scores even at 1.5M characters. This confirms that our hierarchical causal gating structure effectively encapsulates complexity, enabling the retrieval process to scale via causal gates without degrading reasoning fidelity.
6 Conclusion
We introduced HugRAG to resolve information isolation and spurious noise in graph-based RAG. By leveraging hierarchical causal gating and explicit identification, HugRAG reconciles global context coverage with local evidence grounding. Experiments confirm its superior performance not only in standard QA but also in holistic comprehension, alongside robust scalability to large knowledge bases. Additionally, we introduced HolisQA to evaluate complex reasoning capabilities for RAG. We hope our findings contribute to the ongoing development of RAG research.
Impact Statement
This paper presents work whose goal is to advance the field of machine learning, specifically by improving the reliability and interpretability of retrieval-augmented generation. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang (2018) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv. External Links: 1611.09268, Document Cited by: Table 2, §5.1.
- X. Dai, K. Guo, C. Lo, S. Zeng, J. Ding, D. Luo, S. Mukherjee, and J. Tang (2025) GraphGhost: Tracing Structures Behind Large Language Models. arXiv. External Links: 2510.08613, Document Cited by: §2.2.
- G. Dong, J. Jin, X. Li, Y. Zhu, Z. Dou, and J. Wen RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation. Cited by: §2.1.
- J. Dong, Y. Liu, A. Aloui, V. Tarokh, and D. Carlson (2025) CARE: Turning LLMs Into Causal Reasoning Expert. arXiv. External Links: 2511.16016, Document Cited by: §2.2, §3, §5.1.
- D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson (2024) From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv. External Links: 2404.16130 Cited by: Figure 8, §B.1, Table 1, §1, §2.1, §2.1, §4.2, §5.1.
- S. Es, J. James, L. Espinosa Anke, and S. Schockaert (2024) RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, N. Aletras and O. De Clercq (Eds.), St. Julians, Malta, pp. 150–158. External Links: Document Cited by: Figure 15, §F.3, §F.4, §5.1.
- S. Fortunato and M. Barthélemy (2007) Resolution limit in community detection. Proceedings of the National Academy of Sciences 104 (1), pp. 36–41. External Links: Document Cited by: §1, §2.1.
- Z. Guo, L. Xia, Y. Yu, T. Ao, and C. Huang (2024) LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv. External Links: 2410.05779 Cited by: Table 1, §1, §2.1, §5.1.
- B. J. Gutiérrez, Y. Shu, W. Qi, S. Zhou, and Y. Su (2025) From RAG to Memory: Non-Parametric Continual Learning for Large Language Models. arXiv. External Links: 2502.14802, Document Cited by: Table 1, §1, §2.1, §2.1, §5.1.
- X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. arXiv. External Links: 2011.01060, Document Cited by: Table 2, §5.1.
- E. Khatibi, Z. Wang, and A. M. Rahmani (2025) CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation. arXiv. External Links: 2504.12560, Document Cited by: §2.2.
- T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal (2020) QASC: A Dataset for Question Answering via Sentence Composition. arXiv. External Links: 1910.11473, Document Cited by: Table 2, §5.1.
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 452–466. External Links: Document Cited by: §2.1, Table 2, §5.1.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2021) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv. External Links: 2005.11401, Document Cited by: Table 1, §1, §2.1, §5.1.
- H. Liu, Z. Wang, X. Chen, Z. Li, F. Xiong, Q. Yu, and W. Zhang (2025a) HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation. arXiv. External Links: 2502.12442, Document Cited by: §2.1.
- H. Liu, S. Wang, and J. Li (2025b) Knowledge Graph Retrieval-Augmented Generation via GNN-Guided Prompting. Cited by: §1, §2.1.
- H. Luo, J. Zhang, and C. Li (2025) Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs. arXiv. External Links: 2501.14892, Document Cited by: §2.2.
- H. Luo, Q. Lin, Y. Feng, Z. Kuang, M. Song, Y. Zhu, and L. A. Tuan HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation. Cited by: §1, §2.1.
- J. Ma (2024) Causal Inference with Large Language Model: A Survey. arXiv. External Links: 2409.09822 Cited by: §2.2, §3, §5.1.
- M. Newman (2018) Networks. Vol. 1, Oxford University Press. External Links: Document, ISBN 978-0-19-880509-0 Cited by: §2.1.
- J. Priem, H. Piwowar, and R. Orr (2022) OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Cited by: §F.2, §5.1.
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv. External Links: 1606.05250, Document Cited by: §2.1.
- C. Ravuru, S. S. Sakhinana, and V. Runkana (2024) Agentic Retrieval-Augmented Generation for Time Series Analysis. arXiv. External Links: 2408.14484, Document Cited by: §1, §2.1.
- N. Reimers and I. Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3980–3990. External Links: Document Cited by: §F.3, §5.1.
- R. Saklad, A. Chadha, O. Pavlov, and R. Moraffah (2026) Can Large Language Models Infer Causal Relationships from Real-World Text?. arXiv. External Links: 2505.18931, Document Cited by: §2.2, §4.3.
- V. Traag, L. Waltman, and N. J. van Eck (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9 (1), pp. 5233. External Links: 1810.08473, ISSN 2045-2322, Document Cited by: §B.1, §4.1.
- C. Walker and R. Ewetz (2025) Explaining the Reasoning of Large Language Models Using Attribution Graphs. arXiv. External Links: 2512.15663, Document Cited by: §2.2.
- N. Wang, X. Han, J. Singh, J. Ma, and V. Chaudhary (2025a) CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 22680–22693. External Links: Document, ISBN 979-8-89176-256-5 Cited by: Table 1, §2.2, §3, §5.1, §5.1.
- S. Wang, Z. Chen, P. Wang, Z. Wei, Z. Tan, Y. Meng, C. Shen, and J. Li (2025b) Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation. arXiv. External Links: 2511.04700, Document Cited by: §2.1.
- X. Wang, Z. Liu, J. Han, and S. Deng RAG4GFM: Bridging Knowledge Gaps in Graph Foundation Models through Graph Retrieval Augmented Generation. Cited by: §2.1.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv. External Links: 1809.09600, Document Cited by: Table 2, §5.1.
- Y. Zhang, R. Wu, P. Cai, X. Wang, G. Yan, S. Mao, D. Wang, and B. Shi (2025) LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval. arXiv. External Links: 2508.10391, Document Cited by: Table 1, §1, §2.1, §4.2, §5.1.
- Y. Zhang, Y. Zhang, Y. Gan, L. Yao, and C. Wang (2024) Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models. arXiv. External Links: 2402.15301 Cited by: §2.2.
Appendix A Prompts used in Online Retrieval and Reasoning
This section details the prompt engineering employed during the online retrieval phase of HugRAG. We rely on Large Language Models to perform two critical reasoning tasks: identifying causal paths within the retrieved subgraph and generating the final grounded answer.
A.1 Causal Path Identification
To address the local spurious noise issue, we design a prompt that instructs the LLM to act as a âcausality analyst.â The model receives a linearized list of potential evidence (nodes and edges) and must select the subset that forms a coherent causal chain.
Spurious-Aware Selection (Main Setting).
Our primary prompt, illustrated in Figure 5, explicitly instructs the model to differentiate between valid causal supports (returned under the `precise` key) and spurious associations (returned under the `ct_precise` key). By forcing the model to articulate what is not causal (e.g., mere correlations or topical coincidence), we improve the precision of the selected evidence.
Standard Selection (Ablation).
To verify the effectiveness of spurious differentiation, we also use a simplified prompt variant shown in Figure 6. This version only asks the model to identify valid causal items without explicitly labeling spurious ones.
A.2 Final Answer Generation
Once the spurious-filtered support subgraph $S^{\star}$ is obtained, it is passed to the generation module. The prompt shown in Figure 7 is used to synthesize the final answer. Crucially, this prompt enforces strict grounding by instructing the model to rely only on the provided evidence context, minimizing hallucination.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Text Document: Task Definition for Causality Analysis
### Overview
The image presents a task definition for a causality analyst, outlining the role, goal, input, output format, and constraints for a retrieval and ranking task. The task involves selecting important context items (forming a causal graph) and less important items (spurious information) based on a given query and context items. The output is specified in JSON format.
### Components/Axes
The document is structured into the following sections:
* **Role:** Defines the persona of the task performer.
* **Goal:** Describes the objective of the task.
* **Inputs:** Specifies the required input data.
* **Output Format (JSON):** Defines the structure of the output.
* **Constraints:** Sets limitations on the output.
### Detailed Analysis or ### Content Details
**Role:**
* You are a careful causality analyst acting as a reranker for retrieval.
**Goal:**
* Given a query and a list of context items (short ID + content), select the most important items consisting of the causal graph and output them in "precise".
* Also, output the least important items as the spurious information in "ct_precise".
* You MUST:
* Use only the provided items.
* Rank `precise` from most important to least important.
* Rank `ct_precise` from least important to more important.
* Output JSON only. Do not add markdown.
* Use the short IDs exactly as shown.
* Do NOT include any IDs in `p_answer`.
**Inputs:**
* Query: `{query}`
* Context Items (short ID | content): `{context_table}`
**Output Format (JSON):**
</details>
Figure 5: Prompt for Causal Path Identification with Spurious Distinction (HugRAG Main Setting). The model is explicitly instructed to segregate non-causal associations into a separate list to enhance reasoning precision.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Text Document: Task Definition for Causality Analysis
### Overview
The image presents a task definition for a causality analyst acting as a reranker for retrieval. It outlines the role, goal, inputs, output format, and constraints for the task. The task involves selecting the most important context items to support answering a query as a causal graph.
### Components/Axes
The document is structured into the following sections:
* **Role:** Defines the persona of the agent.
* **Goal:** Describes the objective of the task.
* **Inputs:** Specifies the required inputs for the task.
* **Output Format (JSON):** Defines the structure of the output.
* **Constraints:** Sets limitations on the output.
### Detailed Analysis or ### Content Details
**Role:**
* You are a careful causality analyst acting as a reranker for retrieval.
**Goal:**
* Given a query and a list of context items (short ID + content), select the most important items that best support answering the query as a causal graph.
* You MUST:
* Use only the provided items.
* Rank the 'precise' list from most important to least important.
* Output JSON only. Do not add markdown.
* Use the short IDs exactly as shown.
* Do NOT include any IDs in `p_answer`.
* If evidence is insufficient, say so in `p_answer` (e.g., "Unknown").
**Inputs:**
* Query: `{query}`
* Context Items (short ID | content): `{context_table}`
**Output Format (JSON):**
```json
{
"precise": ["C1", "N2", "E3"],
"p_answer": "concise draft answer"
}
```
**Constraints:**
* `precise` length: at most `{max_precise_items}` items.
* `p_answer` length: at most `{max_answer_words}` words.
### Key Observations
* The task requires ranking context items based on their importance to a given query.
* The output must be in JSON format.
* The length of the `precise` and `p_answer` fields are constrained.
* The agent should return "Unknown" if evidence is insufficient.
### Interpretation
The document defines a specific task for a causality analyst, focusing on information retrieval and ranking. The goal is to identify and rank context items that best support answering a query within a causal graph framework. The constraints ensure that the output is concise and adheres to a specific format. The instruction to return "Unknown" when evidence is insufficient promotes responsible and transparent behavior. The use of short IDs and the exclusion of IDs in the `p_answer` field suggest a focus on content rather than identifiers in the final answer.
</details>
Figure 6: Ablation Prompt: Causal Path Identification without differentiating spurious relationships. This baseline is used to assess the contribution of the spurious filtering mechanism.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Text Block: Instruction Set for an Assistant
### Overview
The image presents a set of instructions or guidelines for an assistant, outlining their role, goal, the context they should use, and the desired format for their answers. It appears to be a template or a set of instructions for a language model or AI assistant.
### Components/Axes
The image is structured as a series of labeled sections, each defining a specific aspect of the assistant's task:
* **Role:** Defines the persona of the assistant.
* **Goal:** Specifies the primary objective of the assistant.
* **Evidence Context:** Indicates the source of information the assistant should use.
* **Draft Answer (optional):** Suggests the possibility of a pre-written answer.
* **Question:** Represents the input query from the user.
* **Answer Format:** Describes the desired style and tone of the assistant's response.
### Detailed Analysis or ### Content Details
The text within each section provides specific instructions:
* **Role:** "You are a helpful assistant answering the user's question."
* **Goal:** "Answer the question using the provided evidence context. A draft answer may be provided; use it only if it is supported by the evidence."
* **Evidence Context:** "{report\_context}" - This suggests a placeholder for the actual evidence context.
* **Draft Answer (optional):** "{draft\_answer}" - This suggests a placeholder for a pre-written answer.
* **Question:** "{query}" - This suggests a placeholder for the user's question.
* **Answer Format:** "Concise, direct, and neutral."
### Key Observations
The instructions emphasize the importance of using provided evidence to answer questions and maintaining a concise, direct, and neutral tone. The use of placeholders like "{report\_context}", "{draft\_answer}", and "{query}" indicates that this is a template to be filled with specific information for each task.
### Interpretation
The image represents a structured approach to guiding an AI assistant in answering questions. By defining the role, goal, context, and format, it aims to ensure that the assistant provides accurate and relevant responses. The optional draft answer suggests a mechanism for pre-approving or guiding the assistant's response, while the emphasis on evidence-based answers promotes reliability and trustworthiness. The instructions are designed to ensure the assistant is helpful, accurate, and maintains a professional tone.
</details>
Figure 7: Prompt for Final Answer Generation. The model is conditioned solely on the filtered causal subgraph $S^{\star}$ to ensure groundedness.
Appendix B Algorithm Details of HugRAG
This section provides granular details on the offline graph construction process and the specific algorithms used during the online retrieval phase, complementing the high-level description in Section 4.
B.1 Graph Construction
Entity Extraction and Deduplication.
The base graph $H_{0}$ is constructed by processing text chunks with an LLM. We use the prompt shown in Figure 8, adapted from (Edge et al., 2024), to extract entities and relations. Since raw extractions from different chunks inevitably contain duplicates (e.g., "J. Biden" vs. "Joe Biden"), we employ a two-stage deduplication strategy. First, we perform surface-level canonicalization using fuzzy string matching. Second, we use embedding similarity to identify semantically identical nodes, merging their textual descriptions and pooling their supporting evidence edges.
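The two-stage strategy can be sketched as follows; `canonicalize`, the thresholds, and the `embed` callable are illustrative assumptions, not the paper's implementation:

```python
from difflib import SequenceMatcher

def canonicalize(entities, fuzzy_threshold=0.9, embed_threshold=0.92, embed=None):
    """Two-stage entity deduplication sketch.

    Stage 1 merges surface variants via fuzzy string matching; stage 2
    (optional) merges semantically identical nodes via embedding cosine
    similarity. `embed` is a hypothetical callable returning a
    unit-normalized vector for a string.
    """
    canonical = {}  # surface form -> canonical name
    names = []      # canonical names seen so far
    for name in entities:
        match = None
        # Stage 1: surface-level fuzzy matching
        for cand in names:
            if SequenceMatcher(None, name.lower(), cand.lower()).ratio() >= fuzzy_threshold:
                match = cand
                break
        # Stage 2: embedding similarity (dot product of unit vectors)
        if match is None and embed is not None:
            v = embed(name)
            for cand in names:
                if sum(a * b for a, b in zip(v, embed(cand))) >= embed_threshold:
                    match = cand
                    break
        canonical[name] = match or name
        if match is None:
            names.append(name)
    return canonical
```

In a full pipeline the merged nodes would also pool their descriptions and evidence edges; the sketch only resolves the name mapping.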
Hierarchical Partitioning.
We employ the Leiden algorithm (Traag et al., 2019) to maximize the modularity $Q$ of the partition. We recursively apply this partitioning to build bottom-up levels $H_{1},...,H_{L}$ , stopping when the summary of a module fits within a single context window.
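As a structure-only sketch (not the paper's implementation), the recursive bottom-up construction might look like this, with `partition_fn` standing in for the Leiden algorithm and `summary_tokens` a hypothetical token-count estimator for a module summary:

```python
def build_hierarchy(base_nodes, partition_fn, summary_tokens, budget=4096, max_levels=6):
    """Bottom-up hierarchy construction sketch.

    `partition_fn` maps a list of units to a list of modules (lists of
    units), as the Leiden algorithm would. We keep coarsening until
    every module's summary fits one context window (`budget` tokens),
    yielding the levels H_1 .. H_L above the base graph H_0.
    """
    levels = []
    units = [[n] for n in base_nodes]  # each base entity starts as its own unit
    for _ in range(max_levels):
        # Partition current units, then flatten each module to its member nodes.
        modules = [[n for unit in mod for n in unit] for mod in partition_fn(units)]
        levels.append(modules)
        if len(modules) <= 1 or all(summary_tokens(m) <= budget for m in modules):
            break  # stopping rule: summaries fit a single context window
        units = modules
    return levels  # [H_1, ..., H_L]
```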
Causal Gates.
The prompt we used to build causal gates is shown in Figure 9. Constructing causal gates via exhaustive pairwise verification across all modules results in a quadratic time complexity $O(N^{2})$ , where $N$ is the total number of modules. Consequently, as the hierarchy depth scales, this becomes computationally prohibitive for LLM-based verification. To address this, we implement a Top-Down Hierarchical Pruning strategy that constructs gates layer-by-layer, from the coarsest semantic level ( $H_{L}$ ) down to $H_{1}$ . The core intuition leverages the transitivity of causality: if a causal link is established between two parent modules, it implicitly covers the causal flow between their respective sub-trees (see full algorithm in Algorithm 2).
The pruning process follows three key rules:
1. Layer-wise Traversal: We iterate from the top layer ($L$, usually sparse) down to the bottom layer ($1$, usually dense).
2. Intra-layer Verification: We first identify causal connections between modules within the current layer.
3. Inter-layer Look-Ahead Pruning: When searching for connections between a module $u$ (current layer) and modules in the next lower layer ($l-1$), we prune the search space by:
- Excluding $u$'s own children (handled by hierarchical inclusion).
- Excluding children of modules already causally connected to $u$. If $u\to v$ is established, we assume the high-level connection covers the relationship, skipping individual checks for $Children(v)$.
This strategy ensures that we only expend computational resources on discovering subtle, granular causal links that were not captured at higher levels, effectively reducing the complexity from quadratic to near-linear in practice.
Algorithm 2 Top-Down Hierarchical Pruning for Causal Gates
Require: Hierarchy $\mathcal{H}=\{H_{0},H_{1},...,H_{L}\}$
Ensure: Set of Causal Gates $\mathcal{G}_{c}$
1: $\mathcal{G}_{c}\leftarrow\emptyset$
2: for $l=L$ down to $1$ do
3: for each module $u\in H_{l}$ do
4: // 1. Intra-layer Verification
5: $ConnectedPeers\leftarrow\emptyset$
6: for $v\in H_{l}\setminus\{u\}$ do
7: if $\text{LLM\_Verify}(u,v)$ then
8: $\mathcal{G}_{c}.\text{add}((u,v))$
9: $ConnectedPeers.\text{add}(v)$
10: end if
11: end for
12: // 2. Inter-layer Pruning (Look-Ahead)
13: if $l>1$ then
14: $Candidates\leftarrow H_{l-1}$
15: // Prune own children
16: $Candidates\leftarrow Candidates\setminus Children(u)$
17: // Prune children of connected parents
18: for $v\in ConnectedPeers$ do
19: $Candidates\leftarrow Candidates\setminus Children(v)$
20: end for
21: // Only verify remaining candidates
22: for $w\in Candidates$ do
23: if $\text{LLM\_Verify}(u,w)$ then
24: $\mathcal{G}_{c}.\text{add}((u,w))$
25: end if
26: end for
27: end if
28: end for
29: end for
30: return $\mathcal{G}_{c}$
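A minimal runnable rendering of Algorithm 2 in Python, with `llm_verify` as a stub for the binary causal check of Figure 9 (all names are illustrative):

```python
def build_causal_gates(hierarchy, children, llm_verify):
    """Top-Down Hierarchical Pruning sketch (Algorithm 2).

    `hierarchy` maps level l -> list of module ids (levels 1..L),
    `children` maps a module id -> set of its children one level down,
    and `llm_verify(u, v)` stands in for the LLM-based binary check.
    """
    gates = set()
    L = max(hierarchy)
    for l in range(L, 0, -1):                         # top layer down to 1
        for u in hierarchy[l]:
            connected = set()
            # 1. Intra-layer verification
            for v in hierarchy[l]:
                if v != u and llm_verify(u, v):
                    gates.add((u, v))
                    connected.add(v)
            # 2. Inter-layer look-ahead pruning
            if l > 1:
                candidates = set(hierarchy[l - 1])
                candidates -= children.get(u, set())  # prune own children
                for v in connected:                   # prune children of connected peers
                    candidates -= children.get(v, set())
                for w in candidates:                  # verify only the survivors
                    if llm_verify(u, w):
                        gates.add((u, w))
    return gates
```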
B.2 Online Retrieval
Hybrid Scoring and Diversity.
To robustly anchor the query, our scoring function combines semantic and lexical signals:
$$
s_{\alpha}(q,x)=\alpha\cdot\cos(\mathrm{Enc}(q),\mathrm{Enc}(x))+(1-\alpha)\cdot\mathrm{Lex}(q,x), \tag{5}
$$
where $\mathrm{Lex}(q,x)$ computes the normalized token overlap between the query and the node's textual attributes (title and summary). We empirically set $\alpha=0.7$ to favor semantic matching while retaining keyword sensitivity for rare entities. To ensure seed diversity, we apply Maximal Marginal Relevance (MMR) selection. Instead of simply taking the Top-$K$, we iteratively select seeds that maximize $s_{\alpha}$ while minimizing similarity to already selected seeds, ensuring the retrieval starts from complementary viewpoints.
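A minimal sketch of Equation 5 together with MMR seed selection; the trade-off parameter `lam` and all helper names are assumptions, as the paper does not specify them:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def lex_overlap(q_text, x_text):
    """Normalized token overlap Lex(q, x) over whitespace tokens."""
    qt, xt = set(q_text.lower().split()), set(x_text.lower().split())
    return len(qt & xt) / len(qt) if qt else 0.0

def hybrid_score(q_vec, x_vec, q_text, x_text, alpha=0.7):
    """Eq. (5): alpha * cosine + (1 - alpha) * lexical overlap."""
    return alpha * cosine(q_vec, x_vec) + (1 - alpha) * lex_overlap(q_text, x_text)

def mmr_select(scores, sim, k, lam=0.5):
    """Greedy MMR: pick k candidates balancing relevance vs. redundancy.

    `scores[i]` is the hybrid relevance s_alpha of candidate i and
    `sim(i, j)` its similarity to an already selected candidate j.
    """
    selected, rest = [], set(range(len(scores)))
    while rest and len(selected) < k:
        best = max(rest, key=lambda i: lam * scores[i]
                   - (1 - lam) * max((sim(i, j) for j in selected), default=0.0))
        selected.append(best)
        rest.remove(best)
    return selected
```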
Edge Type Weights.
In Equation 3, the weight function $w(\text{type}(e))$ controls the traversal behavior. We assign higher weights to Causal Gates ( $w=1.2$ ) and Hierarchical Links ( $w=1.0$ ) to encourage the model to leverage the organized structure, while assigning a lower weight to generic Structural Edges ( $w=0.8$ ) to suppress aimless local wandering.
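This weighting reduces to a simple lookup; the edge-type tags below are assumed labels, and only the weight values come from the text:

```python
# Assumed edge-type tags; the paper specifies only the weight values.
EDGE_WEIGHTS = {"causal_gate": 1.2, "hierarchical": 1.0, "structural": 0.8}

def edge_weight(edge_type):
    """w(type(e)) in Equation 3; unknown types fall back to the generic weight."""
    return EDGE_WEIGHTS.get(edge_type, 0.8)
```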
B.3 Causal Path Reasoning
Graph Linearization Strategy.
To reason over the subgraph $S_{raw}$ within the LLM's context window, we employ a linearization strategy that compresses heterogeneous graph evidence into a token-efficient format. Each evidence item $x\in S_{raw}$ is mapped to a unique short identifier $\mathrm{ID}(x)$. The LLM is provided with a compact list mapping these IDs to their textual content (e.g., "N1: [Entity Description]"). This allows the model to perform selection by outputting a sequence of valid identifiers (e.g., ["N1", "R3", "N5"]), minimizing token overhead.
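A possible sketch of the linearization step; the item kinds and the exact prompt-line format are assumptions beyond the "N1: …" example in the text:

```python
def linearize(subgraph_items):
    """Map heterogeneous evidence items to short IDs for the prompt.

    `subgraph_items` is a list of (kind, text) pairs, where kind is an
    assumed tag such as 'N' (node) or 'R' (relation). Returns the
    ID -> item map and the compact prompt table.
    """
    id_map, lines, counters = {}, [], {}
    for kind, text in subgraph_items:
        counters[kind] = counters.get(kind, 0) + 1
        sid = f"{kind}{counters[kind]}"   # e.g. N1, N2, R1, ...
        id_map[sid] = (kind, text)
        lines.append(f"{sid}: {text}")
    return id_map, "\n".join(lines)
```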
Spurious-Aware Prompting.
To mitigate noise, we design two variants of the selection prompt (in Appendix A.1):
- Standard Selection: The model is asked to output only the IDs of valid causal paths.
- Spurious-Aware Selection (Ours): The model is explicitly instructed to differentiate valid causal links from spurious associations (e.g., coincidental co-occurrence). By forcing the model to articulate (or internally tag) what is not causal, this strategy improves the precision of the final output list $S^{\star}$.
In both cases, the output is directly parsed as the final set of evidence IDs to be retained for generation.
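Parsing the spurious-aware JSON output (the Figure 5 schema) might look like this sketch, which keeps only `precise` IDs that map back to known evidence and discards anything also tagged spurious:

```python
import json

def parse_selection(llm_output, id_map):
    """Extract the retained evidence IDs from the selection JSON.

    `llm_output` follows the Figure 5 schema with 'precise' and
    'ct_precise' lists; `id_map` holds the valid short IDs.
    """
    data = json.loads(llm_output)
    spurious = set(data.get("ct_precise", []))
    return [sid for sid in data.get("precise", [])
            if sid in id_map and sid not in spurious]
```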
<details>
<summary>x8.png Details</summary>

### Visual Description
## Text Document: Entity and Relationship Extraction Task
### Overview
The image presents a set of instructions for extracting entities and relationships from a given text document, along with examples and placeholders for real data input and output. The goal is to identify entities of specified types, extract relevant information about them, identify relationships between pairs of entities, and output the results in a structured format.
### Components/Axes
* **Goal:** Describes the overall objective of the task.
* **Steps:** Outlines the procedure for entity and relationship extraction.
* Step 1: Entity Identification and Information Extraction
* `entity_name`: Name of the entity, capitalized.
* `entity_type`: One of the following types: `{entity_types}`.
* `entity_description`: Comprehensive description of the entity's attributes and activities.
* Format: `"entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>`
* Step 2: Relationship Identification and Information Extraction
* `source_entity`: Name of the source entity (from Step 1).
* `target_entity`: Name of the target entity (from Step 1).
* `relationship_description`: Explanation of the relationship.
* `relationship_strength`: Numeric score indicating relationship strength.
* Format: `"relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>`
* Step 3: Output Generation
* Output in English as a single list.
* Use `{record_delimiter}` as the list delimiter.
* Step 4: Completion
* Output `{completion_delimiter}` when finished.
* **Examples:** Provides sample input and output.
* Example 1:
* `Entity_types`: ORGANIZATION, PERSON
* `Text`: The Verdantis's C...
* `Output`: `"entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which...`
* Example 2: (Incomplete)
* Example 3: (Incomplete)
* **Real Data:** Placeholders for actual input and output.
* `Entity_types`: `{entity_types}`
* `Text`: `{input_text}`
* `Output`: (Placeholder)
### Detailed Analysis or ### Content Details
The document defines a structured approach to extract entities and relationships from text. It specifies the information to be extracted for each entity and relationship, along with the format for representing them. The examples provide a basic illustration of the expected input and output. The "Real Data" section indicates where the actual data should be inserted for processing.
### Key Observations
* The task involves identifying entities and relationships based on predefined entity types.
* The output format is clearly defined using placeholders like `{tuple_delimiter}`, `{record_delimiter}`, and `{completion_delimiter}`.
* The example provided is incomplete, suggesting that the full output would contain more entities and relationships.
* The `relationship_strength` is a numeric score, implying a quantitative assessment of the relationship.
### Interpretation
The document outlines a natural language processing (NLP) task focused on information extraction. The goal is to convert unstructured text into structured data by identifying entities, their types, and the relationships between them. This type of task is crucial for various applications, including knowledge graph construction, information retrieval, and text summarization. The use of delimiters ensures that the extracted information can be easily parsed and processed by downstream systems. The inclusion of a relationship strength score allows for ranking and filtering of relationships based on their importance or confidence level.
</details>
Figure 8: Prompt for LLM-based Information Extraction (modified from GraphRAG (Edge et al., 2024)). Used in Step 1 of Offline Construction.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Text Extraction: Causal Relationship Decision Task
### Overview
The image presents a set of instructions for determining whether a plausible causal relationship exists between two text snippets. It outlines the goal, steps, required output format, and provides a template for real data input.
### Components/Axes
The image is structured into the following sections:
1. **Goal:** Defines the objective of the task.
2. **Steps:** Provides a numbered list of instructions.
3. **Output:** Specifies the expected output format.
4. **Real Data:** Shows the format for input data.
### Detailed Analysis or ### Content Details
**Goal:**
Given two text snippets A and B, decide whether there is any plausible causal relationship between them (either direction) under some reasonable context.
**Steps:**
1. Read A and B, and consider whether one could plausibly influence the other (directly or indirectly).
2. Require a plausible mechanism; ignore mere correlation or co-occurrence.
3. If uncertain or only associative, choose "no".
**Output:**
Return exactly one token: "yes" or "no". No extra text.
`######################`
**Real Data:**
A: {a\_text}
B: {b\_text}
`######################`
**Output:**
### Key Observations
The instructions emphasize the need for a plausible mechanism to establish a causal relationship, discouraging decisions based solely on correlation. The output is restricted to a binary "yes" or "no" response.
### Interpretation
The image describes a task designed to assess causal reasoning between textual inputs. The steps are designed to guide a user or system to consider underlying mechanisms rather than superficial associations. The strict output format suggests an automated evaluation or standardized reporting. The "Real Data" section indicates that the task is intended to be applied to actual text snippets, represented by the placeholders {a\_text} and {b\_text}.
</details>
Figure 9: Prompt for Binary Causal Gate Verification. Used to determine the existence of causal links between module summaries.
Appendix C Visualization of HugRAG's Hierarchical Knowledge Graph
To provide an intuitive demonstration of HugRAG's structural advantages, we present 3D visualizations of the constructed knowledge graphs for two datasets: HotpotQA (see Figure 11) and HolisQA-Biology (see Figure 10). In these visualizations, nodes and modules are arranged in vertical hierarchical layers. The base layer ($H_{0}$), consisting of fine-grained entity nodes, is depicted in grey. The higher-level semantic modules ($H_{1}$ to $H_{4}$) are colored by their respective hierarchy levels. Crucially, the Causal Gates, which bridge topologically distant modules, are rendered as red links. To ensure visual clarity and prevent edge occlusion in this dense representation, we downsampled the causal gates, displaying only a representative subset ($r=0.2$).
<details>
<summary>x10.png Details</summary>

### Visual Description
## Network Diagram: Layered Network Visualization
### Overview
The image presents a layered network diagram. Nodes are grouped into five distinct layers, labeled H0 through H4, each represented by a different color. Connections between nodes are indicated by lines, with a higher density of connections within each layer and some connections spanning across layers. The layers are visually separated by translucent, colored ovals.
### Components/Axes
* **Layers:** The network is structured into five layers, labeled H0, H1, H2, H3, and H4.
* **Nodes:** Each layer contains multiple nodes, represented by small circles.
* **Connections:** Lines connect nodes, indicating relationships or interactions.
* **Color Coding:**
* H4: Blue
* H3: Green
* H2: Yellow/Orange
* H1: Red/Pink
* H0: Gray
### Detailed Analysis or ### Content Details
* **Layer H4 (Blue):** Located at the top of the diagram, this layer contains a cluster of nodes with dense interconnections.
* **Layer H3 (Green):** Situated below H4, this layer also exhibits a cluster of nodes with dense interconnections. There are red lines connecting this layer to the H4 layer above.
* **Layer H2 (Yellow/Orange):** Positioned below H3, this layer shows a less dense cluster of nodes compared to H3 and H4. There are red lines connecting this layer to the H3 layer above.
* **Layer H1 (Red/Pink):** Located below H2, this layer has a sparse distribution of nodes, with some nodes appearing in clusters.
* **Layer H0 (Gray):** Situated at the bottom, this layer contains the largest number of nodes, distributed in several clusters.
**Node Distribution and Connectivity:**
* The top layers (H4 and H3) have a more concentrated node distribution.
* The bottom layers (H0 and H1) have a more dispersed node distribution.
* Connections between layers are primarily indicated by red lines.
### Key Observations
* The network exhibits a hierarchical structure, with layers stacked vertically.
* Node density and connectivity vary across layers.
* The color-coding scheme helps distinguish between layers.
* The red lines indicate connections between layers.
### Interpretation
The network diagram likely represents a system with multiple levels of organization or abstraction. The layers (H0-H4) could represent different levels of a hierarchy, stages in a process, or categories of entities. The connections between nodes indicate relationships or interactions within and between these layers. The higher density of connections in the upper layers (H4 and H3) might suggest a greater degree of integration or coordination at those levels. The sparser distribution of nodes in the lower layers (H0 and H1) could indicate a more diverse or fragmented set of entities at those levels. The red lines connecting the layers suggest dependencies or information flow between the levels.
</details>
Figure 10: A 3D view of the Hierarchical Graph with Causal Gates constructed from HolisQA-biology dataset.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Network Diagram: Hierarchical Network Visualization
### Overview
The image presents a hierarchical network diagram, visually representing relationships between different layers or levels. The diagram consists of several layers of nodes, each layer distinguished by color and labeled as H0, H1, H2, H3, and H4. Connections between nodes across different layers are indicated by lines. The background is colored to emphasize the hierarchical structure.
### Components/Axes
* **Layers:** The diagram has five distinct layers, each represented by a different color.
* H4: Blue
* H3: Green
* H2: Orange
* H1: Pink/Red
* H0: Gray
* **Nodes:** Each layer contains multiple nodes, represented by small circles.
* **Connections:** Lines connect nodes between different layers, indicating relationships or interactions.
* **Background:** The background is colored with horizontal bands of color corresponding to the layers, providing visual separation.
### Detailed Analysis
* **H4 (Blue):** Located at the top of the diagram. The nodes are densely clustered together. Connections extend downwards to the H3 layer.
* **H3 (Green):** Situated below the H4 layer. The nodes are also densely clustered. Connections extend downwards to the H2 layer.
* **H2 (Orange):** Located below the H3 layer. The nodes are more dispersed compared to H3. Connections extend downwards to the H1 layer.
* **H1 (Pink/Red):** Located below the H2 layer. The nodes are more dispersed and less numerous. Connections extend downwards to the H0 layer.
* **H0 (Gray):** Located at the bottom of the diagram. The nodes are grouped into several distinct clusters.
The connections between layers appear to be directed downwards, suggesting a flow or dependency from higher layers to lower layers. The density of nodes decreases from top to bottom, with H4 and H3 having the highest density and H0 having the lowest.
### Key Observations
* The diagram illustrates a hierarchical structure with distinct layers.
* The density of nodes varies across layers, with higher layers having more nodes.
* Connections between layers suggest a flow or dependency from higher to lower layers.
* The H0 layer exhibits clustering, indicating potential subgroups or modules within that layer.
### Interpretation
The network diagram likely represents a system or process with multiple levels of organization. The layers (H4 to H0) could represent different stages, components, or levels of abstraction. The connections between nodes indicate interactions or dependencies between these elements. The decreasing density of nodes from top to bottom might suggest a funneling effect, where information or resources are consolidated as they flow through the system. The clustering in the H0 layer could indicate the presence of distinct functional units or modules at the lowest level of the hierarchy.
</details>
Figure 11: A 3D view of the Hierarchical Graph with Causal Gates constructed from HotpotQA dataset.
Appendix D Case Study: A Real Example of the HugRAG Full Pipeline
To concretely illustrate the full HugRAG pipeline, we present a step-by-step execution trace on a query from the HolisQA-Biology dataset in Figure 12. The query asks for a comparison of specific enzyme activities (Apase vs. Pti-interacting kinase) in oil palm genotypes under phosphorus limitation, a task requiring holistic comprehension of the biology knowledge in the HolisQA dataset.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Flowchart: LLM-Based Question Answering Process
### Overview
The image depicts a flowchart outlining the steps in a Language Model (LLM)-based question answering process. It starts with a query and progresses through several stages, including seed stage, subgraph extraction, causal LLM output, answer generation, and finally, a gold answer. Each stage involves specific processing and information retrieval steps.
### Components/Axes
The flowchart consists of the following components:
1. **Query:** The initial question posed to the system.
2. **Seed Stage:** Matching seeds based on a short ID map.
3. **Post n-hop Subgraph:** Extraction of relevant subgraphs.
4. **Causal LLM output:** Identification of causal graphs and spurious information.
5. **Answer LLM output:** Generation of the final answer.
6. **Gold Answer:** The reference or ideal answer.
Each stage is represented by a rectangular box with text describing the process. Arrows indicate the flow of information from one stage to the next.
### Detailed Analysis or ### Content Details
**1. Query:**
* Text: "How does the activity of acid phosphatase (Apase) and Pti-interacting serine/threonine kinase differ in oil palm genotypes under phosphorus limitation, and what are the implications for their adaptability?"
**2. Seed Stage:**
* Description: Seed matching via short\_id\_map.
* Seeds: \[T2, T4, T6, SP, CAT, ES, ADA, INDONESIA....]
* Examples:
* T2: \[text\_unit, score=0.4615] ols of PE direction and intensity, context-dependent microbial strategies, and the scarcity of long-term C balance assessments...
* T4: \[text\_unit, score=0.4615] activity in P-optimum was higher than starvation and deficiency in leaf and root tissues in both genotypes, whereas Pti serine/t.......
**3. Post n-hop Subgraph:**
* Description: Top subgraph nodes (by combined score).
* Examples:
* E:dce66303-2b2c-472f-a964-da0b5529817d | SP (combined=0.4127)
* E:345eb0d6-56fb-4878-a05c-99f01d528cd8 | CAT (combined=0.3832)
* Sample subgraph edges: ('up' means lower level module to higher level module.)
* struct:7->69:up | Xixia County Cornus Officinalis Pathogen Study in Henan -> Hebei-Liaoning-Yunnan Research Corridor (structural)
* struct:7->69 | Hebei-Liaoning-Yunnan Research Corridor -> Xixia County Cornus Officinalis Pathogen Study in Henan (structural)
* struct:7->68:up | Jehol Biota & Yixian Formation in NE China -> Hebei-Liaoning-Yunnan Research Corridor (structural)
**4. Causal LLM output:**
* Description: Identifying causal graph and spurious information (Full prompt available in appendix).
* Content:
* "precise": \["T1", "T4",.........],
* "ct\_precise": \["T7", "N9", ........],
* "p\_answer": "Under phosphorus limitation in oil palm, Apase activity is expected to be induced (P deprivation), ...................."
**5. Answer LLM output:**
* Description: Generate final answer (Full prompt available in appendix).
* Content:
* Acid phosphatase (Apase): In oil palm, Apase activity in leaves and roots is higher under P optimum than under starvation/deficiency in both genotypes. This indicates Apase is not upregulated by phosphorus limitation in the studied conditions, suggesting it may not be the main adaptive response to P deprivation here.
* Pti-interacting serine/threonine kinase: Activity is higher in prolific versus non-prolific genotypes specifically under P-deficient conditions, indicating...
**6. Gold Answer:**
* Text: Apase activity was higher in P-optimal conditions than in starvation or deficiency for both leaf and root tissues across both oil palm genotypes. Conversely, Pti serine/threonine kinase activity was higher in prolific genotypes compared to non-prolific ones under P-deficient dosage. Additionally, abscisic acid content was higher in prolific genotypes during starvation and deficiency. These findings suggest that the prolific genotype is more adaptable to phosphorus deficiency, potentially.
### Key Observations
* The process starts with a complex query about the activity of acid phosphatase and Pti-interacting serine/threonine kinase in oil palm genotypes under phosphorus limitation.
* The seed stage uses a short ID map to identify relevant seeds.
* The post n-hop subgraph stage extracts subgraphs based on combined scores.
* The causal LLM output identifies causal graphs and spurious information.
* The answer LLM output generates a final answer based on the processed information.
* The gold answer provides a reference or ideal answer for comparison.
### Interpretation
The flowchart illustrates a multi-stage process for answering a complex biological question using LLMs. The process involves identifying relevant information, extracting subgraphs, identifying causal relationships, and generating a final answer. The inclusion of a "gold answer" suggests a benchmark or reference point for evaluating the performance of the LLM-based question answering system. The process highlights the complexity of using LLMs to answer scientific questions and the need for multiple stages of processing and analysis.
</details>
Figure 12: A real example of HugRAG on a biology-related query. The diagram visualizes the data flow from initial seed matching and hierarchical graph expansion to the causal reasoning stage, where the model explicitly filters spurious nodes to produce a grounded, high-fidelity answer.
Appendix E Experiments on the Effectiveness of Causal Gates
To isolate the effect of the causal gate in HugRAG, we conduct a controlled A/B test comparing access to the gold context with the gate disabled (off) versus enabled (on). The evaluation is performed on two datasets: NQ (Standard QA) and HolisQA. We define "Gold Nodes" as the graph nodes mapping to the gold context. Metrics are computed only on examples where gold nodes are mappable to the graph. While this section focuses on structural retrieval metrics, we evaluate the downstream impact of causal gates on final answer quality in our ablation study in Section 5.3.
Metrics.
We report four structural metrics to evaluate retrieval quality and efficiency. Shaded regions in Figure 13 denote 95% bootstrap confidence intervals. Reachability: The fraction of examples where at least one gold node is retrieved in the subgraph. Weighted Reachability (Depth-Weighted): A distance-sensitive metric defined as $\mathrm{DWR}=\frac{1}{1+\mathrm{min\_hops}}$ (0 if unreachable), rewarding retrieval at smaller graph distances. Coverage: The average proportion of total gold nodes retrieved per example. Min Hops: The mean shortest path length to gold nodes, computed on examples reachable in both off and on settings.
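The four metrics above can be sketched per example as follows, assuming each gold node is mapped to its shortest-path distance (hop count) in the retrieved subgraph, with `math.inf` marking an unreachable node. Function and variable names are illustrative, not the paper's implementation.

```python
import math

def structural_metrics(gold_hops):
    """Per-example retrieval metrics from {gold_node: min_hops},
    where math.inf marks a gold node absent from the retrieved subgraph."""
    reached = [h for h in gold_hops.values() if math.isfinite(h)]
    # Reachability: at least one gold node was retrieved.
    reachability = 1.0 if reached else 0.0
    # Depth-weighted reachability: DWR = 1 / (1 + min_hops), 0 if unreachable.
    dwr = 1.0 / (1.0 + min(reached)) if reached else 0.0
    # Coverage: fraction of gold nodes present in the subgraph.
    coverage = len(reached) / len(gold_hops) if gold_hops else 0.0
    # Min Hops: shortest distance to any gold node; averaged across examples
    # only where the node set is reachable in both off and on settings.
    min_hops = min(reached) if reached else None
    return reachability, dwr, coverage, min_hops

# Example: two of three gold nodes retrieved, the nearest at one hop.
print(structural_metrics({"n1": 1, "n2": 3, "n3": math.inf}))
```

Dataset-level numbers in Figure 13 would then be means of these per-example values, with bootstrap resampling over examples giving the shaded 95% intervals.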
As shown in Figure 13, enabling the causal gate yields distinct behaviors across datasets. On the more complex HolisQA dataset, the gate provides a statistically significant improvement in reachability and coverage, confirming that causal edges effectively bridge structural gaps in the graph that are otherwise traversed inefficiently. The increase in Weighted Reachability and the decrease in Min Hops indicate that the gate not only finds more evidence but also creates structural shortcuts, allowing the retrieval process to access evidence at shallower depths.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: Dataset Comparison on Various Metrics
### Overview
The image presents four line charts comparing the performance of the HolisQA Dataset (blue) and the Standard QA Dataset (red) across four metrics: Reachability, W. Reachability, Coverage, and Min Hops. Each chart shows the metric's value when a certain feature is "off" and "on." Shaded regions around the lines indicate the uncertainty or variance in the data.
### Components/Axes
* **Legend:** Located at the top of the image.
* Blue line: HolisQA Dataset
* Red line: Standard QA Dataset
* **X-axis:** Categorical, with two values: "off" and "on."
* **Y-axis:** Numerical, with different scales for each chart.
* **Reachability:** Ranges from approximately 0.7 to 0.95.
* **W. Reachability:** Ranges from approximately 0.5 to 0.85.
* **Coverage:** Ranges from approximately 0.2 to 0.45.
* **Min Hops:** Ranges from approximately 0.3 to 1.5.
* **Chart Titles (X-axis labels):**
* Reachability
* W. Reachability
* Coverage
* Min Hops
### Detailed Analysis
**1. Reachability:**
* **HolisQA Dataset (Blue):** The line slopes upward from "off" to "on."
* "off": approximately 0.79
* "on": approximately 0.85
* **Standard QA Dataset (Red):** The line slopes upward from "off" to "on."
* "off": approximately 0.91
* "on": approximately 0.94
**2. W. Reachability:**
* **HolisQA Dataset (Blue):** The line slopes upward from "off" to "on."
* "off": approximately 0.54
* "on": approximately 0.58
* **Standard QA Dataset (Red):** The line is relatively flat from "off" to "on."
* "off": approximately 0.81
* "on": approximately 0.81
**3. Coverage:**
* **HolisQA Dataset (Blue):** The line slopes upward from "off" to "on."
* "off": approximately 0.27
* "on": approximately 0.37
* **Standard QA Dataset (Red):** The line is relatively flat from "off" to "on."
* "off": approximately 0.18
* "on": approximately 0.19
**4. Min Hops:**
* **HolisQA Dataset (Blue):** The line slopes downward from "off" to "on."
* "off": approximately 1.05
* "on": approximately 0.82
* **Standard QA Dataset (Red):** The line is relatively flat from "off" to "on."
* "off": approximately 0.38
* "on": approximately 0.40
### Key Observations
* For Reachability and W. Reachability, the Standard QA Dataset generally has higher values than the HolisQA Dataset.
* For Coverage, the HolisQA Dataset has higher values than the Standard QA Dataset.
* For Min Hops, the HolisQA Dataset has significantly higher values than the Standard QA Dataset.
* The "on" setting generally improves Reachability and Coverage for the HolisQA Dataset.
* The "on" setting decreases Min Hops for the HolisQA Dataset.
* The "on" setting has little impact on the Standard QA Dataset across all metrics.
### Interpretation
The charts compare the performance of two question-answering datasets (HolisQA and Standard QA) across four different metrics when a certain feature is either "off" or "on." The data suggests that the HolisQA dataset benefits more from the "on" setting in terms of Reachability, Coverage, and Min Hops, while the Standard QA dataset remains relatively stable regardless of the setting. The higher Reachability and W. Reachability values for the Standard QA dataset might indicate a more comprehensive knowledge base or better retrieval capabilities. The higher Coverage for HolisQA might suggest a broader range of topics or question types it can handle. The higher Min Hops for HolisQA could indicate that it requires more steps or reasoning to answer questions. The shaded regions provide a visual representation of the variability in the data, which should be considered when interpreting the results.
</details>
Figure 13: Experiments on Causal Gate effectiveness. We compare graph traversal performance with the causal gate disabled (off) versus enabled (on). Shaded areas represent 95% bootstrap confidence intervals. The causal gate significantly improves evidence accessibility (Reachability, Coverage) and traversal efficiency (lower Min Hops, higher Weighted Reachability).
Appendix F Evaluation Details
F.1 Detailed Graph Statistics
We provide the complete statistics for all knowledge graphs constructed in our experiments. Table 5 details the graph structures for the five standard QA datasets, while Table 6 covers the five scientific domains within the HolisQA dataset.
Table 5: Graph Statistics for Standard QA Datasets. Detailed breakdown of nodes, edges, and hierarchical module distribution.
| Dataset | Nodes | Edges | L3 | L2 | L1 | L0 | Modules | Domain | Chars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HotpotQA | 20,354 | 15,789 | 27 | 1,344 | 891 | 97 | 2,359 | Wikipedia | 2,855,481 |
| MS MARCO | 3,403 | 3,107 | 2 | 159 | 230 | 55 | 446 | Web | 1,557,990 |
| NQ | 5,579 | 4,349 | 2 | 209 | 244 | 50 | 505 | Wikipedia | 767,509 |
| QASC | 77 | 39 | - | - | - | 4 | 4 | Science | 58,455 |
| 2WikiMultiHop | 10,995 | 8,489 | 8 | 461 | 541 | 78 | 1,088 | Wikipedia | 1,756,619 |
Table 6: Graph Statistics for HolisQA Datasets. Graph structures constructed from dense academic papers across five scientific domains.
| Dataset | Nodes | Edges | L3 | L2 | L1 | L0 | Modules | Domain | Chars |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Holis-Biology | 1,714 | 1,722 | - | 30 | 104 | 31 | 165 | Biology | 1,707,489 |
| Holis-Business | 2,169 | 2,392 | 8 | 77 | 166 | 41 | 292 | Business | 1,671,718 |
| Holis-CompSci | 1,670 | 1,667 | 7 | 28 | 91 | 30 | 158 | CompSci | 1,657,390 |
| Holis-Medicine | 1,930 | 2,124 | 7 | 56 | 129 | 34 | 226 | Medicine | 1,706,211 |
| Holis-Psychology | 2,019 | 1,990 | 5 | 45 | 126 | 35 | 211 | Psychology | 1,751,389 |
F.2 HolisQA Dataset
We introduce HolisQA, a comprehensive dataset designed to evaluate the holistic comprehension capabilities of RAG systems, explicitly addressing the "node finding" bias prevalent in existing QA datasets, where retrieving a single entity (e.g., a year or name) is often sufficient. Our goal is to enforce holistic comprehension, compelling models to synthesize coherent evidence from multi-sentence contexts.
We collected high-quality scientific papers across multiple domains as our primary source (Priem et al., 2022), focusing exclusively on recent publications (2025) to minimize parametric memorization by the LLM. The dataset spans five distinct domains (Biology, Business, Computer Science, Medicine, and Psychology) to ensure domain robustness (see full statistics in Table 6). To necessitate cross-sentence reasoning, we avoid random sentence sampling; instead, we extract contiguous text slices from papers within each domain. Each slice is sufficiently long to encapsulate multiple interacting claims (e.g., Problem → Method → Result) yet short enough to remain self-contained, thereby preserving the logical coherence and contextual foundation required for complex reasoning. Subsequently, we employ a rigorous LLM-based generation pipeline to create Question-Answer-Context triples, imposing two strict constraints (as detailed in Figure 14):
1. Integration Constraint: The question must require integrating information from at least three distinct sentences. We explicitly reject trivia-style questions that can be answered by a single named entity (e.g., "Who founded X?").
2. Evidence Verification: The generation process must output the IDs of all supporting sentences. We validate the dataset via a necessity check, verifying that the correct answer cannot be derived if any of the cited sentences are removed.
Through this strict construction pipeline, HolisQA effectively evaluates the model's holistic comprehension while isolating it from parametric knowledge, providing a cleaner signal for assessing the effectiveness of structured retrieval mechanisms.
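The necessity check described above can be sketched as a leave-one-out loop; `is_answerable` is a hypothetical stand-in for the LLM judge that decides whether the answer is derivable from a given context, and all names here are illustrative rather than the paper's implementation.

```python
def necessity_check(question, answer, sentences, cited_ids, is_answerable):
    """Leave-one-out validation: keep a QA item only if the full cited
    evidence is sufficient AND every cited sentence is necessary.
    `is_answerable(context, question, answer)` is a hypothetical judge
    (e.g., an LLM call) returning True/False."""
    full_context = [sentences[i] for i in cited_ids]
    if not is_answerable(full_context, question, answer):
        return False  # cited evidence is not even sufficient
    for i in cited_ids:
        reduced = [sentences[j] for j in cited_ids if j != i]
        if is_answerable(reduced, question, answer):
            return False  # sentence i is not necessary for the answer
    return True
```

Items failing this check would be discarded or regenerated, so that every surviving triple genuinely requires all of its cited sentences.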
<details>
<summary>x14.png Details</summary>

### Visual Description
## Text Document: Reading Comprehension Dataset Instructions
### Overview
The image is a text document providing instructions for building a reading comprehension dataset. It outlines the format of the data, the type of questions to generate, and the structure of the JSON items.
### Components/Axes
The document contains the following sections:
1. **Introduction**: Explains the purpose of the dataset and the format of the input sentences.
2. **Question Generation**: Specifies the type of questions to generate and what to avoid.
3. **JSON Item Structure**: Defines the required fields for each JSON item.
4. **Sentences**: Indicates where the slice text should be placed.
### Detailed Analysis or ### Content Details
The document contains the following text:
"You are building a reading-comprehension dataset.
You will receive a slice of sentences from a long document. Each line starts with a sentence ID, a tab, then the sentence text.
Generate {qas_per_run} question-answer pairs in JSON array format. Questions must require multi-sentence reasoning and an understanding of the overall slice. Avoid short factual questions, named-entity trivia, or single-sentence lookups.
Each JSON item must include:
* "question": string
* "answer": string (2-4 sentences)
* "context_sentence_ids": array of {min_context}-{max_context} IDs drawn only from the provided slice
Return JSON only, no extra text.
Sentences:
{slice_text}"
### Key Observations
* The dataset consists of question-answer pairs generated from slices of sentences.
* Questions should require multi-sentence reasoning.
* Each JSON item must include "question", "answer", and "context_sentence_ids".
* The answer should be 2-4 sentences long.
* The context sentence IDs should be drawn from the provided slice.
### Interpretation
The document provides clear instructions for creating a reading comprehension dataset. The emphasis on multi-sentence reasoning suggests that the dataset is designed to evaluate a model's ability to understand context and relationships between sentences. The JSON format ensures that the data is structured and easily parsable. The use of placeholders like `{qas_per_run}`, `{min_context}`, `{max_context}`, and `{slice_text}` indicates that these values will be dynamically populated during the dataset creation process.
</details>
Figure 14: Prompt for generating the Holistic Comprehension Dataset (Question-Answer-Context Triplets) from academic papers.
F.3 Implementation
Backbone Models.
We consistently use OpenAI's gpt-5-nano with a temperature of 0.0 to ensure deterministic generation. For vector embeddings, we employ the Sentence-BERT (Reimers and Gurevych, 2019) version of all-MiniLM-L6-v2 with a dimensionality of 384. All evaluation metrics involving LLM-as-a-judge are implemented using the Ragas framework (Es et al., 2024), with Gemini-2.5-Flash-Lite serving as the underlying evaluation engine.
Baseline Parameters.
To ensure a fair comparison among all graph-based RAG methods, we utilize a unified root knowledge graph (see Appendix B.1 for construction details). For the retrieval stage, we set a consistent initial $k=3$ across all baselines. Other parameters are kept at their default values to maintain a neutral comparison, with the exception of method-specific configurations (e.g., global vs. local modes in GraphRAG) that are essential for the algorithm's execution. All experiments were conducted on a high-performance computing cluster managed by Slurm. Each evaluation task was allocated uniform resources consisting of 2 CPU cores and 16 GB of RAM, utilizing 10-way job arrays for concurrent query processing.
F.4 Grounding Metrics and Evaluation Prompts
We assess performance using two categories of metrics: (i) Lexical Overlap (F1 score), which measures surface-level similarity between model outputs and gold answers; and (ii) LLM-as-judge metrics, specifically Context Recall and Answer Relevancy, computed using a fixed evaluator model to ensure consistency (Es et al., 2024). To guarantee stable and fair comparisons across baselines with varying retrieval outputs, we impose a uniform cap on the retrieved context length and the number of items passed to the evaluator. The specific prompt template used for assessing Answer Relevancy is illustrated in Figure 15.
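The uniform cap on retrieved context can be sketched as a simple budget-truncation step applied before the evaluator call; the item and character budgets below are illustrative placeholders, not the paper's settings.

```python
def cap_context(items, max_items=10, max_chars=8000):
    """Truncate a ranked list of retrieved passages to a fixed item count and
    total character budget, so every baseline feeds the judge a comparable
    amount of context. Budget values are illustrative only."""
    capped, used = [], 0
    for text in items[:max_items]:
        if used + len(text) > max_chars:
            remaining = max_chars - used
            if remaining > 0:
                capped.append(text[:remaining])  # keep a partial final passage
            break
        capped.append(text)
        used += len(text)
    return capped
```

Applying the same cap to every baseline keeps the LLM-as-judge scores comparable even when retrievers return outputs of very different lengths.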
<details>
<summary>x15.png Details</summary>

### Visual Description
## Text Extraction: Prompt Engineering Examples
### Overview
The image presents examples of prompt engineering for a system designed to generate questions from given answers and identify if an answer is noncommittal. It includes a core template, an answer relevancy prompt, and several input-output examples.
### Components/Axes
The image is structured into three main sections:
1. **Core Template:** Defines the instruction for the system to return output in JSON format, adhering to a specified JSON schema. It emphasizes the use of double quotes and proper escaping.
2. **Answer Relevancy Prompt:** Explains how to generate a question from a given answer and identify if the answer is noncommittal. It defines noncommittal answers as evasive, vague, or ambiguous, assigning a value of 1, while substantive answers are assigned a value of 0. Examples of noncommittal answers are provided.
3. **Examples:** Provides input-output pairs demonstrating the system's functionality. Each input consists of a "response," and the corresponding output includes a "question" and a "noncommittal" value (0 or 1).
### Detailed Analysis or ### Content Details
**Core Template:**
* `{instruction}`: Placeholder for instructions.
* "Please return the output in a JSON format that complies with the following schema as specified in JSON Schema: {output\_schema}Do not use single quotes in your response but double quotes, properly escaped with a backslash."
* `{examples}`: Placeholder for examples.
* "Now perform the same with the following input"
* `input: {input_json}`
* `Output:`
**Answer Relevancy Prompt:**
* "Generate a question for the given answer and identify if the answer is noncommittal."
* "Give noncommittal as 1 if the answer is noncommittal (evasive, vague, or ambiguous) and 0 if the answer is substantive."
* "Examples of noncommittal answers: "I don't know", "I'm not sure", "It depends"."
**Examples:**
* **Example 1:**
* Input: `{"response": "Albert Einstein was born in Germany."}`
* Output: `{"question": "Where was Albert Einstein born?", "noncommittal": 0}`
* **Example 2:**
* Input: `{"response": "The capital of France is Paris, a city known for its architecture and culture."}`
* Output: `{"question": "What is the capital of France?", "noncommittal": 0}`
* **Example 3:**
* Input: `{"response": "I don't know about the groundbreaking feature of the smartphone invented in 2023 as I am unaware of information beyond 2022."}`
* Output: `{"question": "What was the groundbreaking feature of the smartphone invented in 2023?", "noncommittal": 1}`
### Key Observations
* The core template sets the format for the system's output.
* The answer relevancy prompt defines the criteria for identifying noncommittal answers.
* The examples demonstrate how the system generates questions and assigns noncommittal values based on the input responses.
* The system correctly identifies substantive answers (e.g., "Albert Einstein was born in Germany") and assigns a noncommittal value of 0.
* The system correctly identifies noncommittal answers (e.g., "I don't know...") and assigns a noncommittal value of 1.
### Interpretation
The image illustrates a prompt engineering approach for building a system that can generate questions from answers and assess the relevancy or commitment level of those answers. The system is designed to distinguish between substantive and noncommittal responses, which could be useful in various applications such as question answering, information retrieval, and dialogue systems. The examples provided demonstrate the system's ability to generate relevant questions and accurately classify answers as either substantive or noncommittal.
</details>
Figure 15: Example prompt used in RAGAS: Core Template and Answer Relevancy (Es et al., 2024).