# Grounding LLM Reasoning with Knowledge Graphs
**Authors**: Alfonso Amayuelas, UC Santa Barbara
> Work completed during an internship at JP Morgan AI Research
Foundations of Reasoning in Language Models
## Abstract
Large Language Models (LLMs) excel at generating natural language answers, yet their outputs often remain unverifiable and difficult to trace. Knowledge Graphs (KGs) offer a complementary strength by representing entities and their relationships in structured form, providing a foundation for more reliable reasoning. We propose a novel framework that integrates LLM reasoning with KGs by linking each step of the reasoning process to graph-structured data. This grounding turns intermediate "thoughts" into interpretable traces that remain consistent with external knowledge. Our approach incorporates multiple reasoning strategies: Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), and is evaluated on GRBench, a benchmark for domain-specific graph reasoning. Our experiments show state-of-the-art (SOTA) performance, with at least 26.5% improvement over CoT baselines. Beyond accuracy, we analyze how step depth, branching structure, and model size influence reasoning quality, offering insights into the conditions that support effective reasoning. Together, these contributions highlight how grounding LLMs in structured knowledge enables both higher accuracy and greater interpretability in complex reasoning tasks.
## 1 Introduction
LLMs have shown remarkable versatility in answering questions posed in natural language. This is mainly due to their ability to generate text, their broad internal knowledge, and their capacity to access external information (toolqa; rag). However, a significant area for improvement is their tendency to produce information that, while plausible-sounding, is often unverifiable and lacks traceable origins and sources (hallucination_survey). This limitation highlights a deeper issue in how LLMs organize and apply reasoning, especially when reliable and accountable outputs are required.
The LLM generation process heavily relies on their internal parameters, making it difficult to link their outputs to external sources (llm_internal_knowledge; llm_kb). This limitation challenges their reliability in industrial applications (llm_explainability_survey). In applied settings, where LLMs handle critical operations, integrating them with domain-specific knowledge is essential. Fine-tuning LLMs for new domains is labor-intensive, especially for companies with proprietary data facing privacy and legal issues. As a result, there is a need for interventions that guide reasoning processes so that outputs remain accurate and transparent without requiring exhaustive re-training.
Methods such as Retrieval-Augmented Generation (RAG) (rag) and SQL-based querying (llm_eval_sql) address this gap partially. However, they often fail to capture the dynamic relationships between concepts that are necessary for comprehensive understanding. These approaches typically assume that knowledge is well-represented in discrete units, such as documents or tables, which can lead to incomplete insights when dealing with interconnected knowledge that spans multiple sources. This limits the ability to support reasoning over complex queries.
<details>
<summary>figures/general.png Details</summary>

### Visual Description
## Diagram: Process Flow for Answering a Knowledge Graph Query
### Overview
The image is a flowchart diagram illustrating a two-column, multi-step process for answering the question: **"What is the most frequently cited artwork by Mark Brunswick?"** It depicts a parallel workflow between an "Agent" (left column) and an "Automatic Exploration" system (right column), both interacting with a central "Knowledge Graph" to iteratively find the answer.
### Components/Axes
The diagram is structured into three main vertical sections:
1. **Left Column (Agent):** A light gray rounded rectangle labeled "Agent". It contains a sequence of steps (Step 1, Step 2, ... Step n), each with an icon, an `[Action]` box, and an `[Observation]` box.
2. **Center Column (Knowledge Graph):** A vertical series of identical "Knowledge Graph" icons (a network of connected nodes). Double-headed arrows (orange pointing left, blue pointing right) connect each Knowledge Graph icon to the corresponding step in both the Agent and Automatic Exploration columns.
3. **Right Column (Automatic Exploration):** A light gray rounded rectangle labeled "Automatic Exploration". It mirrors the step structure of the Agent column, containing `[Entities]` and `[Triples]` boxes for each step.
**Connecting Elements:**
* Horizontal gray arrows labeled "Thought 1", "Thought 2", ... "Thought n" connect the Agent's observation/thought to the Automatic Exploration's corresponding step.
* A large, curved gray arrow at the top points from the initial question to the start of both columns.
### Detailed Analysis
The process is broken down into discrete steps:
**Step 1:**
* **Agent:**
* **Thought/Context:** "1. The question is asking for information about Mark Brunswick"
* **Action:** `RetrieveNode[Mark Brunswick]`
* **Observation:** "The node ID is 83029"
* **Automatic Exploration:**
* **Thought/Context:** "1. The question is asking for information about Mark Brunswick"
* **Entities:** `Mark Brunswick`
* **Triples:** `(Mark Brunswick, authorOf, Nocturne and Rondo) (Mark Brunswick, authorOf, Symphony in B♭)`
**Step 2:**
* **Agent:**
* **Thought/Context:** "2. The node corresponding to Mark Brunswick is 8309. Now we need to find his works"
* **Action:** `NeighborCheck[8309, works]`
* **Observation:** "The Neighbors are Nocturne and Rondo, Symphony in B♭ ..."
* **Automatic Exploration:**
* **Thought/Context:** "2. Mark Brunswick authored The Master Builder, Symphony in B♭, ..."
* **Entities:** `The Master Builder, Symphony in B♭, ...`
* **Triples:** `(The Master Builder, citedBy, ) (Mark Brunswick, authorOf, Symphony in B♭)`
**Step n (Final Step):**
* **Agent:**
* **Thought/Context:** "n. The most cited work of Mark Brunswick is The Master Builder"
* **Action:** `Finish[The Master Builder]`
* **Automatic Exploration:**
* **Thought/Context:** "n. The work most frequently cited of Mark Brunswick is The Master Builder"
* **[Is the end?]:** `The Master Builder`
### Key Observations
1. **Parallel Processing:** The diagram emphasizes a dual-track approach. The "Agent" performs explicit, programmatic actions (RetrieveNode, NeighborCheck, Finish) on a specific node ID, while "Automatic Exploration" performs semantic analysis, extracting entities and relationship triples from the knowledge graph.
2. **Iterative Refinement:** The process is not a single lookup. It starts with identifying the entity (Mark Brunswick), then explores his relationships (authorOf works), and finally determines the most cited work, implying a subsequent step (not fully detailed) to analyze "citedBy" relationships for the discovered works.
3. **Knowledge Graph as Central Hub:** The Knowledge Graph is the shared data source. Both processes query it, but the Agent uses a structured, ID-based interface, while Automatic Exploration uses a semantic, triple-based interface.
4. **Convergence on Answer:** Both columns independently arrive at the same conclusion in Step n: "The Master Builder" is the most frequently cited work.
5. **Visual Coding:** Actions in the Agent column are in orange-bordered boxes. Entities and Triples in the Automatic Exploration column are in blue-bordered boxes. Thoughts are in dark gray boxes.
### Interpretation
This diagram illustrates a **hybrid human-AI or multi-agent system for complex question answering over a knowledge graph**. It demonstrates how a high-level natural language question is decomposed into a series of concrete, executable sub-tasks.
* **The "Agent"** represents a procedural, step-by-step reasoning engine that manipulates graph nodes by their identifiers. Its actions are precise but require knowing the internal node structure (e.g., the ID for "Mark Brunswick" is 83029).
* **The "Automatic Exploration"** represents a semantic understanding layer. It interprets the question in context, extracts relevant entities and relationships (triples) from the graph, and helps identify the next logical sub-question (e.g., after finding works, the next implicit question is "which is most cited?").
* **The Workflow** shows that answering "most frequently cited" requires at least two hops in the graph: first, find works by the author (authorOf), then, for each work, find citation counts (citedBy). The diagram simplifies the final aggregation step but clearly shows the path to the answer.
* **The Notable Anomaly/Insight:** The initial triples in Step 1 list "Nocturne and Rondo" and "Symphony in B♭", but the final answer is "The Master Builder". This implies that "The Master Builder" was discovered in a subsequent, unshown step of neighbor exploration or triple analysis, and that it had a higher citation count than the initially listed works. This highlights the iterative and sometimes non-linear nature of graph traversal.
In essence, the diagram is a blueprint for an investigative process where structured data retrieval and semantic reasoning collaborate to solve a multi-faceted query.
</details>
Figure 1: Methods for Question-Answering in KGs (Section 4). Left: Agent, where the LLM chooses one of the predefined actions to interact with the graph. Right: Automatic Graph Exploration, where entities are extracted at each reasoning step, triggering a search for each identified entity.
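To make the Agent interface in Figure 1 concrete, the following is a minimal sketch of the action set it describes (`RetrieveNode`, `NeighborCheck`, `Finish`). The class name, data layout, and toy graph are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the agent's graph actions from Figure 1.
# The class name, storage format, and toy data are illustrative assumptions.
class KGAgentEnv:
    def __init__(self, name_to_id, adjacency):
        self.name_to_id = name_to_id    # entity name -> node id
        self.adjacency = adjacency      # node id -> {relation: [neighbor names]}

    def retrieve_node(self, name):
        """RetrieveNode[name]: look up the graph node id for an entity."""
        return self.name_to_id.get(name)

    def neighbor_check(self, node_id, relation):
        """NeighborCheck[id, relation]: list neighbors along one relation."""
        return self.adjacency.get(node_id, {}).get(relation, [])

# Toy graph mirroring the figure's running example.
env = KGAgentEnv(
    name_to_id={"Mark Brunswick": 83029},
    adjacency={83029: {"works": ["Nocturne and Rondo", "Symphony in B-flat"]}},
)
node = env.retrieve_node("Mark Brunswick")
works = env.neighbor_check(node, "works")
```

In the full method, the LLM emits one action per step and the returned observation is appended to its context; `Finish[answer]` simply terminates the loop with the answer string.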
KGs capture such relationships by organizing entities and their connections in a structured representation. Recent work has begun to explore how KGs can guide reasoning in LLMs (rog). Building on this direction, we propose a framework that integrates reasoning strategies with domain-specific KGs from the GRBench dataset. Each reasoning step is explicitly connected to the graph, producing answers that are both accurate and traceable. We evaluate three strategies, CoT, ToT, and GoT, which provide different ways of organizing reasoning traces (Figure 2). Our framework achieves SOTA performance on GRBench, improving by at least 26.5% over CoT baselines. Beyond accuracy, we analyze how reasoning behaviors vary with step depth, branching, and model size (Section 7), offering insights into how reasoning evolves during inference and how it can be shaped by structured knowledge. Our contributions can be summarized as follows:
- We present a versatile framework that links reasoning steps with graph search, providing an intervention that can be applied across domains and knowledge settings.
- We show substantial gains over existing reasoning strategies, with state-of-the-art performance on GRBench and an improvement of at least 26.5% compared to CoT baselines.
- We conduct a detailed study of how reasoning quality is influenced by step depth, branching structure, and model size, offering insights into the conditions under which different interventions succeed (Section 7).
- By systematically grounding each step in a knowledge graph, we improve both the accuracy and the traceability of outputs, creating reasoning traces that are interpretable and verifiable.
## 2 Related Work
LLMs require substantial data and resources for training (villalobos2024will). Retrieval-Augmented Generation (RAG) enables models to incorporate external evidence at inference time (rag; demonstrate-search-predict). Recent work further combines RAG with structured knowledge, such as ontologies and KGs, to improve factuality and reasoning (li2024structrag), underscoring the growing importance of structured data for robust and domain-adaptable LLMs.
**Structured Knowledge**
Structured knowledge, such as databases or KGs, provides organizations with reliable sources of information that can be systematically maintained and automatically updated. KGs in particular offer an adaptable knowledge model that captures complex relationships between interconnected concepts. Research has explored models that can interact with multiple types of structured knowledge, such as StructLM (structLM), and approaches that incorporate structured knowledge during pretraining to improve model performance (skill_llm_structured_knowledge_infusion).
**Integrating KGs with LLMs**
The integration of KGs with LLMs has emerged as a promising direction to strengthen reasoning capabilities and reliability (peng2024graph). In general, four primary methods can be distinguished: (1) learning graph representations (let_graph_do_thinking; graphllm), though latent representations often underperform compared to text-based methods on Knowledge Graph Question Answering (KGQA) tasks; (2) using Graph Neural Network (GNN) retrievers to extract relevant entities and feeding them as text-based input to the model (g-retriever; gnn-rag); (3) generating code, such as SPARQL queries, to directly retrieve information from graphs (kb-binder); and (4) step-by-step interaction methods that allow iterative reasoning over graphs (rog; decaf; chatkbqa), which currently achieve the strongest results on KGQA benchmarks.
**LLM Reasoning with Graphs**
Beyond retrieval, KGs have also been studied as a means to structure and analyze the reasoning processes of LLMs themselves (wang2024understanding). This integration enables more coherent and contextually relevant outputs while also supporting the tracing and verification of reasoning steps. The most effective methods typically rely on interactive, step-by-step engagement between LLMs and graphs, as discussed above. Examples of this approach include systems such as (think-on-graph; mindmap; rog; kg-gpt; li2025cot), which demonstrate improved reasoning performance through graph-based scaffolding. More recent work has further investigated the integration of traditional reasoning strategies, such as CoT and tree-structured reasoning, into KG-based interaction (graphCoT; tree-of-traversals).
Building on these advances, our framework integrates established reasoning strategies directly with domain-specific KGs. Unlike previous methods that treat KGs as retrieval tools or rely on latent representations, our approach systematically links each reasoning step to graph entities and relations.
## 3 Background
In this section, we formalize the prerequisite knowledge relevant to this paper. We use $p_\theta$ to denote a pre-trained language model with parameters $\theta$, and letters $x, y, z$ to refer to language sequences, where $x = (x_1, x_2, \ldots, x_n)$ and each token $x_i$ is generated such that $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{1 \ldots i-1})$.
**Knowledge Graphs (KGs)**
A KG is a heterogeneous directed graph that contains factual knowledge to model structured information. Nodes represent entities, events, or concepts, while edges represent the connections and types of relations between them. Formally, a KG is represented as $G$, defined by a set of triples $G=\{(h,r,t)\mid h,t\in E, r\in R\}$, where $E$, $R$ denote the set of entities and relations, respectively. KGs provide a structured framework that can guide reasoning processes by explicitly representing the relationships between concepts.
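The triple-set definition above can be sketched directly; the entity names and relations below are illustrative, not drawn from GRBench itself.

```python
# Minimal sketch: a KG as a set of (head, relation, tail) triples,
# with the entity set E and relation set R derived from it.
# Entity and relation names are toy examples.
G = {
    ("Mark Brunswick", "authorOf", "Symphony in B-flat"),
    ("Mark Brunswick", "authorOf", "Nocturne and Rondo"),
}
E = {h for h, _, _ in G} | {t for _, _, t in G}   # entities
R = {r for _, r, _ in G}                          # relations

def neighbors(graph, head, relation):
    """Tails reachable from `head` via `relation` (one-hop traversal)."""
    return {t for h, r, t in graph if h == head and r == relation}
```

One-hop traversal of this structure is the basic primitive that both interaction methods in Section 4 build on.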
**Knowledge Graph Question-Answering (KGQA)**
KGQA is a reasoning task that leverages KGs. Given a natural language question, $q$, and an associated KG, $G$, the goal is to develop a method that retrieves the correct answer, $a$, based on the knowledge extracted from the KG: $a=f(q,G)$. Beyond retrieving facts, KGQA often requires integrating multiple reasoning steps that traverse the graph to connect related concepts.
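For intuition, $a=f(q,G)$ for the paper's running multi-hop example ("most frequently cited artwork by Mark Brunswick") can be written as a two-hop toy function. The graph contents are illustrative, and a real system would use an LLM to plan these hops rather than hard-code them.

```python
from collections import Counter

# Toy two-hop KGQA sketch of a = f(q, G): first hop finds works via
# authorOf, second hop counts citedBy edges. Data is illustrative.
def most_cited_work(author, G):
    works = {t for h, r, t in G if h == author and r == "authorOf"}
    citations = Counter(h for h, r, t in G if r == "citedBy" and h in works)
    return citations.most_common(1)[0][0]

G = {
    ("Mark Brunswick", "authorOf", "The Master Builder"),
    ("Mark Brunswick", "authorOf", "Symphony in B-flat"),
    ("The Master Builder", "citedBy", "Paper A"),
    ("The Master Builder", "citedBy", "Paper B"),
    ("Symphony in B-flat", "citedBy", "Paper C"),
}
```

The point of the sketch is that the answer is not stored as a single fact: it only emerges by composing traversal (hop 1) with aggregation over a second relation (hop 2).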
**Step-by-step Reasoning with LLMs**
To improve the reasoning capabilities of LLMs at inference time, a common approach is to generate intermediate reasoning steps. The key idea is the introduction of intermediate steps $z_1, \ldots, z_n$, sampled from $p_\theta$, that bridge the question $q$ and the answer $a$. This decomposition allows models to tackle complex, multi-step problems incrementally, focusing computational effort on the parts of the reasoning chain that require deeper analysis. Stepwise reasoning over KGs offers a natural mechanism to track, guide, and interpret the reasoning process.
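The stepwise decomposition can be sketched as a loop that accumulates thoughts $z_1, \ldots, z_n$ until a terminating answer is produced. Here `generate_step` is a hypothetical placeholder for one conditioned LLM call, and the `Finish[...]` convention mirrors the agent action in Figure 1.

```python
# Sketch of stepwise reasoning: each thought z_i is generated conditioned
# on the question and all previous thoughts, until a Finish[...] step.
# `generate_step` is a hypothetical stand-in for an LLM call.
def reason(q, generate_step, max_steps=5):
    thoughts = []
    for _ in range(max_steps):
        z = generate_step(q, thoughts)          # z_i ~ p_theta(z | q, z_1..z_{i-1})
        if z.startswith("Finish["):
            return z[len("Finish["):-1], thoughts
        thoughts.append(z)
    return None, thoughts                       # step budget exhausted

# Scripted step generator standing in for the model.
script = iter([
    "The question asks about Mark Brunswick's works",
    "Compare citation counts of each work",
    "Finish[The Master Builder]",
])
answer, trace = reason("Most cited work by Mark Brunswick?",
                       lambda q, ts: next(script))
```

The framework in Section 4 grounds each iteration of this loop in the KG, so every $z_i$ can be checked against retrieved entities and triples.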
## 4 Method
This work demonstrates how progressively conditioning LLM reasoning at each step can enhance performance on domain-specific question answering over knowledge graphs. By structuring reasoning into incremental steps that interact with graph data, the model can manage complex dependencies and dynamically refine its conclusions. Our method combines LLM reasoning strategies (CoT, ToT, and GoT) with two graph-interaction methods: (1) Agent, an agent that navigates the graph; and (2) Automatic Graph Exploration, an automatic graph-traversal mechanism driven by the generated text.
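The Automatic Graph Exploration interaction can be sketched as follows: after each generated thought, entities mentioned in the thought are matched against the KG and their triples are appended to the context for the next step. The matching here is naive substring lookup and `llm` is a hypothetical callable; both are simplifying assumptions for illustration.

```python
# Sketch of Automatic Graph Exploration: each thought triggers entity
# matching and triple retrieval, which condition the next step.
# `llm` is a hypothetical stand-in for the model; entity matching is
# naive substring lookup for illustration.
def explore(question, llm, triples, max_steps=3):
    context = question
    for step in range(1, max_steps + 1):
        thought = llm(context)
        entities = {e for (h, _, t) in triples for e in (h, t) if e in thought}
        retrieved = [tr for tr in triples if tr[0] in entities or tr[2] in entities]
        context += f"\nThought {step}: {thought}\nTriples: {retrieved}"
    return context

triples = [("Mark Brunswick", "authorOf", "Symphony in B-flat")]
trace = explore("What did Mark Brunswick write?",
                lambda ctx: "Look up Mark Brunswick in the graph",
                triples, max_steps=2)
```

Unlike the Agent, this loop requires no predefined action vocabulary: the free-form thought itself determines which part of the graph is searched next.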
<details>
<summary>figures/cot.jpg Details</summary>

### Visual Description
## Flowchart: Iterative Thinking Process
### Overview
The image is a vertical flowchart illustrating a sequential, multi-stage thinking process. It begins with an "Input," undergoes three identical "Thought" iterations (each involving a network-like diagram), and concludes with an "Answer."
### Components
- **Input**: A top-placed oval with a gray border, light gray fill, and the text "Input" in black (serving as the start of the process).
- **Thought Boxes**: Three rectangular boxes (with gray borders and white fills) are aligned vertically. Each box contains the text "Thought" (in black) and a yellow plus sign (+) to the right.
- **Network Diagrams**: Adjacent to each "Thought" box is a network-style diagram. It consists of nodes: a central green node, surrounded by light blue and dark blue nodes, all connected by gray lines (suggesting a neural network or knowledge-graph-like structure for information processing).
- **Arrows**: Gray downward-pointing arrows connect the components in the order: `Input → Thought 1 → Thought 2 → Thought 3 → Answer`, defining the flow direction.
- **Answer**: A bottom-placed rounded rectangle with a gray border, light gray fill, and the text "Answer" in dark blue (representing the process's output).
### Detailed Analysis
- **Flow and Repetition**: The process is vertical (top-to-bottom). The three "Thought" stages are visually identical (same box design, same plus sign, and same network diagram), indicating an iterative processing approach with a consistent input structure for each "Thought" phase.
- **Visual Consistency**: The color scheme is consistent: gray for borders and arrows, yellow for the plus sign, dark blue/light blue/green for network nodes, black for "Thought" text, and dark blue for "Answer" text.
- **Network Diagrams**: The network diagrams (with a central green node, light blue middle-layer nodes, and dark blue outer-layer nodes) imply a layered information-processing model (e.g., core idea, supporting information, and external knowledge integration).
### Key Observations
- **Iterative Design**: The three "Thought" stages are identical, suggesting repeated processing with the same input-processing framework.
- **Color-Coded Nodes**: The node colors (green center, light blue, dark blue) likely represent different data layers or types of information (e.g., core concepts, supplementary data, and external knowledge).
- **Flowchart Conventions**: The oval (Input), rectangles (Thought), and rounded rectangle (Answer) follow standard flowchart symbol conventions for start, process, and end/result, respectively.
### Interpretation
This diagram models a multi-step reasoning process (e.g., AI chain-of-thought reasoning or human-like problem-solving). The "Input" is processed through three rounds of "Thought" (each using a network-based structure to integrate information) to generate the "Answer." The repetition of "Thought" implies iterative refinement: each "Thought" phase builds on the previous one, using the network diagram to process and integrate data (like a neural network or knowledge graph) to move closer to the final answer. The consistent design of the three "Thought" stages suggests a structured, repeatable processing method (e.g., a model that refines its reasoning over multiple iterations).
</details>
(a) Chain of Thought (CoT)
<details>
<summary>figures/tot.jpg Details</summary>

### Visual Description
## Diagram: Multi-Stage Decision or Validation Flowchart
### Overview
The image is a technical flowchart illustrating a multi-stage, branching decision or validation process. It depicts an "Input" at the top, which is processed through three hierarchical layers of nodes. Each node is represented by a square containing the letter "T" and an associated neural network icon. The process flows downward, with paths diverging and converging based on outcomes marked by green checkmarks (✓) and red crosses (✗). A single successful path, highlighted in green, ultimately leads to a final "Answer" at the bottom. The diagram visually represents a system where multiple parallel attempts are made, with failures being pruned and successful paths being propagated forward.
### Components/Axes
* **Primary Flow Direction:** Top to bottom.
* **Key Text Labels:**
* **"Input"**: Located in an oval at the top-center. This is the starting point.
* **"T"**: The label inside every square node in the three main layers. This likely stands for "Task," "Test," "Transformer," or a similar unit of processing.
* **"Answer"**: Located in a rounded rectangle at the bottom-center. This is the final output.
* **Symbols & Icons:**
* **Green Checkmark (✓) in a square**: Indicates a successful or validated outcome from a node.
* **Red Cross (✗) in a circle**: Indicates a failed or rejected outcome from a node.
* **Neural Network Icon**: A stylized graphic of interconnected nodes (dark blue, light blue, green circles) attached to the left side of every "T" square. This signifies that each "T" unit involves a neural network or complex model.
* **Connectors:**
* **Green Arrows**: Represent the successful, active path of processing that leads to the final answer.
* **Gray Arrows**: Represent alternative or failed processing paths that do not contribute to the final answer.
* **Structural Layers:** The diagram is divided by three horizontal gray lines, creating four distinct vertical zones:
1. **Header Zone:** Contains only the "Input" oval.
2. **Layer 1 Zone:** Contains three "T" nodes.
3. **Layer 2 Zone:** Contains five "T" nodes.
4. **Layer 3 Zone:** Contains five "T" nodes.
5. **Footer Zone:** Contains only the "Answer" rectangle.
### Detailed Analysis
**Spatial Grounding & Path Tracing:**
The process begins at the top-center "Input." Three green arrows diverge from it, pointing to the three nodes in Layer 1.
* **Layer 1 (Top row of nodes):**
* **Left Node (T):** Receives a green arrow from Input. Has a green checkmark (✓). Emits two green arrows downward to Layer 2.
* **Center Node (T):** Receives a gray arrow from Input. Has a red cross (✗). Emits a single gray arrow downward to Layer 2.
* **Right Node (T):** Receives a green arrow from Input. Has a green checkmark (✓). Emits two green arrows downward to Layer 2.
* **Layer 2 (Middle row of nodes):**
* **Node 1 (Far Left):** Receives a gray arrow from Layer 1's Left Node. Has a red cross (✗). Emits a gray arrow to Layer 3.
* **Node 2 (Center-Left):** Receives a green arrow from Layer 1's Left Node. Has a green checkmark (✓). Emits a green arrow downward to Layer 3.
* **Node 3 (Center):** Receives a gray arrow from Layer 1's Center Node. Has a red cross (✗). Emits a gray arrow to Layer 3.
* **Node 4 (Center-Right):** Receives a green arrow from Layer 1's Right Node. Has a green checkmark (✓). Emits a gray arrow to Layer 3.
* **Node 5 (Far Right):** Receives a gray arrow from Layer 1's Right Node. Has a red cross (✗). Emits a gray arrow to Layer 3.
* **Layer 3 (Bottom row of nodes):**
* **Node 1 (Far Left):** Receives a gray arrow from Layer 2, Node 1. Has a red cross (✗). No outgoing arrow.
* **Node 2 (Center-Left):** Receives a green arrow from Layer 2, Node 2. Has a green checkmark (✓). Emits the final green arrow downward to the "Answer."
* **Node 3 (Center):** Receives a gray arrow from Layer 2, Node 3. Has a red cross (✗). No outgoing arrow.
* **Node 4 (Center-Right):** Receives a gray arrow from Layer 2, Node 4. Has a red cross (✗). No outgoing arrow.
* **Node 5 (Far Right):** Receives a gray arrow from Layer 2, Node 5. Has a red cross (✗). No outgoing arrow.
**The Successful Path (Green):**
Input → Layer 1, Left Node (✓) → Layer 2, Node 2 (✓) → Layer 3, Node 2 (✓) → Answer.
### Key Observations
1. **Pruning of Failure:** Every node marked with a red cross (✗) becomes a terminal point for its branch; no further processing flows from it. This is a clear visual representation of failure pruning.
2. **Path Divergence and Convergence:** The successful path from Layer 1's Left Node splits into two in Layer 2, but only one of those sub-branches (Node 2) succeeds and continues. The successful path from Layer 1's Right Node also splits, but both of its sub-branches in Layer 2 ultimately fail (Node 4 gets a checkmark but its child in Layer 3 fails; Node 5 fails immediately).
3. **Single Success Out of Many:** Out of 13 total "T" nodes across three layers, only 4 are marked with green checkmarks. Only one complete path from Input to Answer is fully green.
4. **Neural Network Ubiquity:** The presence of the neural network icon on every "T" node indicates that each decision or processing step is mediated by a complex model, not a simple rule.
### Interpretation
This diagram illustrates a **robust, multi-attempt validation or search process**, likely for a complex problem-solving system (e.g., in AI planning, theorem proving, or model-based reasoning).
* **What it demonstrates:** The system tackles a problem ("Input") by launching multiple parallel processing threads (the three initial paths). Each thread undergoes successive stages of evaluation or refinement (the "T" nodes). The green checkmarks represent passing a validation gate or achieving a sub-goal, while red crosses represent dead ends or invalid solutions.
* **How elements relate:** The flowchart shows a **hierarchical, depth-first search with aggressive pruning**. The system explores several promising avenues simultaneously but quickly abandons paths that show early signs of failure, conserving computational resources. The final "Answer" is not the first solution found, but the one that successfully navigated all validation stages.
* **Notable patterns/anomalies:** The most striking pattern is the **high attrition rate**. The funneling from three initial paths to one final answer, with most branches ending in failure, suggests the problem space is difficult and the validation criteria are strict. The fact that a checkmark in Layer 2 (Node 4) does not guarantee success in Layer 3 indicates that validation is **stage-dependent**; passing one test does not ensure passing the next.
* **Underlying meaning:** This is a visual metaphor for **iterative refinement and critical selection**. It argues that for complex tasks, generating multiple hypotheses and subjecting them to layered, rigorous testing is a reliable path to a correct answer, even if most hypotheses fail. The "T" nodes could represent different models, different reasoning strategies, or different candidate solutions being evaluated by a critic network.
</details>
(b) Tree of Thought (ToT)
<details>
<summary>figures/got.jpg Details</summary>

### Visual Description
## Diagram: Parallel Processing Decision Tree with Neural Network Nodes
### Overview
The image is a technical flowchart depicting a multi-layered parallel processing system. It illustrates how an initial "Input" is processed through multiple pathways across three distinct layers, with some pathways terminating early (marked with a red "X") and others continuing (marked with a green checkmark). Successful pathways from the final layer converge to produce a single "Answer." Each processing node is represented by a square containing the letter "T" and is accompanied by a stylized neural network graphic.
### Components/Axes
* **Primary Flow Elements:**
* **Input:** A single oval at the top-center of the diagram, labeled "Input". It is the origin point for the process.
* **Processing Nodes:** Twelve identical square nodes, each labeled with a capital "T". Each "T" node is visually paired with a network graphic composed of interconnected circles (nodes) in dark gray, light blue, and light green.
* **Answer:** A single rounded rectangle at the bottom-right of the diagram, labeled "Answer". It is the convergence point for successful pathways.
* **Flow Indicators:**
* **Arrows:** Gray and green arrows indicate the direction of data or process flow between nodes and layers.
* **Status Icons:**
* **Green Checkmark (✓):** A square green icon with a white checkmark, placed on the output path of a node, indicating a successful or valid processing step.
* **Red Cross (✗):** A circular red icon with a white "X", placed on the output path of a node, indicating a failed or terminated processing step.
* **Structural Layout:** The diagram is organized into three horizontal layers, separated by faint gray lines. The flow progresses vertically downward from Input to Answer.
### Detailed Analysis
**Layer 1 (Top Layer, directly below Input):**
* **Nodes:** 3 "T" nodes.
* **Flow from Input:** The Input sends three green arrows, one to each node in this layer.
* **Node Outcomes:**
* **Left Node:** Has a green checkmark. Sends one gray arrow down to the leftmost node of Layer 2 and one green arrow down to the second node of Layer 2.
* **Middle Node:** Has a red cross. Sends one gray arrow down to the third node of Layer 2.
* **Right Node:** Has a green checkmark and a self-referential loop arrow. Sends one green arrow down to the fourth node of Layer 2 and one green arrow down to the fifth (rightmost) node of Layer 2.
**Layer 2 (Middle Layer):**
* **Nodes:** 5 "T" nodes.
* **Node Outcomes:**
* **Node 1 (Leftmost):** Has a red cross. Receives a gray arrow from Layer 1's left node. No outgoing arrows.
* **Node 2:** Has a green checkmark. Receives a green arrow from Layer 1's left node. Sends one gray arrow down to the first node of Layer 3 and one green arrow down to the second node of Layer 3.
* **Node 3:** Has a green checkmark. Receives a gray arrow from Layer 1's middle node. Sends one green arrow down to the second node of Layer 3.
* **Node 4:** Has a green checkmark. Receives a green arrow from Layer 1's right node. Sends one green arrow down to the third node of Layer 3.
* **Node 5 (Rightmost):** Has a green checkmark and a self-referential loop arrow. Receives a green arrow from Layer 1's right node. Sends one green arrow down to the fourth node of Layer 3.
**Layer 3 (Bottom Layer, above Answer):**
* **Nodes:** 4 "T" nodes.
* **Node Outcomes:**
* **Node 1 (Leftmost):** Has a red cross. Receives a gray arrow from Layer 2's node 2. No outgoing arrows.
* **Node 2:** Has a green checkmark and a self-referential loop arrow. Receives green arrows from Layer 2's node 2 and node 3. Sends one green arrow down to the Answer.
* **Node 3:** Has a green checkmark and a self-referential loop arrow. Receives a green arrow from Layer 2's node 4. Sends one green arrow down to the Answer.
* **Node 4 (Rightmost):** Has a green checkmark and a self-referential loop arrow. Receives a green arrow from Layer 2's node 5. Sends one green arrow down to the Answer.
**Convergence:**
* Three green arrows from Layer 3's nodes 2, 3, and 4 converge into the "Answer" box.
### Key Observations
1. **Pathway Attrition:** The process begins with 3 parallel paths in Layer 1. By Layer 3, only 3 out of an initial potential 12 nodes (across all layers) are actively contributing to the final answer. Many pathways are terminated by red crosses.
2. **Self-Loops:** Several nodes (Layer 1 right, Layer 2 rightmost, Layer 3 nodes 2, 3, and 4) feature a curved arrow looping back into themselves, suggesting an iterative or recursive sub-process within that node before passing its output forward.
3. **Color-Coded Flow:** Green arrows consistently originate from nodes with green checkmarks, indicating the flow of "successful" data. Gray arrows originate from nodes or lead to nodes that ultimately fail (red cross).
4. **Asymmetric Success:** The rightmost branch of the tree appears most robust, with its initial node and all subsequent descendant nodes in the chain receiving green checkmarks and contributing to the final answer.
### Interpretation
This diagram models a fault-tolerant or exploratory computational system, likely representing an ensemble method, a search algorithm, or a neural network architecture with dropout-like failure modes.
* **Parallel Exploration:** The system explores multiple solution pathways simultaneously from a single input.
* **Validation & Pruning:** Each node ("T" could stand for "Test," "Transformer," or "Task") performs a validation step. Pathways that fail validation (red cross) are pruned and do not propagate further, conserving computational resources.
* **Iterative Refinement:** The self-referential loops indicate that individual nodes can perform internal, possibly recursive, computations to refine their output before passing it on.
* **Ensemble Answer:** The final "Answer" is not derived from a single pathway but is an aggregation or consensus from multiple successful, independent processing chains (the three converging green arrows). This suggests a robust output less dependent on any single processing path.
The overall structure emphasizes resilience through redundancy and selective propagation, where the final output is the product of multiple validated sub-processes.
</details>
(c) Graph of Thought (GoT)
Figure 2: Reasoning Strategies. This figure illustrates different LLM reasoning strategies for navigating the potential answer space: CoT, ToT, and GoT. Each strategy consists of "thoughts" connected to the KG through search methods (Section 4.2), illustrating stepwise reasoning over structured knowledge.
### 4.1 Reasoning Strategies
Chain-of-Thought (CoT)
CoT is a well-known reasoning method that generates a sequence of logical steps, where each step builds upon the previous ones, ultimately leading to a conclusion. Formally, it generates a sequence of reasoning steps $Z_{p_{\theta}}(q)=\{z_1,z_2,\dots,z_n\}$, where each step $z_i$ is sampled sequentially given the input query $q$, all previous steps, and the graph information retrieved across steps, $G^\prime$, as $z_i \sim p_{\theta}^{\mathrm{CoT}}(z_i \mid q, G^\prime, z_{1\dots i-1})$. The final answer $a$ is derived from this reasoning process given all the generated thoughts, $a \sim p_{\theta}^{\mathrm{CoT}}(a \mid q, G^\prime, z_{1\dots n})$. In practice, the chain is sampled as a continuous language sequence. Figure 2(a) illustrates this method, where each step is linked to the KG.
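As a minimal sketch, the stepwise grounding loop can be written as follows; `llm_sample` and `retrieve_subgraph` are hypothetical stand-ins for the model call and the KG lookup, not part of our actual implementation:

```python
def cot_reason(q, llm_sample, retrieve_subgraph, n_steps=3):
    """Sketch of KG-grounded CoT: each thought z_i conditions on the query,
    all previous thoughts, and the graph context G' gathered so far."""
    thoughts, graph_ctx = [], []
    for _ in range(n_steps):
        z = llm_sample(q, graph_ctx, thoughts)   # z_i ~ p_theta(z_i | q, G', z_1..i-1)
        graph_ctx.extend(retrieve_subgraph(z))   # ground the new thought in the KG
        thoughts.append(z)
    answer = llm_sample(q, graph_ctx, thoughts)  # a ~ p_theta(a | q, G', z_1..n)
    return answer, thoughts
```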
Tree-of-Thought (ToT)
ToT generalizes CoT by modeling the reasoning process as a tree, enabling simultaneous exploration of multiple reasoning paths. Starting from an initial state $s_0=[q]$ , where $q$ is the input, ToT incrementally expands each state by generating multiple candidate thoughts:
$$
z_{i+1}^{(j)} \sim p_{\theta}\left(z_{i+1} \mid s_i\right), \quad j=1,\dots,k \tag{1}
$$
Each candidate thought represents a node in the tree, forming new states. These states are evaluated by a heuristic scoring function $V(p_θ,s)$ , guiding the selection and pruning of branches. Search strategies, such as breadth-first search (BFS), systematically explore this tree:
$$
S_t = \arg\max_{S^\prime \subseteq \hat{S}_t,\, |S^\prime| = b} \; \sum_{s \in S^\prime} V(p_{\theta}, s) \tag{2}
$$
where $\hat{S}_t$ denotes the candidate states at step $t$, and $b$ limits the breadth. We implement two versions of the heuristic function $V$ to select the top $t$ states:
1. Selection: The LLM directly chooses the top $t$ states to proceed, discarding the others.
2. Score: The states are ranked by a heuristic voting mechanism, $V(p_{\theta}, S)(s) = \mathbb{1}[s = s^*]$, where the LLM is prompted to estimate the probability that the current state solves the given input question.
This structured search and pruning strategy allows the model to evaluate multiple candidate reasoning paths, enabling more deliberate and interpretable reasoning.
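A minimal sketch of the breadth-limited search in Eq. (2); `expand` and `score` are hypothetical stand-ins for candidate-thought generation and the heuristic $V$:

```python
def tot_bfs(q, expand, score, depth=3, k=3, b=2):
    """Sketch of ToT breadth-first search: expand each state into k candidate
    thoughts, then keep the b highest-scoring states per level (Eq. 2)."""
    frontier = [[q]]  # initial state s_0 = [q]
    for _ in range(depth):
        # generate k candidate thoughts per state, forming new states
        candidates = [s + [z] for s in frontier for z in expand(s, k)]
        # keep the b best states under the heuristic V
        frontier = sorted(candidates, key=score, reverse=True)[:b]
    return frontier
```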
Graph-of-Thought (GoT)
GoT extends ToT by organizing reasoning into a directed graph structure $G=(V,E)$, where each node represents a thought and edges reflect dependencies. Starting from an initial thought, new thoughts are generated as in ToT and added to the graph. Each new thought is connected to its parent, and additional reasoning chains can be formed through merging operations:
$$
z_{i+1} = A\left(z_i^{(a)}, z_i^{(b)}\right) \tag{3}
$$
where $A$ denotes a merge operation that integrates two thought chains into a single coherent reasoning step. The merged thought is added as a new node with edges from both parents. In our implementation, thoughts are evaluated using either the Selection- or the Score-based strategy, as in ToT. Merged thoughts inherit information from both parents and can enhance robustness. At each depth, a fixed number of thoughts are retained using breadth-first traversal and evaluated for progression. The graph-based organization captures dependencies and merges information from multiple reasoning chains, supporting dynamic refinement and structured exploration of the reasoning space.
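The merge operation of Eq. (3) can be sketched as follows, assuming a simple graph representation of thought nodes and parent sets; `merge_fn` is a hypothetical stand-in for the LLM merge prompt:

```python
def got_merge(graph, a, b, merge_fn):
    """Sketch of the GoT merge (Eq. 3): integrate two parent thoughts a and b
    into a new thought node with edges from both parents."""
    z_new = merge_fn(graph["thoughts"][a], graph["thoughts"][b])
    nid = len(graph["thoughts"])        # id of the new node
    graph["thoughts"].append(z_new)
    graph["parents"][nid] = {a, b}      # edges from both parent thoughts
    return nid
```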
### 4.2 LLM + KG Interaction Methods
We implement methods that connect the reasoning strategies with KGs: the LLM interacts with the KG at every step, retrieving new information that conditions the model for subsequent steps. We present two methods for this interaction, both illustrated in Appendix B.
#### 4.2.1 Agent
This approach creates an agent that interacts with the graph, following the methodology initially described in ReACT (react). After generating a thought, the LLM selects from a set of actions based on that thought. Each step in the reasoning chain consists of an interleaved sequence: thought → action → retrieved data. This method implements the four actions described in GraphCoT (graphCoT): (a) RetrieveNode(Text): identifies the related node in the graph using semantic search; (b) NodeFeature(NodeID, FeatureName): retrieves textual information for a specific node from the graph; (c) NeighborCheck(NodeID, EdgeType): retrieves the neighbors' information for a specific node; (d) NodeDegree(NodeID, EdgeType): returns the degree (number of neighbors) for a given node and edge type. Together, these actions enable the agent to navigate and extract meaningful information from the graph, grounding the LLM's thoughts in structured, retrievable data.
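The four actions can be sketched as a small dispatch table; the `kg` dictionary and its fields are hypothetical stand-ins for the actual graph backend and semantic index, not our concrete implementation:

```python
def run_action(kg, action, *args):
    """Sketch of the agent's action dispatch over the four GraphCoT actions.
    `kg` bundles a (stubbed) semantic index, node features, and typed edges."""
    actions = {
        "RetrieveNode":  lambda text: kg["index"][text],                # semantic search (stubbed as a lookup)
        "NodeFeature":   lambda nid, feat: kg["nodes"][nid][feat],      # textual feature of a node
        "NeighborCheck": lambda nid, etype: kg["edges"][nid][etype],    # neighbors via a given edge type
        "NodeDegree":    lambda nid, etype: len(kg["edges"][nid][etype]),  # number of such neighbors
    }
    return actions[action](*args)
```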
#### 4.2.2 Automatic Graph Exploration
This method incrementally searches the graph by interleaving language generation with structured retrieval. At each step, the LLM generates a new "thought" based on previous thoughts and retrieved triples. Entities mentioned in the generated text are automatically extracted using LLM prompts and serve as anchors for further graph exploration.
<details>
<summary>figures/graph_search.png Details</summary>

### Visual Description
## Diagram: Information Extraction Pipeline to Knowledge Triples
### Overview
The image is a technical flowchart illustrating a three-stage pipeline for extracting structured knowledge triples from unprocessed text. The process flows from left to right, beginning with a source document and ending with formatted output triples. The diagram uses icons, labeled boxes, colored arrows, and status indicators to depict each step.
### Components/Axes
The diagram is segmented into four distinct functional regions, arranged horizontally:
1. **Input & Entity Extraction (Leftmost Region):**
* **Input Icon:** A document icon labeled with a capital "T" represents the source text.
* **Process Icon:** A robot head icon symbolizes the automated extraction system.
* **Output of Stage:** Three rectangular boxes labeled `Entity1`, `Entity2`, and `Entity3` in a vertical list. `Entity2` is highlighted with a blue background.
2. **Relation Selection (Center-Left Region):**
* **Process:** Arrows originate from the `Entity2` box, indicating it is the subject for relation extraction.
* **Relation Labels:** Two red arrows are labeled `Rel1` (pointing upward) and `Rel2` (pointing downward).
* **Target:** Both arrows point toward a cluster of three green boxes, each labeled `EntityA`. This suggests multiple potential object entities for the relations.
3. **Neighbor Selection (Center-Right Region):**
* **Process:** This stage filters the candidate `EntityA` nodes from the previous step.
* **Status Indicators:** The top two `EntityA` boxes have green checkmarks (✓) next to them, indicating they are valid or selected neighbors. The bottom `EntityA` box has a red cross (✗), indicating it is invalid or rejected.
4. **Triples (Rightmost Region):**
* **Output Container:** A light gray rounded rectangle labeled "Triples" at the top.
* **Output Content:** Inside, three structured triples are displayed vertically. Each triple consists of a blue square (subject), a red arrow (predicate/relation), and a green square (object). This visually represents the final structured output: `(Entity2, Rel1, EntityA)` and `(Entity2, Rel2, EntityA)`.
### Detailed Analysis
* **Flow & Connectivity:** The pipeline is strictly sequential: Document → Entity Extraction → Relation Selection → Neighbor Selection → Triples.
* **Color Coding:**
* **Blue:** Represents the primary subject entity (`Entity2`) and the subject component in the final triples.
* **Green:** Represents candidate and selected object entities (`EntityA`) and the object component in the final triples.
* **Red:** Represents the relations (`Rel1`, `Rel2`) and the predicate component in the final triples.
* **Spatial Grounding:** The legend (color meaning) is implicit but consistent. The `Entity2` box (blue) is the central node in the Relation Selection stage. The `EntityA` boxes (green) are the targets. The final triples in the "Triples" container (far right) precisely mirror the color scheme and logical connections established in the middle stages.
* **Trend/Process Verification:** The diagram does not show numerical trends but a logical process flow. The trend is a transformation from unstructured data (document icon) to a filtered, structured format (triples). The "Neighbor Selection" stage acts as a critical filter, reducing three candidate relations to two valid ones.
### Key Observations
1. **Entity Disambiguation:** The "Neighbor Selection" stage is crucial. It shows that not all extracted relations (`Rel1`, `Rel2` to `EntityA`) are valid. The system performs a validation step, likely using context or knowledge base constraints, to discard incorrect links (the bottom `EntityA` with the ✗).
2. **One-to-Many Relation:** The diagram explicitly shows that a single subject entity (`Entity2`) can have multiple relations (`Rel1`, `Rel2`) pointing to the same object entity type (`EntityA`), resulting in multiple distinct triples.
3. **Abstraction:** The use of generic labels (`Entity1`, `Rel1`) indicates this is a conceptual model of a process, not a specific example with real-world data.
### Interpretation
This diagram models a core task in Natural Language Processing (NLP) and Knowledge Graph construction: **Open Information Extraction**. It demonstrates how a system can:
1. **Identify** entities (`Entity1,2,3`) from raw text.
2. **Hypothesize** potential relationships (`Rel1, Rel2`) between a chosen subject (`Entity2`) and other entities.
3. **Validate** these hypotheses by checking the "neighborhood" or context of the linked entities, filtering out spurious connections.
4. **Output** clean, structured facts in the form of `(subject, predicate, object)` triples, which are the fundamental building blocks of a knowledge graph.
The presence of the "Neighbor Selection" stage is particularly significant. It implies the system goes beyond simple pattern matching and incorporates a verification mechanism (perhaps using semantic similarity, knowledge base coherence, or syntactic constraints) to improve the precision of the extracted knowledge. The final "Triples" are the actionable, machine-readable output that can be used for search, reasoning, or AI training.
</details>
Figure 3: Automatic Graph Exploration. The method extracts entities from the text (query/thought), then selects relevant relations and neighbors with the LLM. The resulting entity-relation-entity combinations form triples that expand the reasoning chain.
Graph exploration proceeds through a multi-step Search + Prune pipeline, inspired by the process described in think-on-graph. For each unvisited entity, the system first retrieves and prunes relation types using LLM guidance. Then, for each selected relation, neighboring entities are discovered and filtered in a second round of pruning: the model selects only the most relevant neighbors based on their contextual fit with the question and the previous reasoning steps. This hierarchical pruning, first on relations and then on entities, keeps the method computationally tractable while preserving interpretability. The overall traversal follows a breadth-first search (BFS) pattern, with pruning decisions at each level directed by the LLM. This process is shown in Figure 3. This iterative reasoning-and-retrieval loop allows the model to condition future steps on progressively relevant subgraphs, shaping the reasoning trajectory. Unlike agentic methods that rely on predefined actions, the automatic approach operates in the graph space guided by natural language, providing more freedom in generation. The mechanism is designed to maximize information gain at each step while avoiding graph overgrowth. More details are provided in Algorithm 1.
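A minimal sketch of the hierarchical Search + Prune traversal; `get_relations`, `get_neighbors`, and the LLM-guided `prune` are illustrative stand-ins for the graph backend and the pruning prompts:

```python
def explore(anchors, get_relations, get_neighbors, prune, depth=2, top=2):
    """Sketch of Search + Prune: BFS from the anchor entities, pruning first
    relation types and then neighbor entities at each level."""
    triples, frontier, visited = [], list(anchors), set(anchors)
    for _ in range(depth):
        next_frontier = []
        for e in frontier:
            for rel in prune(get_relations(e), top):          # first-level pruning: relations
                for nb in prune(get_neighbors(e, rel), top):  # second-level pruning: entities
                    triples.append((e, rel, nb))
                    if nb not in visited:                     # avoid graph overgrowth
                        visited.add(nb)
                        next_frontier.append(nb)
        frontier = next_frontier
    return triples
```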
## 5 Experiments
Benchmark
We use the GRBench dataset to evaluate our methods. This dataset is specifically designed to evaluate how effectively LLMs can perform stepwise reasoning over domain-specific graphs. For our evaluation, we selected 7 graphs across multiple domains, excluding those whose RAM requirements exceed our available resources. Comprehensive graph statistics are provided in Appendix A.
Baselines
The proposed methods, Agent and Automatic Graph Exploration, applied to CoT, ToT, and GoT, are compared against the following baselines: (1) Zero-Shot: directly querying the model to answer the question without additional context. (2) Text RAG (rag_sruvey): a text-retrieval method that uses the text representation of nodes as the query input, with the retrieved data serving as context for the model. (3) Graph RAG: extends Text RAG by including node neighbors (1-hop) as additional context. (4) Graph CoT (Agent): implements Graph CoT as an agent for CoT reasoning, using the actions described in Section 4.2. These baselines allow us to measure the impact of stepwise, knowledge-grounded reasoning versus simple retrieval-augmented or zero-shot approaches.
Experimental methods
We implement the methods described in Section 4, extending (1) Agent and (2) Automatic Graph Exploration with different reasoning strategies at inference time: (1) CoT, (2) ToT, and (3) GoT. For ToT and GoT, we evaluate the impact of stepwise decision-making using the State Evaluation methods (1) Selection and (2) Score. In the results presented in Table 1, we set $n=10$ steps for all methods; ToT and GoT use a branching factor and selection size of $k=t=3$. Our experiments focus on the effect of structured reasoning interventions on performance and on the stepwise refinement of answers. We use only the open-access Llama 3.1 (Instruct) models (llama3models), which enhances reproducibility and allows for unlimited free calls. Specifically, we employ the 8B, 70B, and 405B versions, using the FP8 variant for the 405B model.
Evaluation
We use rule-based and model-based metrics to evaluate the models, following the GRBench paper (graphCoT). For the rule-based metric, we use Rouge-L (R-L) (rouge_metric), which measures the longest sequence of words appearing in the same order in both the generated text and the ground-truth answer. For the model-based metric, we prompt GPT-4o to assess whether the model's output matches the ground-truth answer; GPT4Score is the percentage of answers that GPT-4o identifies as correct. These evaluation methods capture not only final-answer accuracy but also the fidelity of the reasoning steps, reflecting the effectiveness of our interventions in guiding LLM reasoning over structured knowledge.
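For reference, Rouge-L can be computed as the F1 score over the longest common subsequence (LCS) of tokens. This is a minimal sketch of the metric, not necessarily the exact implementation used in our evaluation:

```python
def rouge_l_f1(candidate, reference):
    """Rouge-L sketch: F1 over the longest common subsequence of word tokens."""
    c, r = candidate.split(), reference.split()
    # classic LCS dynamic program over the two token sequences
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```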
Implementation Details
The experiments are run on NVIDIA TITAN RTX or NVIDIA A100 using Python 3.8. The models are deployed with vLLM vllm, a memory-efficient library for LLM inference and serving. For the baseline, Mpnet-v2 is used as the retriever, and FAISS faiss is employed for indexing.
## 6 Results
The main results from both the baselines and the experimental methods, evaluated using R-L, are presented in Table 1; additional results using GPT4Score can be found in Appendix D. Together, these findings allow us to compare different forms of reasoning interventions (agentic action selection, automatic graph exploration, and structured multi-path search) on their ability to guide LLMs toward accurate answers. We highlight three key insights: (1) The agentic method generally outperformed automatic graph exploration, indicating that targeted interventions on knowledge-graph traversal enhance answer accuracy. (2) The ToT strategy demonstrated superior performance by effectively exploring multiple reasoning paths, showcasing the benefits of inference-time interventions that diversify reasoning trajectories. (3) Although the GoT strategy showed potential, it did not significantly outperform ToT, suggesting that merging divergent reasoning paths remains a challenging intervention-design problem. These results underscore the importance of reasoning strategies in enabling LLMs to navigate multiple paths in the graph, while also illustrating the limits of current intervention techniques.
| Method | Variant | Model | Healthcare | Goodreads | Biology | Chemistry | Materials Science | Medicine | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | Base | Llama 3.1 8B-Ins | 7.32 | 6.18 | 10.68 | 11.69 | 8.95 | 8.77 | 11.52 |
| | | Llama 3.1 70B-Ins | 9.74 | 9.79 | 11.49 | 12.58 | 10.40 | 12.21 | 12.61 |
| | | Llama 3.1 405B-Ins | 8.66 | 12.49 | 10.52 | 13.51 | 11.73 | 11.82 | 11.63 |
| | Text-RAG | Llama 3.1 8B-Ins | 8.24 | 14.69 | 12.43 | 11.42 | 9.46 | 10.75 | 11.29 |
| | | Llama 3.1 70B-Ins | 10.32 | 18.81 | 11.87 | 16.35 | 12.25 | 12.77 | 12.54 |
| | | Llama 3.1 405B-Ins | 11.61 | 16.23 | 16.11 | 13.82 | 14.23 | 15.16 | 16.32 |
| | Graph-RAG | Llama 3.1 8B-Ins | 12.94 | 22.30 | 30.72 | 34.46 | 30.20 | 25.81 | 33.49 |
| | | Llama 3.1 70B-Ins | 17.95 | 25.36 | 38.88 | 40.90 | 41.09 | 31.43 | 39.75 |
| | | Llama 3.1 405B-Ins | 16.12 | 23.13 | 37.57 | 42.58 | 37.74 | 33.34 | 40.98 |
| Graph CoT | Agent | Llama 3.1 8B-Ins | 16.83 | 30.91 | 20.15 | 18.43 | 26.29 | 14.95 | 21.41 |
| | | Llama 3.1 70B-Ins | 33.48 | 40.98 | 50.00 | 51.53 | 49.60 | 48.27 | 44.35 |
| | | Llama 3.1 405B-Ins | 28.41 | 36.56 | 41.35 | 48.36 | 47.81 | 42.54 | 35.24 |
| | Graph Explore | Llama 3.1 8B-Ins | 25.58 | 32.34 | 36.65 | 35.33 | 31.06 | 31.05 | 35.96 |
| | | Llama 3.1 70B-Ins | 29.41 | 29.60 | 44.63 | 49.49 | 39.23 | 38.87 | 45.52 |
| | | Llama 3.1 405B-Ins | 28.45 | 43.06 | 36.93 | 38.71 | 47.49 | 55.66 | 32.73 |
| Graph ToT | Agent (Score) | Llama 3.1 8B-Ins | 28.91 | 52.25 | 43.81 | 44.18 | 43.49 | 36.07 | 39.56 |
| | | Llama 3.1 70B-Ins | 38.51 | 51.58 | 64.44 | 61.13 | 55.19 | 63.00 | 55.33 |
| | | Llama 3.1 405B-Ins | 47.51 | 50.73 | 70.34 | 64.90 | 49.02 | 65.40 | 44.63 |
| | Agent (Select) | Llama 3.1 8B-Ins | 28.67 | 50.59 | 42.33 | 37.07 | 40.81 | 33.17 | 36.50 |
| | | Llama 3.1 70B-Ins | 40.26 | 52.59 | 64.53 | 66.84 | 61.42 | 61.21 | 55.89 |
| | | Llama 3.1 405B-Ins | 46.90 | 51.68 | 70.27 | 67.95 | 63.74 | 64.23 | 59.56 |
| | Graph Explore (Score) | Llama 3.1 8B-Ins | 24.49 | 36.80 | 35.81 | 36.41 | 34.28 | 34.49 | 37.69 |
| | | Llama 3.1 70B-Ins | 32.79 | 38.19 | 53.83 | 58.25 | 48.55 | 52.18 | 48.07 |
| | | Llama 3.1 405B-Ins | 33.90 | 42.68 | 46.87 | 57.43 | 50.46 | 55.56 | 48.73 |
| | Graph Explore (Select) | Llama 3.1 8B-Ins | 25.04 | 37.80 | 36.34 | 38.50 | 32.44 | 33.31 | 34.85 |
| | | Llama 3.1 70B-Ins | 33.40 | 39.13 | 54.78 | 58.53 | 47.19 | 51.13 | 47.51 |
| | | Llama 3.1 405B-Ins | 33.82 | 43.63 | 44.47 | 59.06 | 48.52 | 55.62 | 46.07 |
Table 1: Rouge-L (R-L) performance results on GRBench, comparing standard LLMs, Text-RAG, Graph-RAG, Graph-CoT, and Graph-ToT. Experiments are described in Section 5, using Llama 3.1 Instruct backbone models with sizes 8B, 70B, and 405B (FP8).
Agent vs Graph Search
In our experimental results, the agentic method outperformed the graph exploration approach across most datasets and reasoning strategies. The agent-based method, in which the LLM selects specific actions to interact with the KG, consistently improves performance as the number of reasoning steps increases, as shown in Section 7. This highlights that explicit, model-driven interventions are more effective than passive expansion strategies, as they promote iterative refinement and selective focus. While graph exploration can quickly provide broad coverage, the agentic method's targeted, stepwise interactions yield more accurate and comprehensive answers over longer reasoning sequences.
Tree of Thought (ToT)
The ToT reasoning strategy showed superior performance across its various interaction methods and state evaluators, as summarized in Table 1. ToT achieved performance improvements of 54.74% in agent performance and 11.74% in exploration mode compared to the CoT baseline. However, this improvement comes with the trade-off of increased inference time, highlighting the cost of inference-time reasoning interventions. The success of ToT illustrates how branching interventions that explore multiple candidate paths can substantially enhance reasoning accuracy, especially when coupled with evaluators that prune unpromising trajectories. We also compared the two State Evaluation methods (Selection and Score), finding complementary benefits depending on dataset and scale.
| Method | Evaluator | Model | Healthcare | Biology |
| --- | --- | --- | --- | --- |
| Agent | Score | Llama 3.1 8B-Ins | 29.11 | 33.25 |
| | | Llama 3.1 70B-Ins | 30.88 | 56.64 |
| | | Llama 3.1 405B-Ins | 43.53 | 48.10 |
| | Select | Llama 3.1 8B-Ins | 29.05 | 40.37 |
| | | Llama 3.1 70B-Ins | 40.74 | 65.59 |
| | | Llama 3.1 405B-Ins | 47.63 | 71.49 |
| Graph Explore | Score | Llama 3.1 8B-Ins | 24.96 | 21.72 |
| | | Llama 3.1 70B-Ins | 31.24 | 50.70 |
| | | Llama 3.1 405B-Ins | 35.00 | 39.10 |
| | Select | Llama 3.1 8B-Ins | 25.06 | 21.84 |
| | | Llama 3.1 70B-Ins | 36.95 | 52.32 |
| | | Llama 3.1 405B-Ins | 33.74 | 54.64 |
Table 2: Graph-GoT results on GRBench using Rouge-L
Graph of Thought (GoT)
The results for the GoT strategy are summarized in Table 2. Due to the additional computational cost, we report results for only two datasets. GoT did not outperform ToT. Our initial hypothesis was that LLMs could integrate divergent results from multiple branches, but in practice the models struggled to merge them effectively; in the graph exploration setting in particular, models often failed to combine triples found in separate branches. This reveals a current limitation of aggregation-based reasoning interventions: while branching helps discover diverse facts, robust mechanisms for synthesis and reconciliation remain underdeveloped. This finding opens a direction for future research into more advanced intervention strategies for merging partial reasoning outcomes.
## 7 Analysis & Ablation studies
In this section, we want to better understand the nuances of our methods for LLM and KG grounding. We conduct an analysis on the Academic datasets from the benchmark, as they all contain the same number of samples and feature questions generated from similar templates to ensure a controlled comparison.
<details>
<summary>figures/analysis_steps_academic_rl.png Details</summary>

### Visual Description
## Line Chart: Agent vs. Explore Performance Over Steps
### Overview
The image is a line chart comparing the performance of two entities, labeled "Agent" and "Explore," across a series of 10 steps. Performance is measured by the "Rouge-L" metric on the y-axis. The chart includes shaded regions around each line, likely representing confidence intervals or standard deviation.
### Components/Axes
* **Chart Type:** Line chart with markers and shaded confidence bands.
* **X-Axis:** Labeled "Steps". It has major tick marks and numerical labels at every integer from 1 to 10.
* **Y-Axis:** Labeled "Rouge-L". It has major tick marks and numerical labels at intervals of 10, from 0 to 60.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains two entries:
* A blue line with a circle marker labeled **"Agent"**.
* An orange line with a square marker labeled **"Explore"**.
* **Grid:** A light gray grid is present, with vertical lines at each step and horizontal lines at each 10-unit increment on the y-axis.
### Detailed Analysis
**Data Series 1: Agent (Blue line, circle markers)**
* **Trend:** The line shows a steep, positive slope, indicating rapid improvement. It starts at the lowest point on the chart and ends at the highest.
* **Data Points (Approximate):**
* Step 1: 0
* Step 3: ~15
* Step 5: ~45
* Step 10: ~55
* **Confidence Band:** The shaded blue area is widest at the beginning (Steps 1-3), suggesting higher variance or uncertainty in early performance, and narrows significantly as the steps increase.
**Data Series 2: Explore (Orange line, square markers)**
* **Trend:** The line shows a moderate, positive slope that plateaus. It starts at a moderate performance level and shows gradual improvement before leveling off.
* **Data Points (Approximate):**
* Step 1: ~42
* Step 3: ~49
* Step 5: ~50
* Step 10: ~50
* **Confidence Band:** The shaded orange area is relatively consistent in width throughout all steps, suggesting stable variance in performance.
**Spatial Relationship:**
* The "Explore" line is positioned above the "Agent" line from Step 1 until approximately Step 5.
* The lines intersect between Step 5 and Step 6.
* From Step 6 onward, the "Agent" line is positioned above the "Explore" line.
### Key Observations
1. **Performance Crossover:** The most significant event is the crossover point between Step 5 and Step 6, where the "Agent" series surpasses the "Explore" series in Rouge-L score.
2. **Divergent Trajectories:** The two series exhibit fundamentally different learning or performance curves. "Agent" demonstrates a classic "slow start, rapid gain" pattern, while "Explore" shows a "strong start, quick plateau" pattern.
3. **Final Outcome:** By Step 10, "Agent" achieves a higher final score (~55) compared to "Explore" (~50).
4. **Uncertainty Patterns:** The uncertainty (shaded area) for "Agent" is dynamic, decreasing as performance improves. The uncertainty for "Explore" is static.
### Interpretation
The chart likely illustrates a comparison between two different strategies, algorithms, or models in a sequential decision-making or learning task. The "Rouge-L" metric is commonly used in natural language processing to evaluate text summarization or generation, suggesting this could be a learning curve for AI agents performing a language-based task.
* **What the data suggests:** The "Explore" strategy may represent a method that leverages prior knowledge or a broad initial search, yielding good immediate results but limited capacity for further refinement. The "Agent" strategy may represent a method that starts with no knowledge (score 0) but employs a more effective learning or optimization process, allowing it to rapidly improve and ultimately exceed the performance of the "Explore" strategy.
* **How elements relate:** The x-axis ("Steps") represents time or iterations of learning. The crossover point is critical, indicating the step at which the investment in the "Agent's" learning process begins to pay off relative to the "Explore" baseline.
* **Notable anomalies:** The near-zero starting point for "Agent" is notable, indicating a complete lack of initial capability on this specific metric. The plateau of "Explore" after Step 5 suggests it has reached its performance ceiling under the given conditions. The narrowing confidence band for "Agent" implies that its performance becomes more consistent and reliable as it learns.
</details>
Figure 4: Effect of the number of steps in the LLM-KG interaction methods. The Agent requires more steps to match the performance of Graph Exploration, which needs only the anchor entities extracted from the question to search the graph.
<details>
<summary>figures/search_depth_academic_rl.png Details</summary>

### Visual Description
## Line Chart with Error Band: Rouge-L Score vs. Search Depth
### Overview
The image is a line chart displaying the relationship between "Search Depth" (x-axis) and a performance metric called "Rouge-L" (y-axis). The chart includes a single data series labeled "Explore" and a shaded region representing the standard deviation (SD, denoted by σ) of the measurements. The overall trend shows an initial rapid improvement in the Rouge-L score that plateaus as search depth increases.
### Components/Axes
* **Chart Type:** Line chart with a shaded error band.
* **X-Axis:**
* **Label:** "Search Depth"
* **Scale:** Linear, with integer markers from 1 to 5.
* **Y-Axis:**
* **Label:** "Rouge-L"
* **Scale:** Linear, ranging from 30 to 70, with major tick marks every 5 units (30, 35, 40, 45, 50, 55, 60, 65, 70).
* **Legend:**
* **Position:** Top-left corner of the plot area.
* **Elements:**
1. **"Explore":** Represented by a solid, dark blue line with circular markers at each data point.
2. **"SD (σ)":** Represented by a light blue, semi-transparent shaded area surrounding the main line.
* **Grid:** A light gray grid is present, with horizontal and vertical lines aligning with the major axis ticks.
### Detailed Analysis
**Data Series: "Explore" (Dark Blue Line with Markers)**
* **Trend Verification:** The line slopes steeply upward from Search Depth 1 to 2, continues to rise more gradually to Depth 4, and then shows a very slight downward slope or plateau at Depth 5.
* **Approximate Data Points:**
* Search Depth 1: Rouge-L ≈ 36
* Search Depth 2: Rouge-L ≈ 42
* Search Depth 3: Rouge-L ≈ 43
* Search Depth 4: Rouge-L ≈ 44
* Search Depth 5: Rouge-L ≈ 43
**Error Band: "SD (σ)" (Light Blue Shaded Area)**
* The shaded region represents the standard deviation around the mean "Explore" line.
* **Width Analysis:** The band is narrowest at Search Depth 1 (spanning approximately 34 to 38). It widens significantly at Depth 2 and remains relatively wide for Depths 3, 4, and 5, indicating greater variability in the Rouge-L scores at higher search depths. The band appears widest around Depths 3 and 4.
### Key Observations
1. **Diminishing Returns:** The most significant gain in Rouge-L score occurs when increasing Search Depth from 1 to 2 (an increase of ~6 points). Subsequent increases in depth yield progressively smaller improvements.
2. **Performance Plateau:** The metric appears to peak at Search Depth 4 (≈44) and shows a slight decrease at Depth 5 (≈43), suggesting a potential performance ceiling or slight degradation with very deep searches.
3. **Increasing Variability:** The standard deviation (shaded area) expands as Search Depth increases. This indicates that while the average performance improves, the consistency or reliability of the results decreases with deeper searches.
4. **Visual Anchors:** The data point at Search Depth 2 (≈42) sits just above the 40 grid line. The point at Depth 4 (≈44) is just below the 45 grid line.
### Interpretation
This chart likely evaluates the performance of an information retrieval, text generation, or similar AI system where "Search Depth" is a controllable parameter. "Rouge-L" is a common metric for evaluating text summarization or translation by measuring longest common subsequence overlap with a reference.
The data suggests a clear trade-off:
* **Efficiency vs. Depth:** A shallow search (Depth 1) is fast but yields lower quality results. Increasing depth to 2 or 3 provides a substantial quality boost.
* **Optimal Range:** The optimal operating point appears to be around Search Depth 3 or 4, where the Rouge-L score is maximized before plateauing.
* **Cost of Depth:** The increasing standard deviation at higher depths implies that while the system *can* achieve high scores, its performance becomes less predictable. This could be due to the system retrieving more diverse or noisy information, leading to greater variance in output quality.
* **Practical Implication:** For deployment, using a Search Depth of 3 might offer the best balance of high average performance, reasonable variability, and lower computational cost compared to Depth 4 or 5. The slight dip at Depth 5 could indicate over-searching, where the system begins to incorporate less relevant information, harming consistency.
</details>
Figure 5: Effect of search depth in the Graph Exploration interaction method for a fixed number of steps. The method achieves relatively good performance with the anchor entities extracted from the question.
How does the number of steps affect the results?
We observe in Figure 5 the effect of varying the number of steps in the KG interaction methods (Agent, Explore) across all academic datasets. The plots indicate that graph exploration performs better with fewer steps, as it automatically traverses the graph from the identified anchor entities. Conversely, the agentic methods improve as the number of steps increases, eventually achieving better performance. This validates our framework's design choice to support both exploration and agentic strategies, each of which excels in a complementary regime.
<details>
<summary>figures/tree_width_academic_rl.png Details</summary>

### Visual Description
## Line Chart: Rouge-L Score vs. Tree Width
### Overview
This is a line chart comparing the performance of two methods, "ToT-Explore" and "CoT," across different "Tree Width" values. Performance is measured by the "Rouge-L" score. The chart includes a line with markers for ToT-Explore, a shaded region representing its standard deviation (SD), and a single marker for CoT.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Tree Width". It has discrete integer markers at positions 1, 2, 3, 4, and 5.
* **Y-Axis (Vertical):** Labeled "Rouge-L". It has numerical markers from 30 to 70 in increments of 5 (30, 35, 40, 45, 50, 55, 60, 65, 70).
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **ToT-Explore:** Represented by a solid orange line with circular markers.
* **SD (σ):** Represented by a light orange shaded area, indicating the standard deviation around the ToT-Explore line.
* **CoT:** Represented by a single orange "X" marker.
### Detailed Analysis
**Data Series: ToT-Explore (Orange Line with Circles)**
* **Trend:** The line shows a steep upward slope from Tree Width 1 to 2, followed by a much shallower, near-plateauing increase from Tree Width 2 to 5.
* **Data Points (Approximate):**
* Tree Width 1: Rouge-L ≈ 44
* Tree Width 2: Rouge-L ≈ 57
* Tree Width 3: Rouge-L ≈ 60
* Tree Width 4: Rouge-L ≈ 61
* Tree Width 5: Rouge-L ≈ 61
**Data Series: SD (σ) (Light Orange Shaded Region)**
* This region represents the variability or uncertainty around the ToT-Explore mean line.
* **Width/Trend:** The band is widest at Tree Width 1 (spanning roughly 40 to 48) and narrows significantly as Tree Width increases, becoming tightest around Tree Width 4 and 5 (spanning roughly 58 to 64).
**Data Point: CoT (Orange 'X')**
* There is only a single data point for CoT.
* **Position:** Located at Tree Width 1, with a Rouge-L score of approximately 44. This point aligns vertically with the first point of the ToT-Explore series.
### Key Observations
1. **Performance Gap:** At the only comparable point (Tree Width 1), the CoT method and the ToT-Explore method have nearly identical Rouge-L scores (~44).
2. **Diminishing Returns:** The ToT-Explore method shows a dramatic performance improvement when increasing Tree Width from 1 to 2. Further increases in Tree Width (2 to 5) yield only marginal gains, suggesting a point of diminishing returns.
3. **Reduced Variability:** The standard deviation (SD) for ToT-Explore decreases as Tree Width increases. This indicates that the method's performance becomes more consistent and reliable with a wider tree structure.
4. **Missing Data:** The CoT method is only plotted for Tree Width 1. No performance data is provided for CoT at Tree Widths 2-5, preventing a full comparison.
### Interpretation
The chart suggests that the "ToT-Explore" method benefits significantly from increasing the "Tree Width" parameter, but primarily in the initial step from 1 to 2. Beyond a Tree Width of 2, the method reaches a performance plateau around a Rouge-L score of 60-61. Concurrently, the method's output becomes more stable (lower SD) with greater tree width.
The single "CoT" data point serves as a baseline at the simplest configuration (Tree Width 1). The fact that ToT-Explore matches it at this point and then surpasses it (as inferred by the upward trend) implies that the ToT-Explore approach, with increased computational budget (wider trees), can achieve superior summarization quality (as measured by Rouge-L) compared to the CoT baseline. The narrowing error band further suggests that ToT-Explore is not only more effective but also more robust at higher tree widths. The absence of CoT data for wider trees leaves open the question of whether CoT would also improve or plateau under similar conditions.
</details>
Figure 6: Impact of tree width on agentic ToT performance, showing a general trend of improvement as tree width increases.
<details>
<summary>figures/state_evaluators.png Details</summary>

### Visual Description
## Grouped Bar Chart: Agent vs. Explore Performance by Model Size
### Overview
The image displays a grouped bar chart comparing the performance of two different methods or tasks, labeled "Agent" and "Explore," across three different model sizes (8B, 70B, 405B). Performance is measured by the "Rouge-L" metric. Each model size has two bars representing "Score" and "Select" conditions.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **Panels:** Two distinct panels side-by-side.
* **Left Panel Title:** "Agent"
* **Right Panel Title:** "Explore"
* **Y-Axis:**
* **Label:** "Rouge-L"
* **Scale:** Linear, ranging from 30 to 70, with major gridlines at intervals of 10 (30, 40, 50, 60, 70).
* **X-Axis (within each panel):** Model sizes.
* **Categories:** "8B", "70B", "405B"
* **Legend:** Located at the bottom center of the entire figure.
* **Solid Fill Box:** Labeled "Score"
* **Hatched Fill Box (diagonal lines):** Labeled "Select"
* **Bar Colors (by model size):**
* **8B:** Blue
* **70B:** Orange
* **405B:** Green
### Detailed Analysis
**Panel: Agent**
* **8B Model:**
* **Score (Solid Blue):** Approximately 41. Error bar spans roughly from 38 to 44.
* **Select (Hatched Blue):** Approximately 38. Error bar spans roughly from 35 to 41.
* **70B Model:**
* **Score (Solid Orange):** Approximately 60. Error bar spans roughly from 56 to 64.
* **Select (Hatched Orange):** Approximately 62. Error bar spans roughly from 58 to 66.
* **405B Model:**
* **Score (Solid Green):** Approximately 59. Error bar spans roughly from 48 to 70 (very large range).
* **Select (Hatched Green):** Approximately 65. Error bar spans roughly from 61 to 69.
**Panel: Explore**
* **8B Model:**
* **Score (Solid Blue):** Approximately 36. Error bar spans roughly from 35 to 37.
* **Select (Hatched Blue):** Approximately 35. Error bar spans roughly from 33 to 37.
* **70B Model:**
* **Score (Solid Orange):** Approximately 52. Error bar spans roughly from 48 to 56.
* **Select (Hatched Orange):** Approximately 51. Error bar spans roughly from 47 to 55.
* **405B Model:**
* **Score (Solid Green):** Approximately 52. Error bar spans roughly from 48 to 56.
* **Select (Hatched Green):** Approximately 50. Error bar spans roughly from 45 to 55.
### Key Observations
1. **Performance Trend with Model Size:** In the "Agent" panel, both "Score" and "Select" show a clear upward trend as model size increases from 8B to 70B to 405B. In the "Explore" panel, performance increases from 8B to 70B but then plateaus or slightly decreases for the 405B model.
2. **Score vs. Select Comparison:**
* In the "Agent" panel, the "Select" (hatched) bar is consistently higher than the "Score" (solid) bar for each corresponding model size.
* In the "Explore" panel, the "Score" and "Select" bars are very close in height for each model size, with no consistent advantage for either condition.
3. **Variability (Error Bars):** The error bars for the "Agent" panel, particularly for the 405B model's "Score" condition, are notably larger than those in the "Explore" panel. This indicates greater variability or uncertainty in the "Agent" task results.
4. **Absolute Performance:** The "Agent" task achieves higher peak Rouge-L scores (up to ~65) compared to the "Explore" task (peak ~52).
### Interpretation
The data suggests a fundamental difference in how model scaling affects performance on the "Agent" versus "Explore" tasks.
* **Agent Task:** This task benefits significantly from increased model scale. The consistent superiority of the "Select" condition over "Score" implies that a selection-based approach within the agent framework is more effective than a scoring-based one, and this advantage grows with model capability. The large error bar for the 405B "Score" suggests that while the model has high potential, its performance in this specific mode is unstable.
* **Explore Task:** Performance improves when scaling from a small (8B) to a medium (70B) model but shows diminishing returns or even a slight regression at the largest (405B) scale. The negligible difference between "Score" and "Select" indicates that the method of evaluation or action selection is not a critical factor for this task. The lower overall scores and smaller error bars suggest the "Explore" task may be inherently more constrained or less responsive to raw model scale than the "Agent" task.
**In summary,** the chart demonstrates that model scaling is highly task-dependent. The "Agent" task appears to be a "scaling-friendly" problem where larger models and specific strategies ("Select") yield substantial gains, while the "Explore" task hits a performance ceiling earlier, and the choice of strategy is less impactful.
</details>
Figure 7: Influence of the state evaluators in ToT. The Select method obtains better results than the Score method.
What is the effect of Search Depth in Automatic Graph Exploration?
We observe the effect of search depth in Figure 5, which presents performance results across various depths with a fixed step size of one. The results show that the performance of depth-first search plateaus at a depth of 3; beyond this point, deeper traversal yields no significant gains, likely because more distant nodes are less relevant to the query. This explains why shallow, targeted exploration suffices in our framework, keeping search efficient without sacrificing accuracy.
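The shallow-plateau behavior can be illustrated with a minimal depth-limited expansion routine. This is a sketch, not our exact implementation; `graph`, `relevance`, and `keep_top_k` are illustrative names, with `relevance` standing in for whatever query-similarity score prunes the frontier at each hop:

```python
def explore(graph, anchors, max_depth=3, keep_top_k=5, relevance=None):
    """Depth-limited expansion from anchor entities with per-hop pruning.

    `graph` maps a node to its neighbors; `relevance` scores a node
    against the query (constant by default, i.e., no pruning pressure).
    """
    relevance = relevance or (lambda node: 0.0)
    visited, frontier = set(anchors), list(anchors)
    for _ in range(max_depth):
        candidates = [nbr for node in frontier
                      for nbr in graph.get(node, [])
                      if nbr not in visited]
        # keep only the most query-relevant neighbors (pruning step)
        frontier = sorted(set(candidates), key=relevance, reverse=True)[:keep_top_k]
        visited.update(frontier)
        if not frontier:  # nothing new to expand
            break
    return visited
```

Under this view, increasing `max_depth` past the point where `relevance` has decayed adds nodes without adding useful evidence, which matches the plateau observed at depth 3.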
What is the effect of tree width in the reasoning strategy (ToT)?
Based on experimental results across all academic datasets, we observe performance variations among different methods. To gain further insight, Figure 6 shows the effect of tree width on results. We notice a slight upward trend in performance as tree width increases, although the difference is most pronounced between CoT and ToT itself, i.e., when going from one branch to two. The added computational time and resources likely contribute to this performance enhancement.
What is the influence of the state evaluator?
We observe in Figure 7 the impact of state evaluators, specifically Score and Select, within the ToT framework. The analysis indicates that, while there is no significant difference between the two methods, the Select evaluator generally yields slightly better results. This trend is especially evident in the context of the Agent's performance, though the advantage is less pronounced in automatic graph exploration.
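The distinction between the two evaluators can be sketched as follows. This is an illustrative outline rather than our exact implementation: `propose`, `score`, and `select` stand in for LLM calls, and `width` is the ToT branching factor discussed above. Score rates each candidate state independently, while Select asks the model to pick the best states jointly from the full pool:

```python
def expand_score(states, propose, score, width=3):
    """'Score' evaluator: rate each candidate independently,
    then keep the `width` highest-scoring states."""
    candidates = [child for state in states for child in propose(state)]
    return sorted(candidates, key=score, reverse=True)[:width]

def expand_select(states, propose, select, width=3):
    """'Select' evaluator: hand the whole candidate pool to the model
    and let it choose `width` states by comparing them against each other."""
    candidates = [child for state in states for child in propose(state)]
    return select(candidates, width)
```

The joint comparison in Select gives the model relative context between candidates, which is one plausible reason it edges out independent scoring in the Agent setting.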
How are errors different for each strategy?
<details>
<summary>figures/pie_cot.png Details</summary>

### Visual Description
## Pie Chart: Percentage Distribution
### Overview
The image displays a simple pie chart divided into three segments, each labeled with a percentage value. The chart has no title, legend, or axis labels. The segments are distinguished by color and are separated by thin white borders.
### Components/Axes
* **Chart Type:** Pie Chart.
* **Segments:** Three colored segments.
* **Labels:** Each segment contains a percentage value directly on it.
* **Legend:** None present.
* **Title/Axis Labels:** None present.
### Detailed Analysis
The chart is composed of three segments with the following properties, listed clockwise from the top:
1. **Orange Segment:**
* **Color:** Orange.
* **Label/Value:** "5.5%".
* **Spatial Position:** A small slice located at the top of the chart, slightly left of the vertical center.
* **Visual Trend:** This is the smallest segment by a significant margin.
2. **Blue Segment:**
* **Color:** Medium blue.
* **Label/Value:** "34.1%".
* **Spatial Position:** Occupies the right side of the chart, from the top-right to the bottom-right.
* **Visual Trend:** This is the second-largest segment.
3. **Green Segment:**
* **Color:** Teal or sea-green.
* **Label/Value:** "60.4%".
* **Spatial Position:** Occupies the entire left side and bottom of the chart, making it the dominant visual element.
* **Visual Trend:** This is the largest segment, representing a clear majority.
**Data Verification:** The sum of the percentages (5.5% + 34.1% + 60.4%) equals 100.0%, confirming the data is consistent for a whole-part relationship.
### Key Observations
* **Dominant Majority:** The green segment (60.4%) constitutes more than half of the total, indicating a clear majority share for its associated category.
* **Significant Minority:** The blue segment (34.1%) represents a substantial portion, roughly one-third of the whole.
* **Minor Component:** The orange segment (5.5%) is a very small fraction, suggesting it represents a minor or niche category.
* **Missing Context:** The chart lacks a title, legend, or any descriptive text to explain what the percentages or colors represent. This is the most critical missing information for full interpretation.
### Interpretation
This pie chart visually communicates a proportional distribution among three categories. The data suggests a scenario where one category (green) holds a dominant position, another (blue) holds a strong secondary position, and a third (orange) is marginal.
**What the data suggests:** The distribution is highly unequal. The relationship between the parts is not balanced; it's a hierarchy with a clear leader, a challenger, and a distant third.
**Why it matters (with missing context):** Without labels, the real-world significance is unknown. However, the structure is common in analyses such as:
* **Market Share:** Company A (green) dominates the market, Company B (blue) is a major competitor, and others (orange) have minimal share.
* **Survey Results:** A majority opinion (green), a significant opposing view (blue), and a small undecided or alternative group (orange).
* **Resource Allocation:** One department or project consumes the majority of resources (green), another a significant portion (blue), and a third very little (orange).
**Notable Anomalies:** The primary anomaly is the complete lack of descriptive metadata (title, legend). This renders the chart informative only in the abstract sense of showing proportions, but useless for conveying specific, actionable information. The small size of the orange segment could be an outlier depending on the context, representing a negligible factor or a new, emerging category.
**Conclusion:** The chart effectively shows a 60/35/5 split between three unnamed entities. To derive any meaningful insight, the categories corresponding to the green, blue, and orange segments must be identified.
</details>
(a) CoT
<details>
<summary>figures/pie_tot.png Details</summary>

### Visual Description
## Pie Chart: Proportional Distribution of Three Categories
### Overview
The image displays a simple pie chart divided into three segments, each representing a percentage of a whole. The chart contains no title, legend, or axis labels, presenting only the colored segments and their corresponding percentage values.
### Components/Axes
- **Chart Type**: Pie Chart (circular statistical graphic)
- **Segments**: Three distinct colored segments.
- **Labels**: Each segment has a percentage value directly overlaid on it.
- **Legend**: None present.
- **Title/Axis Labels**: None present.
### Detailed Analysis
The chart is composed of three segments with the following characteristics, listed from largest to smallest:
1. **Teal/Green Segment**
* **Color**: A medium teal or sea-green.
* **Percentage**: **67.8%**
* **Spatial Placement**: This is the dominant segment, occupying the entire bottom half and extending up the right side of the circle. It starts from approximately the 3 o'clock position and sweeps clockwise to roughly the 10 o'clock position.
* **Visual Trend**: This segment represents a clear majority, comprising over two-thirds of the total.
2. **Orange Segment**
* **Color**: A burnt orange or rust color.
* **Percentage**: **21.3%**
* **Spatial Placement**: Located in the upper-left quadrant of the pie. It is adjacent to the teal segment on its left side and the blue segment on its right.
* **Visual Trend**: Represents a significant minority, roughly one-fifth of the total.
3. **Blue Segment**
* **Color**: A medium blue.
* **Percentage**: **10.9%**
* **Spatial Placement**: Located in the upper-right quadrant, forming a wedge between the orange and teal segments.
* **Visual Trend**: Represents the smallest portion, approximately one-tenth of the total.
**Data Integrity Check**: The sum of the percentages (67.8% + 21.3% + 10.9%) equals **100.0%**, confirming the chart represents a complete whole.
### Key Observations
* **Dominant Majority**: The teal segment (67.8%) is overwhelmingly the largest category, more than three times the size of the next largest segment.
* **Clear Hierarchy**: There is a distinct and significant drop-off between each category: from majority (67.8%) to a substantial minority (21.3%) to a small minority (10.9%).
* **Missing Context**: The most critical observation is the complete absence of categorical labels. The chart shows *how much* but provides zero information on *what* is being measured. There is no title, legend, or annotation to define what the teal, orange, and blue segments represent.
### Interpretation
This pie chart effectively communicates a proportional relationship where one category holds a commanding majority share. The visual weight of the teal segment immediately draws the eye, emphasizing its dominance.
However, without categorical labels, the chart's informational value is purely abstract. It demonstrates a **distribution pattern** (a large majority, a medium minority, and a small minority) but cannot convey any specific meaning. To be useful, it would require a title (e.g., "Market Share by Company," "Survey Response Distribution," "Budget Allocation by Department") and a legend mapping colors to specific categories (e.g., Teal = "Product A," Orange = "Product B," Blue = "Product C").
The design is minimalist, using direct labeling on the segments for clarity, which is effective for a simple three-category chart. The color choices are distinct and provide good visual separation between the segments.
</details>
(b) ToT
<details>
<summary>figures/pie_got.png Details</summary>

### Visual Description
## Pie Chart: Percentage Distribution
### Overview
The image displays a simple pie chart divided into three segments, each labeled with a percentage value. The chart has no title, legend, or categorical labels, presenting only the proportional distribution of three unnamed categories.
### Components/Axes
* **Chart Type:** Pie Chart.
* **Segments:** Three distinct segments, differentiated by color.
* **Labels:** Each segment contains a numerical percentage label.
* **Legend:** None present.
* **Title:** None present.
* **Spatial Layout:** The chart is centered on a white background. The segments are arranged clockwise from the top.
### Detailed Analysis
The chart is composed of the following three segments, listed clockwise starting from the top-right:
1. **Blue Segment (Top-Right Quadrant):**
* **Color:** Medium blue.
* **Label:** "20.0%"
* **Approximate Size:** Represents one-fifth of the total circle.
2. **Orange Segment (Top-Left Quadrant):**
* **Color:** Burnt orange.
* **Label:** "12.2%"
* **Approximate Size:** The smallest segment, representing roughly one-eighth of the total.
3. **Green Segment (Bottom Half):**
* **Color:** Teal green.
* **Label:** "67.8%"
* **Approximate Size:** The dominant segment, occupying more than two-thirds of the entire chart area.
**Data Verification:** The sum of the labeled percentages is 20.0% + 12.2% + 67.8% = 100.0%, confirming the chart represents a complete whole.
### Key Observations
* **Dominant Category:** The green segment (67.8%) is overwhelmingly the largest, indicating a clear majority share.
* **Minority Categories:** The blue (20.0%) and orange (12.2%) segments together constitute the remaining 32.2%, with the orange segment being the smallest minority.
* **Missing Context:** The chart provides no information about what the categories represent (e.g., market share, survey responses, budget allocation). The data is purely numerical and proportional.
### Interpretation
This pie chart visually communicates a **highly skewed distribution** among three unnamed entities. The primary insight is the existence of a dominant majority (green, ~68%) and two smaller minorities (blue ~20%, orange ~12%).
**What the data suggests:** Without categorical labels, the data's meaning is abstract. However, the structure implies a scenario where one option, group, or outcome is significantly more prevalent than the other two combined. This could represent:
* A market with one leading product and two smaller competitors.
* The results of a vote or poll with a clear winner.
* The composition of a whole (like a budget or resource pool) where one component consumes the vast majority.
**Notable Anomalies:** The complete absence of a title, legend, or category names is the most significant anomaly. This renders the chart illustrative of a *type* of distribution but useless for conveying specific, actionable information. It functions as a template or a generic example of proportional data.
**Peircean Investigative Reading:** The chart is an **icon** (it resembles a pie chart) and an **index** (it points to the existence of a three-part whole). However, it fails as a **symbol** because the symbols (colors, segments) are not mapped to specific meanings. The viewer can deduce the *relationship* between the parts (one is large, two are small) but not the *identity* of the parts. The chart's value is in demonstrating a lopsided proportional relationship, not in communicating specific facts.
</details>
(c) GoT
<details>
<summary>figures/pie_legend.png Details</summary>

### Visual Description
## Legend Key: Process Outcome Categories
### Overview
The image is a horizontal legend key, likely extracted from a larger chart or diagram. It defines three categorical outcomes using colored rectangles paired with descriptive text labels. The legend is presented on a plain white background with no other chart elements visible.
### Components/Axes
The legend consists of three horizontally aligned entries, each with a colored rectangle on the left and a text label on the right.
1. **First Entry (Leftmost):**
* **Color:** Solid blue rectangle.
* **Label Text:** "Reached limit"
2. **Second Entry (Center):**
* **Color:** Solid green rectangle.
* **Label Text:** "Found answer but not returned"
3. **Third Entry (Rightmost):**
* **Color:** Solid orange/brown rectangle.
* **Label Text:** "Wrong reasoning"
**Spatial Grounding:** The three legend items are positioned in a single row, spaced evenly from left to right. The colored rectangles are vertically centered with their respective text labels.
### Detailed Analysis
* **Color-to-Label Mapping:**
* **Blue** is explicitly mapped to the outcome "Reached limit."
* **Green** is explicitly mapped to the outcome "Found answer but not returned."
* **Orange/Brown** is explicitly mapped to the outcome "Wrong reasoning."
* **Text Transcription:** All text is in English and is transcribed verbatim above. The font appears to be a standard serif typeface (e.g., Times New Roman).
* **Data Structure:** This is a categorical legend, not a data chart. It does not contain numerical values, axes, or trends. Its sole purpose is to define a color-coding scheme.
### Key Observations
* The image contains **only** the legend. There is no accompanying chart, graph, or diagram to provide context for what these categories measure or represent.
* The categories describe potential failure or termination states in a process, likely related to problem-solving, search algorithms, or reasoning systems.
* The color choices (blue, green, orange) are distinct and would provide clear visual differentiation in a corresponding chart.
### Interpretation
This legend defines a taxonomy for classifying the outcomes of a process. The labels suggest a system that attempts to find answers or solutions but can terminate in different ways:
1. **"Reached limit" (Blue):** Implies the process was halted due to a constraint (e.g., time, computational steps, memory) before a conclusion was reached.
2. **"Found answer but not returned" (Green):** Suggests a successful internal discovery that was not communicated or outputted, indicating a potential failure in the reporting or delivery mechanism.
3. **"Wrong reasoning" (Orange):** Points to a logical or procedural error within the process itself, leading to an incorrect conclusion.
**Peircean Investigation:** The legend is an *index* pointing to a missing *icon* (the chart). It signifies that the associated visual data is categorized by these three failure modes. The absence of a "Success" or "Correct Answer" category is notable, implying the larger chart may specifically analyze error types or process limitations rather than overall performance. The color green, often associated with success, is here used for a partial failure ("found but not returned"), which is an interesting subversion of common color semantics and would be critical for correctly interpreting the missing chart.
</details>
Figure 8: Error distribution across strategies. ToT and GoT reduce unanswered cases but increase logical errors due to more complex reasoning.
To understand failure patterns, we define three error types: (1) Reached limit: the reasoning hit the step limit; (2) Answer found but not returned: the correct answer appeared in the trace but was not output; (3) Wrong reasoning step: the model followed an illogical step. Using GPT-4o, we labeled a larger set of answers and traces. We observe in Figure 8 that ToT and GoT show more "answer found but not returned" cases than CoT, suggesting better retrieval but occasional failures in synthesis. This comes with a slight rise in logical errors, likely due to the complexity of managing multiple reasoning paths.
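The taxonomy can be made concrete with a rule-based sketch. Note that the actual labeling uses GPT-4o as the judge; the heuristics below (string containment against the gold answer, a hypothetical `step_limit`) only illustrate how the three categories partition failed traces:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    steps: list        # intermediate reasoning steps (strings)
    final_answer: str  # the answer actually returned

def classify_error(trace, gold, step_limit=10):
    """Assign one of the three error types to a trace, or None on success.
    Rule-based stand-in for the GPT-4o labeler used in the paper."""
    if gold in trace.final_answer:
        return None  # correct answer returned: not an error case
    if any(gold in step for step in trace.steps):
        return "Found answer but not returned"
    if len(trace.steps) >= step_limit:
        return "Reached limit"
    return "Wrong reasoning"
```

The second branch captures exactly the synthesis failures that grow under ToT and GoT: the gold answer surfaces mid-trace but never reaches the final output.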
## 8 Limitations
In this work, we demonstrate how LLMs can be used to explore a graph while conditioning the next steps based on the graphâs results. We show that the two approaches presented achieve superior results in graph exploration. Integrating KGs with LLMs can provide complex relational knowledge for LLMs to leverage. However, the overall effectiveness depends heavily on both the coverage and quality of the underlying graph, as well as the capabilities of the language model.
Extending inference-time reasoning methods for LLMs is significantly constrained by computational resources and the time available to the user. We analyze the computational complexity of the methods in Appendix E, where we show the exponential growth of ToT and GoT due to their branching structure. GoT further compounds this by allowing merges between reasoning paths, which increases the total number of evaluations. Additionally, loading large graphs into memory introduces substantial RAM overhead, limiting applicability to resource-rich environments.
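As a rough order-of-magnitude sketch of the branching cost (assuming every state expands into `width` children with no pruning, which is an upper bound rather than the exact count from Appendix E), the number of candidate states is a geometric sum in depth:

```python
def tot_states(width, depth):
    """Upper bound on candidate states a width-b, depth-d ToT run evaluates:
    b + b^2 + ... + b^d, i.e., geometric growth in depth."""
    return sum(width ** level for level in range(1, depth + 1))
```

GoT adds merge evaluations between paths on top of this count, which is why both methods are practical only at the small widths and depths used in our experiments.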
While LLMs conditioned on external knowledge can generate outputs based on accessed content, their generated output is not strictly limited to that information. Thus, they may still hallucinate. Our framework helps mitigate this risk by grounding reasoning in explicit graph structure, but does not eliminate it.
## 9 Conclusion
We present a framework for grounding LLM reasoning in KGs by integrating each step with structured graph retrieval. By combining strategies like CoT, ToT, and GoT with adaptive graph search, our method achieves state-of-the-art performance on GRBench. Beyond performance, we find that explicitly linking reasoning steps to graph structure offers a more interpretable view of how LLMs navigate knowledge. The approach enables inference-time reasoning, offering flexibility across domains, and suggests a path toward reasoning interventions that are both systematic and transparent. Future work includes extending this framework to larger and more heterogeneous graphs, and exploring how structured retrieval can guide reasoning in domains where accuracy and verifiability are critical.
Disclaimers
This [paper/presentation] was prepared for informational purposes ["in part" if the work is collaborative with external partners] by the Artificial Intelligence Research group of JPMorganChase and its affiliates ("JP Morgan") and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
©2025 JPMorganChase. All rights reserved.
## Appendix A GRBench Statistics
Detailed statistics of the graphs in GRBench graphCoT are shown in Table 3. Academic graphs contain 3 types of nodes: paper, author, and venue. Literature graphs contain 4 types of nodes: book, author, publisher, and series. The healthcare graph contains 11 types of nodes: anatomy, biological process, cellular component, compound, disease, gene, molecular function, pathway, pharmacologic class, side effect, and symptom. Questions are created from multiple templates labeled as easy, medium, or hard, depending on the number of nodes required to produce the answer.
| Domain | Topic | # Nodes | # Edges | # Templates | # Questions |
| --- | --- | --- | --- | --- | --- |
| Academic | Biology | $\sim$4M | $\sim$39M | 14 | 140 |
| | Chemistry | $\sim$4M | $\sim$30M | 14 | 140 |
| | Material Science | $\sim$3M | $\sim$22M | 14 | 140 |
| | Medicine | $\sim$6M | $\sim$30M | 14 | 140 |
| | Physics | $\sim$2M | $\sim$33M | 14 | 140 |
| Literature | Goodreads | $\sim$3M | $\sim$22M | 24 | 240 |
| Healthcare | Disease | $\sim$47K | $\sim$4M | 27 | 270 |
| SUM | - | - | - | 121 | 1210 |
Table 3: Detailed statistics of the graphs in GRBench graphCoT.
## Appendix B LLM ↔ KG Interaction Pipelines
Description of the two LLM + KG interaction pipelines in their CoT form:
1. Agent (Figure 9): a pipeline where the LLM alternates between generating a reasoning step, selecting an explicit action (e.g., retrieving a node, checking neighbors), and observing results from the KG until termination.
2. Automatic Graph Exploration (Figure 10): a pipeline where entities are automatically extracted from the LLM's generated text and used to guide iterative graph traversal with pruning, progressively expanding the reasoning chain.
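The agent pipeline above can be sketched as a thought/action/observation loop. This is a minimal illustration, not the paper's implementation: the `llm` and `graph` interfaces are hypothetical stand-ins, while the action names mirror the five operations listed in Figure 9.

```python
# Sketch of the Agent pipeline (Figure 9): the LLM alternates between a
# Thought, an Action on the knowledge graph, and an Observation, until it
# emits Finish[answer] or the step limit is reached.
# NOTE: `llm` and `graph` are hypothetical interfaces, not the paper's code.

def agent_pipeline(llm, graph, query, max_steps=10):
    trace = [f"Question: {query}"]
    for _ in range(max_steps):
        thought = llm.generate("\n".join(trace) + "\nThought:")
        trace.append(f"Thought: {thought}")
        # e.g. ("RetrieveNode", "aspirin"); argument handling is simplified here.
        action, arg = llm.select_action("\n".join(trace))
        trace.append(f"Action: {action}[{arg}]")
        if action == "Finish":
            return arg  # the termination action carries the final answer
        # Dispatch the remaining graph operations from Figure 9.
        handlers = {
            "RetrieveNode": graph.retrieve_node,    # text -> best-matching node
            "NodeFeature": graph.node_feature,      # (node, feature) -> value
            "NeighborCheck": graph.neighbor_check,  # (node, edge type) -> neighbors
            "NodeDegree": graph.node_degree,        # (node, edge type) -> count
        }
        observation = handlers[action](arg)
        trace.append(f"Observation: {observation}")
    return None  # step limit reached without a Finish action
```

The observation is appended to the running trace, so each new thought is conditioned on everything retrieved so far.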
<details>
<summary>figures/agent_pipeline.png Details</summary>

### Visual Description
## Diagram: Agent-Based Reasoning Process with Knowledge Graph
### Overview
The image is a flowchart diagram illustrating a cyclical, agent-driven reasoning process. It depicts a system where a query initiates a loop of thought, action, and observation, with actions performed on a knowledge graph. The diagram uses simple boxes, arrows, and icons to represent components and data flow.
### Components/Axes
The diagram consists of four primary rectangular boxes connected by directional arrows, forming a loop. Below the "Action" box is a list of specific functions. To the right, connected to the "Observation" box, is a network graph icon labeled "Knowledge Graph."
**Primary Flow Components (from left to right, top):**
1. **Box 1:** Labeled "Query". A right-pointing arrow connects it to the next box.
2. **Box 2:** Labeled "Thought". Contains a small icon of a robot head (blue) with a yellow thought bubble. A right-pointing arrow connects it to the next box.
3. **Box 3:** Labeled "Action". Contains a small icon of a robot head (blue) with a yellow gear or action symbol. A right-pointing arrow connects it to the next box.
4. **Box 4:** Labeled "Observation". A left-pointing arrow originates from this box and loops back to the "Thought" box, completing the cycle.
**Action Sub-Components (listed below the "Action" box):**
A list of five functions or operations is presented, each in a `Function[Parameters]` format:
* `RetrieveNode[Text]`
* `NodeFeature[N,F]`
* `NeighborCheck[N,E]`
* `NodeDegree[N,E]`
* `Finish[answer]`
**Knowledge Graph Component (bottom-right):**
* An icon depicting a network graph with interconnected nodes (circles) of varying sizes and colors (green and blue).
* A label below the icon reads: "Knowledge Graph".
* An upward-pointing arrow connects this icon to the "Observation" box.
### Detailed Analysis
The diagram outlines a specific, iterative workflow:
1. **Initiation:** The process begins with a "Query".
2. **Reasoning Loop:** The core is a three-stage loop:
* **Thought:** The agent processes information (indicated by the thought bubble icon).
* **Action:** The agent executes one of five possible operations on a knowledge graph. The parameters suggest graph-based operations: `N` likely represents a Node, `F` a Feature, and `E` an Edge or neighbor.
* **Observation:** The agent receives feedback or results from the action, which is informed by the "Knowledge Graph".
3. **Feedback:** The "Observation" feeds back into the "Thought" stage, allowing for iterative reasoning and refinement.
4. **Termination:** The final function in the list, `Finish[answer]`, indicates a concluding action that produces the final output, breaking the loop.
### Key Observations
* **Cyclical Flow:** The most prominent feature is the feedback loop from "Observation" back to "Thought", emphasizing an iterative, non-linear process.
* **Agent-Centric Design:** The use of robot icons in the "Thought" and "Action" boxes personifies the system as an active agent performing cognitive and physical tasks.
* **Structured Action Space:** The five listed functions under "Action" define a clear, limited set of operations the agent can perform, all related to querying and analyzing a graph structure.
* **External Knowledge Source:** The "Knowledge Graph" is depicted as an external data source that the "Action" functions operate on and the "Observation" is derived from.
### Interpretation
This diagram models a **ReAct (Reasoning + Acting) style agent** designed for **graph-based question answering or analysis**. The process suggests the agent doesn't just retrieve static information but actively explores a knowledge graph.
* **What it demonstrates:** The system is built for complex queries requiring multi-step reasoning over connected data. The agent formulates a thought (e.g., "I need to find nodes related to X"), chooses an action (e.g., `RetrieveNode` or `NeighborCheck`), observes the result from the graph, and then uses that new information to inform its next thought.
* **Relationships:** The "Action" functions are the bridge between the agent's internal reasoning ("Thought") and the external structured data ("Knowledge Graph"). The "Observation" is the direct result of applying an action to the graph.
* **Notable Design Choice:** The inclusion of `NodeDegree` and `NeighborCheck` as primitive actions indicates the agent can perform structural analysis of the graph (assessing connectivity and importance of nodes), not just content retrieval. This implies capability for tasks like finding central entities or tracing relationship paths.
* **Underlying Paradigm:** The loop embodies an investigative, trial-and-error approach to problem-solving, where the agent refines its understanding through successive interactions with the data source. The `Finish[answer]` action represents the point of sufficient confidence to generate a final response.
</details>
Figure 9: Agent Pipeline: (1) Input Query, (2) Thought Generation, (3) Action Selection, (4) Environment Observation from the Knowledge Graph. The process repeats until a termination action is generated or the step limit is reached.
<details>
<summary>figures/graph_explore_pipeline.png Details</summary>

### Visual Description
## Diagram: Iterative Knowledge Graph Query Processing Flowchart
### Overview
The image displays a horizontal flowchart illustrating an iterative, AI-assisted process for answering a query using a knowledge graph. The process involves sequential steps with decision points and feedback loops, emphasizing an automated, reasoning-based approach.
### Components/Axes
The diagram consists of five primary rectangular process boxes connected by directional arrows, along with supporting icons and a data source representation.
**Process Boxes (from left to right):**
1. **Query**: The starting point of the process.
2. **Thought**: The second step, accompanied by a small robot icon and a smaller, orange, cat-like icon beneath it.
3. **Entity Extraction**: The third step, also accompanied by the same robot and orange icon pair.
4. **Graph Search**: The fourth step, accompanied by the robot and orange icon pair. A thick, upward-pointing arrow connects this box to an external element.
5. **End?**: The final decision box, phrased as a question.
**Supporting Elements:**
* **Icons**: A blue robot head icon appears beneath the "Thought," "Entity Extraction," and "Graph Search" boxes. A smaller, orange, cat-like or creature icon appears to the left of each robot icon.
* **Knowledge Graph**: Positioned directly below the "Graph Search" box. It is represented as a network diagram of interconnected nodes (circles) in shades of green and blue, with lines representing edges. The label "Knowledge Graph" is written below this network in a serif font.
* **Arrows/Flow**:
* A standard arrow points from "Query" to "Thought."
* A standard arrow points from "Thought" to "Entity Extraction."
* A standard arrow points from "Entity Extraction" to "Graph Search."
* A standard arrow points from "Graph Search" to "End?"
* A **feedback loop arrow** originates from the right side of the "End?" box, travels upward, then left across the top of the diagram, and finally downward to point into the top of the "Thought" box. This indicates the process can iterate.
### Detailed Analysis
The workflow is linear with a conditional loop:
1. A **Query** is input into the system.
2. The system engages in a **Thought** process (likely reasoning or planning), assisted by AI (indicated by the robot icon).
3. It then performs **Entity Extraction**, identifying key subjects or concepts from the query or intermediate thoughts, again with AI assistance.
4. The extracted entities are used for a **Graph Search** against a **Knowledge Graph**. This step is explicitly linked to the external knowledge base.
5. The process reaches a decision point: **End?**. If the answer is "no" (implied by the loop), the flow returns to the **Thought** stage for refinement, creating an iterative cycle. If "yes," the process concludes.
### Key Observations
* **Iterative Nature**: The most prominent feature is the feedback loop from "End?" back to "Thought," suggesting the system refines its approach until a satisfactory answer is found.
* **AI Integration**: The consistent robot icon under three core processing steps ("Thought," "Entity Extraction," "Graph Search") highlights that these are AI-driven operations.
* **Knowledge Graph as Core Resource**: The "Graph Search" step is the only one directly connected to an external data structure (the Knowledge Graph), positioning it as the central repository of factual information for answering the query.
* **Ambiguity at Termination**: The "End?" box is a question, making the termination condition the only ambiguous part of the flowchart. The criteria for ending the loop are not specified.
### Interpretation
This diagram models a sophisticated, agentic AI system designed for complex question answering. It doesn't just retrieve data; it *reasons* ("Thought"), structures the problem ("Entity Extraction"), and then queries a structured knowledge base ("Graph Search"). The loop implies that initial attempts may be insufficient, requiring the system to re-think and re-search, mimicking a human research process of hypothesis, investigation, and refinement.
The separation of "Thought" and "Entity Extraction" suggests a two-stage reasoning process: first, understanding the query's intent and context, and second, formalizing that understanding into concrete entities suitable for a graph database query. The system's intelligence lies in this translation from natural language to structured search and the ability to iteratively improve based on results. The Knowledge Graph is not just a database but an active component that shapes the search and, by extension, the subsequent thinking in the loop.
</details>
Figure 10: Automatic Graph Exploration Pipeline: (1) Input Query, (2) Thought Generation, (3) Entity Extraction (from the query or a thought, via the LLM), (4) Automatic Graph Search as described in Algorithm 1, (5) Query the LLM for an answer or continue.
Algorithm 1 presents the pseudocode for the Automatic Graph Exploration described in Section 4.2.2.
Algorithm 1 Graph Exploration Algorithm
1: procedure GraphExplore( $LLM, seen\_entities, search\_depth$ )
2: $relevant\_attributes, found\_triples \leftarrow \emptyset$
3: for $depth$ in $search\_depth$ do
4: for $entity$ in $seen\_entities$ do
5: if $seen\_entities[entity].visited$ == True then
6: Continue
7: else
8: $seen\_entities[entity] \leftarrow$ Visited
9: end if
10: $head\_entity\_name, entity\_attributes, neighbors \leftarrow$ Graph[ $entity$ ]
11: $pruned\_neighbors \leftarrow$ prune_relations( $LLM, neighbors$ )
12: $pruned\_neighbors \leftarrow$ prune_entities( $LLM, pruned\_neighbors$ )
13: $found\_triples \leftarrow$ generate_triples( $entity, pruned\_neighbors$ )
14: end for
15: $seen\_entities \leftarrow$ Update( $seen\_entities, neighbors$ )
16: if End?( $LLM, found\_triples, relevant\_attributes$ ) == True then
17: break
18: end if
19: end for
20: return $found\_triples, relevant\_attributes, seen\_entities$
21: end procedure
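Algorithm 1 can be rendered as the following Python sketch. The `prune_relations`, `prune_entities`, and `is_end` placeholders stand in for the LLM calls of Section 4.2.2 (here they trivially keep everything and stop once any triple is found), and `graph` is a hypothetical adjacency mapping:

```python
# Sketch of Algorithm 1 (Automatic Graph Exploration).
# Placeholder helpers: in the actual pipeline these are LLM-guided calls.
def prune_relations(llm, neighbors): return neighbors   # keep all relation types
def prune_entities(llm, neighbors): return neighbors    # keep all neighbor entities
def is_end(llm, triples, attrs): return bool(triples)   # stop once context exists

def graph_explore(llm, graph, seen_entities, search_depth):
    relevant_attributes, found_triples = [], []
    for depth in range(search_depth):
        frontier = []
        for entity_id, state in list(seen_entities.items()):
            if state["visited"]:
                continue  # each node is expanded at most once
            state["visited"] = True
            # graph[entity_id] -> (name, attribute dict, [(relation, neighbor), ...])
            name, attributes, neighbors = graph[entity_id]
            relevant_attributes.append((name, attributes))
            # Two-stage pruning: drop irrelevant relations, then irrelevant entities.
            pruned = prune_entities(llm, prune_relations(llm, neighbors))
            found_triples += [(name, rel, nbr) for rel, nbr in pruned]
            frontier += [nbr for _, nbr in pruned]
        for nbr in frontier:  # expand the frontier for the next depth level
            seen_entities.setdefault(nbr, {"visited": False})
        if is_end(llm, found_triples, relevant_attributes):
            break  # the LLM judges the gathered context sufficient
    return found_triples, relevant_attributes, seen_entities
```

Each depth level visits only the unvisited frontier, so the traversal is a pruned breadth-first expansion bounded by `search_depth`.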
## Appendix C Performance results in plots
Figures 11 and 12 illustrate the performance results using the Rouge-L and GPT4Score metrics, respectively, on the healthcare graph for all methods. The results were obtained with the Llama 3.1 Instruct models (8B, 70B, and 405B-FP8) and demonstrate the improved performance achieved through more complex reasoning and search strategies at inference time.
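As background on the first metric, Rouge-L scores a generated answer against the reference via their longest common subsequence (LCS). Below is a minimal sketch of the standard LCS-based F-measure over whitespace tokens; the exact scorer used for these plots may differ in tokenization and parameters:

```python
# Minimal Rouge-L: LCS-based F-measure over whitespace tokens.
# beta=1.2 follows the original ROUGE default; this is the textbook
# formulation, not necessarily the exact GRBench evaluation script.

def lcs_length(a, b):
    # Classic dynamic program: dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    # Weighted harmonic mean of LCS precision and recall.
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

A perfect match yields 1.0, and answers sharing no tokens with the reference yield 0.0; the bar charts report these scores scaled to 0-100.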
<details>
<summary>figures/healthcare_rl_all.png Details</summary>

### Visual Description
## Grouped Bar Chart: Rouge-L Score Comparison Across Models and Methods
### Overview
This image is a grouped bar chart comparing the performance of different AI models (8B, 70B, 405B parameters) across various reasoning methods, as measured by the Rouge-L Score. The chart evaluates two types of state evaluators ("Score" and "Select") for each method.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:** Labeled "Rouge-L Score". Scale ranges from 0 to 50, with major gridlines at intervals of 10.
* **X-Axis:** Lists 9 distinct reasoning methods. From left to right:
1. Base
2. Text-RAG
3. Graph-RAG
4. Graph-CoT Agent
5. Graph-CoT Explore
6. Graph-ToT Agent
7. Graph-ToT Explore
8. Graph-GoT Agent
9. Graph-GoT Explore
* **Legend 1 (Top-Left):** Titled "Models". Defines color coding for model size:
* Blue: 8B
* Orange: 70B
* Green: 405B
* **Legend 2 (Top-Center):** Titled "State Evaluators". Defines bar fill pattern:
* Solid Fill: "Score"
* Hatched Fill (diagonal lines): "Select"
* **Spatial Layout:** The two legends are positioned in the top-left and top-center of the chart area. The bars are grouped by method, with each group containing up to six bars (three models, each with two evaluator types).
### Detailed Analysis
Each method group on the x-axis contains bars for the 8B (blue), 70B (orange), and 405B (green) models. Within each model color, the left bar is solid ("Score" evaluator) and the right bar is hatched ("Select" evaluator).
**Trend Verification & Data Extraction (Approximate Values):**
1. **Base:**
* 8B (Blue): ~7.5 (Score), ~8 (Select)
* 70B (Orange): ~10 (Score), ~10 (Select)
* 405B (Green): ~9 (Score), ~9 (Select)
* *Trend:* Scores are low (<10). Larger models do not show a clear advantage here.
2. **Text-RAG:**
* 8B (Blue): ~8.5 (Score), ~8.5 (Select)
* 70B (Orange): ~10.5 (Score), ~10.5 (Select)
* 405B (Green): ~11.5 (Score), ~11.5 (Select)
* *Trend:* Slight improvement over Base. A small, consistent increase with model size.
3. **Graph-RAG:**
* 8B (Blue): ~13 (Score), ~13 (Select)
* 70B (Orange): ~18 (Score), ~18 (Select)
* 405B (Green): ~16 (Score), ~16 (Select)
* *Trend:* Notable jump from Text-RAG. The 70B model performs best here.
4. **Graph-CoT Agent:**
* 8B (Blue): ~17 (Score), ~17 (Select)
* 70B (Orange): ~33.5 (Score), ~33.5 (Select)
* 405B (Green): ~28.5 (Score), ~28.5 (Select)
* *Trend:* Significant increase, especially for 70B. 70B outperforms 405B.
5. **Graph-CoT Explore:**
* 8B (Blue): ~25.5 (Score), ~25.5 (Select)
* 70B (Orange): ~29.5 (Score), ~29.5 (Select)
* 405B (Green): ~28.5 (Score), ~28.5 (Select)
* *Trend:* Scores are high and clustered. 70B and 405B are nearly tied.
6. **Graph-ToT Agent:**
* 8B (Blue): ~29 (Score), ~29 (Select)
* 70B (Orange): ~38.5 (Score), ~40.5 (Select)
* 405B (Green): ~47.5 (Score), ~46.5 (Select)
* *Trend:* Highest scores observed so far. Clear advantage for larger models. 405B achieves the highest single score (~47.5).
7. **Graph-ToT Explore:**
* 8B (Blue): ~24.5 (Score), ~25 (Select)
* 70B (Orange): ~32.5 (Score), ~33.5 (Select)
* 405B (Green): ~34 (Score), ~34 (Select)
* *Trend:* Scores drop compared to the "Agent" variant. 405B and 70B are close.
8. **Graph-GoT Agent:**
* 8B (Blue): ~29.5 (Score), ~29.5 (Select)
* 70B (Orange): ~31 (Score), ~40.5 (Select)
* 405B (Green): ~43.5 (Score), ~47.5 (Select)
* *Trend:* Very high scores. A large discrepancy appears for the 70B model between "Score" (~31) and "Select" (~40.5) evaluators. 405B "Select" is very high (~47.5).
9. **Graph-GoT Explore:**
* 8B (Blue): ~25 (Score), ~25 (Select)
* 70B (Orange): ~31 (Score), ~37 (Select)
* 405B (Green): ~35 (Score), ~34 (Select)
* *Trend:* Similar pattern to Graph-ToT Explore, with "Agent" variants outperforming "Explore". 70B shows a notable gap between evaluators.
### Key Observations
1. **Method Progression:** There is a clear upward trend in Rouge-L scores as methods evolve from Base -> Text-RAG -> Graph-RAG -> CoT -> ToT/GoT. Graph-based Tree-of-Thought (ToT) and Graph-of-Thought (GoT) methods achieve the highest performance.
2. **Model Size Impact:** Generally, larger models (70B, 405B) outperform the 8B model. However, the advantage is not always linear; in some cases (e.g., Graph-RAG, Graph-CoT Agent), the 70B model outperforms the 405B model.
3. **Agent vs. Explore:** For ToT and GoT methods, the "Agent" variant consistently yields higher scores than the "Explore" variant for the same model size.
4. **Evaluator Discrepancy:** For most method/model combinations, the "Score" and "Select" evaluators produce nearly identical results (bars of equal height). The most significant exception is **Graph-GoT Agent with the 70B model**, where the "Select" evaluator bar is substantially taller (~40.5) than the "Score" evaluator bar (~31).
5. **Peak Performance:** The highest approximate score on the chart is ~47.5, achieved by the **405B model using the Graph-GoT Agent method with the "Select" evaluator**.
### Interpretation
This chart demonstrates the effectiveness of advanced, graph-augmented reasoning frameworks (ToT, GoT) over simpler RAG or chain-of-thought approaches for the task measured by Rouge-L. The data suggests that structuring reasoning as a tree or graph ("Agent" mode) is more effective than an exploratory approach ("Explore" mode).
The relationship between model size and performance is complex. While scaling from 8B to 70B provides a major boost, further scaling to 405B yields diminishing or inconsistent returns, indicating that methodological improvements (like switching from CoT to ToT) can be as impactful as raw parameter scaling.
The outlier in the Graph-GoT Agent (70B) results, where evaluators disagree, may indicate instability in that specific configuration or that the "Select" evaluator is better at capturing the benefits of the GoT method for that model size. Overall, the chart makes a strong case for investing in sophisticated graph-based reasoning architectures, particularly when paired with sufficiently large language models.
</details>
Figure 11: Performance results using the Rouge-L metric on the healthcare graph of GRBench graphCoT, comparing all methods with Llama 3.1 Instruct models of various sizes (8B, 70B, 405B-FP8). Experimental details are included in Section 5.
<details>
<summary>figures/healthcare_gpt4score_all.png Details</summary>

### Visual Description
## Grouped Bar Chart: GPT-4 Scores by Model Size and Method
### Overview
This image is a grouped bar chart comparing the performance of different AI models (8B, 70B, 405B parameters) across various methods or techniques. The performance is measured by a "GPT-4 Score" on the y-axis. The chart evaluates two types of "State Evaluators": "Score" (solid bars) and "Select" (hatched bars).
### Components/Axes
* **Y-Axis:** Labeled "GPT-4 Score". The scale runs from 0 to 40, with major grid lines at intervals of 10.
* **X-Axis:** Lists nine distinct methods or experimental conditions. From left to right:
1. Base
2. Text-RAG
3. Graph-RAG
4. Graph-CoT Agent
5. Graph-CoT Search
6. Graph-ToT Agent
7. Graph-ToT Search
8. Graph-GoT Agent
9. Graph-GoT Search
* **Legend 1 (Top-Left):** "Models" with three color-coded categories:
* Blue: 8B
* Orange: 70B
* Green: 405B
* **Legend 2 (Top-Center):** "State Evaluators" with two pattern-coded categories:
* Solid fill: Score
* Diagonal hatching (///): Select
### Detailed Analysis
The chart presents data for each of the nine methods, with up to six bars per method (three model sizes, each with a "Score" and "Select" variant). Values are approximate based on visual alignment with the y-axis grid.
**1. Base**
* **8B (Blue):** Score ≈ 12. No Select bar.
* **70B (Orange):** Score ≈ 13. No Select bar.
* **405B (Green):** Score ≈ 16. No Select bar.
* *Trend:* Scores increase with model size.
**2. Text-RAG**
* **8B (Blue):** Score ≈ 12. No Select bar.
* **70B (Orange):** Score ≈ 12. No Select bar.
* **405B (Green):** Score ≈ 13. No Select bar.
* *Trend:* Minimal improvement over Base, with a slight increase for 405B.
**3. Graph-RAG**
* **8B (Blue):** Score ≈ 15. No Select bar.
* **70B (Orange):** Score ≈ 17. No Select bar.
* **405B (Green):** Score ≈ 18. No Select bar.
* *Trend:* Clear improvement over Text-RAG for all model sizes.
**4. Graph-CoT Agent**
* **8B (Blue):** Score ≈ 18. No Select bar.
* **70B (Orange):** Score ≈ 33. No Select bar.
* **405B (Green):** Score ≈ 29. No Select bar.
* *Trend:* Significant jump for the 70B and 405B models. The 70B model outperforms the 405B model here.
**5. Graph-CoT Search**
* **8B (Blue):** Score ≈ 22. No Select bar.
* **70B (Orange):** Score ≈ 28. No Select bar.
* **405B (Green):** Score ≈ 29. No Select bar.
* *Trend:* Consistent improvement with model size. Scores are generally higher than Graph-CoT Agent for 8B and 405B.
**6. Graph-ToT Agent**
* **8B (Blue):** Score ≈ 31, Select ≈ 30.
* **70B (Orange):** Score ≈ 30, Select ≈ 40.
* **405B (Green):** Score ≈ 46, Select ≈ 44.
* *Trend:* Highest scores observed so far. The 405B model achieves the chart's peak score (≈46). For 70B, the "Select" evaluator yields a much higher score than the "Score" evaluator.
**7. Graph-ToT Search**
* **8B (Blue):** Score ≈ 21, Select ≈ 22.
* **70B (Orange):** Score ≈ 25, Select ≈ 31.
* **405B (Green):** Score ≈ 31, Select ≈ 32.
* *Trend:* Scores are lower than the "Agent" variant for the same method. The "Select" evaluator consistently scores slightly higher than the "Score" evaluator.
**8. Graph-GoT Agent**
* **8B (Blue):** Score ≈ 30, Select ≈ 30.
* **70B (Orange):** Score ≈ 30, Select ≈ 40.
* **405B (Green):** Score ≈ 42, Select ≈ 45.
* *Trend:* Very high performance, comparable to Graph-ToT Agent. The 405B "Select" bar (≈45) is among the highest on the chart. The 70B model again shows a large gap between "Score" and "Select".
**9. Graph-GoT Search**
* **8B (Blue):** Score ≈ 23, Select ≈ 22.
* **70B (Orange):** Score ≈ 24, Select ≈ 33.
* **405B (Green):** Score ≈ 32, Select ≈ 31.
* *Trend:* Similar pattern to Graph-ToT Search: "Agent" variants outperform "Search" variants. The 70B "Select" score is notably higher than its "Score".
### Key Observations
1. **Method Superiority:** The "Graph-ToT Agent" and "Graph-GoT Agent" methods achieve the highest overall scores, particularly for the 405B model.
2. **Model Size Impact:** Larger models (405B) generally perform better, but this is not absolute. The 70B model outperforms the 405B in the "Graph-CoT Agent" condition.
3. **Agent vs. Search:** For both ToT (Tree-of-Thought) and GoT (Graph-of-Thought) methods, the "Agent" variant consistently yields higher scores than the "Search" variant.
4. **Evaluator Discrepancy:** For the high-performing Agent methods (ToT and GoT), the "Select" evaluator (hatched bars) often produces a significantly higher score than the standard "Score" evaluator, especially for the 70B model.
5. **Baseline Progression:** There is a clear progression in performance from Base → Text-RAG → Graph-RAG, establishing a baseline improvement from simple to graph-based retrieval.
### Interpretation
This chart demonstrates the efficacy of advanced reasoning frameworks (CoT, ToT, GoT) when applied to graph-based structures, especially in an "Agent" configuration. The data suggests that:
* **Structured Reasoning Matters:** Moving from simple retrieval (Text-RAG) to graph retrieval (Graph-RAG) and then to structured reasoning over those graphs (CoT/ToT/GoT Agents) yields substantial performance gains.
* **The "Agent" Paradigm is Key:** The significant drop from "Agent" to "Search" variants for ToT and GoT implies that the agentic loop, where the model actively plans, executes, and self-corrects, is crucial for maximizing performance on this GPT-4 evaluation metric.
* **Evaluation Method Influences Results:** The discrepancy between "Score" and "Select" evaluators, particularly for the 70B model in Agent modes, indicates that the evaluation protocol itself can dramatically affect the reported outcome. "Select" may represent a more favorable or lenient evaluation criterion.
* **Scale Isn't Everything:** The anomaly where 70B outperforms 405B in "Graph-CoT Agent" suggests that for certain method-model combinations, factors beyond raw parameter count (like training data or alignment for that specific task) play a critical role. The overall trend, however, still favors scale, as seen in the peak performance of the 405B model in the most advanced methods.
In summary, the chart argues for the development of AI systems that combine large-scale models with sophisticated, graph-aware agentic reasoning frameworks, while also highlighting the importance of the evaluation methodology in assessing their true capability.
</details>
Figure 12: Performance results using the GPT4Score metric on the healthcare graph of GRBench graphCoT, comparing all methods with Llama 3.1 Instruct models of various sizes (8B, 70B, 405B-FP8). Experimental details are included in Section 5.
## Appendix D Results on GPT4Score
In this section, we present the results of the experiments described in Section 5 for all methods, using the GPT4Score metric. This metric calculates the percentage of "correct" answers as judged by GPT-4 when presented with both the correct and the generated answer. The tables in this section present the same data as in Tables 1 and 2, but evaluated using GPT4Score.
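As an illustration of how such a judge-based metric is aggregated, the sketch below prompts a judge model with the reference and generated answers and averages binary verdicts. The prompt wording and the `judge` callable are assumptions for illustration, not the exact protocol used in the experiments:

```python
# Illustrative GPT4Score aggregation: a judge model returns a per-example
# "correct"/"incorrect" verdict, and the metric is the percentage judged
# correct. `judge` stands in for a GPT-4 API call (hypothetical interface);
# the prompt template below is an assumption, not the paper's exact prompt.

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Is the model answer correct? Reply 'correct' or 'incorrect'."
)

def gpt4score(examples, judge):
    verdicts = []
    for ex in examples:  # each example: question, reference, prediction
        reply = judge(JUDGE_TEMPLATE.format(**ex))
        verdicts.append(reply.strip().lower().startswith("correct"))
    return 100.0 * sum(verdicts) / len(verdicts)  # percentage judged correct
```

Because the judge sees both answers side by side, the metric credits semantically correct responses that Rouge-L would penalize for low lexical overlap.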
| Method | Variant | Evaluator | Model | Healthcare | Goodreads | Biology | Chemistry | Materials Science | Medicine | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | Base | - | Llama 3.1 8B-Ins | 11.85 | 13.33 | 10.71 | 11.43 | 7.86 | 7.87 | 9.29 |
| Baselines | Base | - | Llama 3.1 70B-Ins | 12.96 | 19.17 | 10.00 | 12.14 | 11.43 | 11.43 | 12.86 |
| Baselines | Base | - | Llama 3.1 405B-Ins | 15.55 | 26.67 | 12.86 | 12.14 | 12.14 | 13.57 | 12.14 |
| Baselines | Text-RAG | - | Llama 3.1 8B-Ins | 11.85 | 21.67 | 12.86 | 10.00 | 10.00 | 8.57 | 7.86 |
| Baselines | Text-RAG | - | Llama 3.1 70B-Ins | 12.22 | 27.50 | 12.14 | 13.57 | 13.57 | 13.57 | 12.86 |
| Baselines | Text-RAG | - | Llama 3.1 405B-Ins | 12.96 | 26.67 | 15.00 | 13.57 | 12.86 | 14.29 | 13.57 |
| Baselines | Graph-RAG | - | Llama 3.1 8B-Ins | 14.81 | 32.50 | 29.29 | 29.28 | 27.86 | 25.71 | 29.29 |
| Baselines | Graph-RAG | - | Llama 3.1 70B-Ins | 17.04 | 32.92 | 39.29 | 40.71 | 43.57 | 34.29 | 40.00 |
| Baselines | Graph-RAG | - | Llama 3.1 405B-Ins | 18.15 | 31.67 | 37.14 | 42.86 | 40.00 | 36.43 | 41.43 |
| Graph-CoT | Agent | - | Llama 3.1 8B-Ins | 18.15 | 32.50 | 20.71 | 19.28 | 25.00 | 14.29 | 21.43 |
| Graph-CoT | Agent | - | Llama 3.1 70B-Ins | 32.59 | 43.75 | 50.00 | 51.43 | 50.00 | 48.57 | 46.43 |
| Graph-CoT | Agent | - | Llama 3.1 405B-Ins | 28.89 | 48.33 | 38.57 | 38.57 | 47.86 | 56.43 | 34.29 |
| Graph-CoT | Explore | - | Llama 3.1 8B-Ins | 22.22 | 36.67 | 35.00 | 30.71 | 29.29 | 29.29 | 32.86 |
| Graph-CoT | Explore | - | Llama 3.1 70B-Ins | 27.78 | 32.92 | 45.71 | 49.29 | 40.00 | 40.00 | 44.29 |
| Graph-CoT | Explore | - | Llama 3.1 405B-Ins | 28.89 | 48.33 | 38.57 | 38.57 | 47.86 | 56.43 | 34.29 |
| Graph-ToT | Agent | Score | Llama 3.1 8B-Ins | 30.49 | 55.14 | 43.33 | 41.67 | 44.05 | 36.43 | 39.52 |
| Graph-ToT | Agent | Score | Llama 3.1 70B-Ins | 30.49 | 54.48 | 65.48 | 62.14 | 55.95 | 63.57 | 56.19 |
| Graph-ToT | Agent | Score | Llama 3.1 405B-Ins | 45.55 | 56.53 | 71.67 | 65.71 | 52.62 | 68.81 | 44.76 |
| Graph-ToT | Agent | Select | Llama 3.1 8B-Ins | 30.00 | 54.17 | 40.71 | 37.14 | 40.00 | 32.86 | 36.43 |
| Graph-ToT | Agent | Select | Llama 3.1 70B-Ins | 39.63 | 56.67 | 65.00 | 67.14 | 62.86 | 60.71 | 55.55 |
| Graph-ToT | Agent | Select | Llama 3.1 405B-Ins | 44.07 | 58.75 | 71.43 | 69.29 | 65.00 | 68.81 | 60.00 |
| Graph-ToT | Explore | Score | Llama 3.1 8B-Ins | 21.48 | 41.10 | 32.86 | 31.67 | 31.43 | 32.14 | 35.24 |
| Graph-ToT | Explore | Score | Llama 3.1 70B-Ins | 24.94 | 40.97 | 52.38 | 57.86 | 49.29 | 54.29 | 47.86 |
| Graph-ToT | Explore | Score | Llama 3.1 405B-Ins | 30.86 | 48.33 | 47.86 | 57.14 | 50.71 | 56.67 | 47.14 |
| Graph-ToT | Explore | Select | Llama 3.1 8B-Ins | 21.85 | 41.67 | 32.86 | 31.67 | 31.43 | 32.14 | 35.24 |
| Graph-ToT | Explore | Select | Llama 3.1 70B-Ins | 30.37 | 42.08 | 54.29 | 57.14 | 47.86 | 52.14 | 46.43 |
| Graph-ToT | Explore | Select | Llama 3.1 405B-Ins | 31.48 | 48.75 | 45.00 | 57.86 | 48.86 | 57.14 | 45.71 |
Table 4: GPT4Score performance results on GRBench graphCoT, comparing standard LLMs, Text-RAG, Graph-RAG, Graph-CoT, and Graph-ToT. Experiments are described in Section 5, using Llama 3.1 Instruct backbone models with sizes 8B, 70B, and 405B.
| Method | Evaluator | Model | Healthcare | Biology |
| --- | --- | --- | --- | --- |
| Agent | Score | Llama 3.1 8B-Ins | 29.88 | 32.86 |
| Agent | Score | Llama 3.1 70B-Ins | 29.51 | 61.69 |
| Agent | Score | Llama 3.1 405B-Ins | 41.81 | 48.33 |
| Agent | Select | Llama 3.1 8B-Ins | 30.00 | 40.71 |
| Agent | Select | Llama 3.1 70B-Ins | 39.63 | 69.83 |
| Agent | Select | Llama 3.1 405B-Ins | 44.81 | 72.86 |
| Graph Explore | Score | Llama 3.1 8B-Ins | 22.72 | 21.19 |
| Graph Explore | Score | Llama 3.1 70B-Ins | 24.20 | 48.57 |
| Graph Explore | Score | Llama 3.1 405B-Ins | 32.22 | 41.67 |
| Graph Explore | Select | Llama 3.1 8B-Ins | 22.59 | 19.28 |
| Graph Explore | Select | Llama 3.1 70B-Ins | 32.96 | 52.86 |
| Graph Explore | Select | Llama 3.1 405B-Ins | 31.48 | 57.86 |
Figure 13: Graph-GoT results (GPT4Score) on GRBench with Llama 3.1 Instruct sizes 8B, 70B, and 405B.
<details>
<summary>figures/radar_levels.png Details</summary>

### Visual Description
## Radar Chart: Performance Comparison of CoT, ToT, and GoT Across Difficulty Levels
### Overview
The image is a radar chart (spider plot) comparing the performance of three distinct methods or models, labeled CoT, ToT, and GoT, across three difficulty categories: Easy, Medium, and Hard. The chart uses a triangular axis layout with concentric circles as scale markers. The data suggests a performance evaluation, likely in a problem-solving or reasoning task context.
### Components/Axes
* **Chart Type:** Radar/Spider Chart.
* **Axes (Categories):** Three axes radiate from the center, labeled:
* **Easy** (positioned at the 3 o'clock position, right side).
* **Medium** (positioned at the 11 o'clock position, top-left).
* **Hard** (positioned at the 7 o'clock position, bottom-left).
* **Scale:** Concentric dashed circles represent the radial scale. The visible markers are labeled:
* **0.2** (inner circle)
* **0.4** (middle circle)
* **0.6** (outer circle, forming the chart's boundary).
* **Legend:** Located in the top-right corner of the image. It defines the three data series:
* **CoT:** Represented by a **black line** and a semi-transparent grey filled area.
* **ToT:** Represented by a **blue line** and a semi-transparent light blue filled area.
* **GoT:** Represented by a **red line** and a semi-transparent pink/red filled area.
### Detailed Analysis
**Data Series & Approximate Values:**
The performance value for each method on each axis is determined by how far its line extends along that axis from the center.
1. **CoT (Black Line):**
* **Trend:** Forms the innermost triangle, indicating the lowest overall performance.
* **Easy Axis:** Extends to approximately **0.4**.
* **Medium Axis:** Extends to approximately **0.3**.
* **Hard Axis:** Extends to approximately **0.25**.
* **Pattern:** Performance decreases as task difficulty increases.
2. **ToT (Blue Line):**
* **Trend:** Forms the outermost triangle, indicating the highest overall performance.
* **Easy Axis:** Extends to the maximum value of **0.6**.
* **Medium Axis:** Extends to approximately **0.55**.
* **Hard Axis:** Extends to approximately **0.5**.
* **Pattern:** Maintains high performance across all levels, with a slight decrease from Easy to Hard.
3. **GoT (Red Line):**
* **Trend:** Forms a triangle positioned between CoT and ToT.
* **Easy Axis:** Extends to the maximum value of **0.6**, matching ToT.
* **Medium Axis:** Extends to approximately **0.5**.
* **Hard Axis:** Extends to approximately **0.4**.
* **Pattern:** Matches the top performance on Easy tasks but shows a more pronounced drop-off on Medium and Hard tasks compared to ToT.
**Spatial Grounding & Cross-Reference:**
* The legend is clearly separated in the top-right and does not overlap the chart data.
* The color mapping is consistent: Black=CoT, Blue=ToT, Red=GoT. The filled areas correspond to their respective line colors.
* ToT and GoT converge at the **0.6** point on the **Easy** axis, indicating equal peak performance for those two methods; CoT reaches a lower value (0.4) on the same axis.
### Key Observations
1. **Universal Peak on "Easy":** All methods achieve their highest score on the "Easy" axis. ToT and GoT both reach the chart's maximum value (0.6).
2. **Performance Hierarchy:** A clear and consistent hierarchy is visible across all difficulty levels: **ToT > GoT > CoT**.
3. **Difficulty Impact:** Performance for all methods degrades as task difficulty increases from Easy to Hard. The rate of degradation varies: CoT and GoT show a steeper decline than ToT.
4. **Convergence Point:** The only point where two methods (ToT and GoT) have identical performance is on the "Easy" task axis.
### Interpretation
This chart visually demonstrates the comparative effectiveness of three different approaches (likely reasoning or problem-solving frameworks, given the common acronyms CoT for Chain-of-Thought, ToT for Tree-of-Thought, and GoT for Graph-of-Thought) as a function of task difficulty.
* **What the data suggests:** The **ToT (Tree-of-Thought)** method is the most robust and effective across the board, maintaining high performance even as tasks become harder. **GoT (Graph-of-Thought)** is highly effective for easy problems but is less scalable to complex tasks than ToT. **CoT (Chain-of-Thought)** is the least effective method shown, particularly for medium and hard problems.
* **Relationship between elements:** The triangular layout directly maps performance to difficulty. The area enclosed by each method's polygon can be interpreted as its overall "capability space." ToT's larger area signifies greater general capability.
* **Notable Anomaly/Insight:** The fact that GoT matches ToT on easy tasks but falls behind on harder ones suggests that the additional structural complexity of a graph (GoT) offers no advantage over a tree structure (ToT) on simple problems and becomes a liability on complex ones, where the tree's organized hierarchy prevails. The chart argues for the superiority of the Tree-of-Thought approach in this specific evaluation context.
</details>
Figure 14: Results decomposed into easy/medium/hard questions according to GPT4Score.
## Appendix E Computational Analysis
<details>
<summary>figures/computational_analysis.png Details</summary>

### Visual Description
## Line Chart: Number of Operations vs. Reasoning Steps
### Overview
The image is a line chart comparing the computational cost, measured in "Number of Operations," of three different reasoning methods (CoT, ToT, and GoT) across an increasing number of "Reasoning Steps" from 1 to 10. The chart demonstrates a dramatic divergence in scalability, with one method showing exponential growth.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis (Horizontal):**
* **Label:** "Reasoning Steps"
* **Scale:** Linear, integer values from 1 to 10.
* **Y-Axis (Vertical):**
* **Label:** "Number of Operations"
* **Scale:** Linear, with major gridlines and labels at 0, 50,000, 100,000, 150,000, and 200,000.
* **Legend:** Located in the top-left corner of the plot area.
* **CoT:** Black line with solid circle markers.
* **ToT:** Blue line with solid square markers.
* **GoT:** Red line with solid triangle markers.
### Detailed Analysis
The chart plots three distinct data series:
1. **CoT (Black line, circle markers):**
* **Trend:** The line is essentially flat and horizontal, hovering very close to the x-axis (0 operations) across all 10 reasoning steps.
* **Data Points (Approximate):** The number of operations remains at or near 0 for steps 1 through 10.
2. **ToT (Blue line, square markers):**
* **Trend:** The line remains flat near zero for the first several steps, then begins a clear, accelerating upward curve starting around step 7.
* **Data Points (Approximate):**
* Steps 1-6: ~0 operations.
* Step 7: ~5,000 operations.
* Step 8: ~10,000 operations.
* Step 9: ~30,000 operations.
* Step 10: ~90,000 operations.
3. **GoT (Red line, triangle markers):**
* **Trend:** The line shows the most aggressive growth. It begins its upward curve earlier than ToT (around step 6) and rises much more steeply, exhibiting clear exponential growth.
* **Data Points (Approximate):**
* Steps 1-5: ~0 operations.
* Step 6: ~5,000 operations.
* Step 7: ~10,000 operations.
* Step 8: ~25,000 operations.
* Step 9: ~75,000 operations.
* Step 10: ~220,000 operations (exceeding the top gridline of 200,000).
### Key Observations
* **Divergence Point:** All three methods have negligible operational cost for the first 5-6 steps. The significant divergence begins at step 6 for GoT and step 7 for ToT.
* **Growth Magnitude:** At step 10, the operational cost of GoT is approximately 2.4 times that of ToT and orders of magnitude greater than CoT.
* **Scalability:** CoT scales linearly with a very low per-step cost, appearing essentially flat at this scale. ToT scales super-linearly (likely polynomial or exponential). GoT scales with an even steeper exponential curve.
* **Visual Confirmation:** The red triangle markers (GoT) are consistently above the blue square markers (ToT) from step 8 onward, and both are far above the black circle markers (CoT).
### Interpretation
This chart visually quantifies the trade-off between reasoning complexity and computational cost for different AI reasoning paradigms.
* **CoT (Chain-of-Thought):** Represents a simple, sequential reasoning process. Its flat line suggests it adds minimal operational overhead per step, making it highly efficient but potentially limited in problem-solving power.
* **ToT (Tree-of-Thought):** Represents a branching, exploratory reasoning process. The rising curve indicates that evaluating multiple branches at each step leads to a multiplicative increase in operations, becoming costly for deep reasoning chains.
* **GoT (Graph-of-Thought):** Represents a more complex reasoning process where thoughts can form interconnected graphs (allowing for merging, cycling, and more complex structures than a tree). The steep exponential rise suggests this flexibility comes at an extreme computational price, potentially limiting its practical application to problems requiring only a few reasoning steps unless optimized.
The data suggests that while more sophisticated reasoning frameworks (ToT, GoT) may offer superior problem-solving capabilities, they face a severe scalability challenge. The "Number of Operations" serves as a proxy for time, memory, or cost, implying that for tasks requiring deep reasoning (many steps), CoT may be the only feasible approach without significant algorithmic or hardware improvements. The chart argues for the necessity of research into pruning, heuristic guidance, or approximate methods to make advanced reasoning frameworks like GoT practical.
</details>
Figure 15: Comparison of computational costs across different reasoning strategies (CoT, ToT, GoT) as a function of reasoning steps. GoT exhibits the highest cost due to merge operations and graph traversal.
| Strategy | Method | Key Parameters | Approx. # LLM Calls | Approx. # KG Operations | Primary Growth Driver(s) |
| --- | --- | --- | --- | --- | --- |
| CoT | Agent | $n$ | $O(n)$ | $O(n)$ | $n$ (linear) |
| | Graph Explore | $n$, $d$ | $O(n)$ | $O(n \cdot \mathrm{Cost}_{\mathrm{Explore}}(d))$ | $n$, $d$ |
| ToT | Agent | $D_{\max}$, $k$, $t$ | $O\left(k \cdot \frac{t^{D_{\max}}-1}{t-1}\right)$ | Same as LLM Calls | $D_{\max}$, $k$, $t$ (exponential in $D_{\max}$) |
| | Graph Explore | $D_{\max}$, $k$, $t$, $d$ | $O\left(k \cdot \frac{t^{D_{\max}}-1}{t-1}\right)$ | $O\left(k \cdot \frac{t^{D_{\max}}-1}{t-1} \cdot \mathrm{Cost}_{\mathrm{Explore}}(d)\right)$ | $D_{\max}$, $k$, $t$, $d$ |
| GoT | Agent | $D_{\max}$, $k$, $t$ | $O\left(k \cdot \frac{t^{D_{\max}}-1}{t-1} + \sum_{i=1}^{D_{\max}} \left\lfloor \frac{k \cdot t^i}{2} \right\rfloor\right)$ | Same as LLM Calls | $D_{\max}$, $k$, $t$ (aggregation adds extra cost) |
| | Graph Explore | $D_{\max}$, $k$, $t$, $d$ | Same as Agent | $O\left(\text{LLM Calls} \cdot \mathrm{Cost}_{\mathrm{Explore}}(d)\right)$ | $D_{\max}$, $k$, $t$, $d$ |
Table 5: Theoretical computational complexity comparison of reasoning methods. Parameters: $n$ (reasoning steps), $D_{\max}$ (tree depth), $k$ (branching factor), $t$ (paths retained), $d$ (KG search depth), and $\mathrm{Cost}_{\mathrm{Explore}}(d)$ (cost per KG search). GoT includes pairwise aggregation of thoughts at each depth.
The computational analysis summarized in Table 5 highlights the trade-offs between reasoning power and computational cost when grounding LLMs with Knowledge Graphs (KGs). The Agent-based methods scale linearly with the number of reasoning steps or tree nodes, with CoT representing the lowest-cost baseline. In contrast, Automatic Graph Exploration methods introduce additional overhead via entity extraction, multi-hop traversal up to a fixed maximum depth $d$, and LLM-based pruning at each step.
Among reasoning strategies, ToT introduces exponential growth in cost with respect to depth due to its exploration of $k$ branches and selection of $t$ continuations per level. GoT further amplifies this by incorporating aggregation transformations that attempt to merge every pair of thoughts at each depth, leading to an additional cost proportional to $\sum_{i=1}^{D_{\max}} \left\lfloor \frac{k \cdot t^i}{2} \right\rfloor$.
Importantly, our experiments reveal that the higher complexity of GoT does not consistently translate to improved accuracy compared to ToT, suggesting diminishing returns. While the model size (e.g., 8B, 70B, 405B) influences the latency and memory footprint of each LLM call, it does not affect the algorithmic complexity classes shown. Hence, selecting a strategy requires balancing reasoning depth with feasible compute budgets.
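To make the closed-form counts in Table 5 concrete, the sketch below evaluates them numerically. This is our own illustrative code (the function names are not from any released implementation), using the summation form of the geometric series so the $t = 1$ case needs no special handling:

```python
def cot_llm_calls(n):
    """CoT agent: one LLM call per reasoning step -> O(n)."""
    return n

def tot_llm_calls(d_max, k, t):
    """ToT agent: k branches expanded from each of the t retained paths,
    summed over depths 1..d_max, i.e. k * (t^d_max - 1) / (t - 1)."""
    return sum(k * t ** (i - 1) for i in range(1, d_max + 1))

def got_llm_calls(d_max, k, t):
    """GoT agent: ToT cost plus pairwise aggregation, which adds
    floor(k * t^i / 2) merge calls at each depth i."""
    merges = sum((k * t ** i) // 2 for i in range(1, d_max + 1))
    return tot_llm_calls(d_max, k, t) + merges
```

For example, with $D_{\max} = 4$, $k = 3$, and $t = 2$, ToT issues 45 LLM calls while GoT issues 90, illustrating how the aggregation term roughly doubles the cost at these settings.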
## Appendix F Prompts Archive
In this section, we gather the prompts used to implement the proposed methodologies.
Agent
Agent Step: Solve a question answering task with interleaved Thought, Interaction with Graph, and Feedback from Graph steps. In the Thought step, you can think about what further information is needed, and in the Interaction step, you can get feedback from graphs with four functions: (1) RetrieveNode[keyword], which retrieves the related node from the graph according to the corresponding query. (2) NodeFeature[Node, feature], which returns the detailed attribute information of Node regarding the given "feature" key. (3) NodeDegree[Node, neighbor_type], which calculates the number of "neighbor_type" neighbors of the node Node in the graph. (4) NeighbourCheck[Node, neighbor_type], which lists the "neighbor_type" neighbors of the node Node in the graph and returns them. You may take as many steps as necessary. Here are some examples: {examples} Please answer by providing the node's main feature (e.g., name) rather than node IDs. Generate the next step. Definition of the graph: {graph_definition} Question: {question} {scratchpad}
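For intuition, the four graph functions exposed to the agent can be mimicked over a toy adjacency structure. The snippet below is a hypothetical, minimal stand-in (the dictionary layout and function names are ours, not the benchmark's actual API), populated with the KRT39 example from Appendix G:

```python
# Toy knowledge graph: node ID -> {"features": {...}, "neighbors": {type: [IDs]}}
GRAPH = {
    "390792": {
        "features": {"name": "KRT39"},
        "neighbors": {"Anatomy-expresses-Gene": ["UBERON:0000033", "UBERON:0002097"]},
    },
    "UBERON:0000033": {"features": {"name": "head"}, "neighbors": {}},
    "UBERON:0002097": {"features": {"name": "skin of body"}, "neighbors": {}},
}

def retrieve_node(keyword):
    """RetrieveNode[keyword]: return the ID of the node whose name matches."""
    for node_id, data in GRAPH.items():
        if data["features"].get("name") == keyword:
            return node_id
    return None

def node_feature(node, feature):
    """NodeFeature[Node, feature]: return the node's attribute for the given key."""
    return GRAPH[node]["features"].get(feature)

def node_degree(node, neighbor_type):
    """NodeDegree[Node, neighbor_type]: count neighbors of the given type."""
    return len(GRAPH[node]["neighbors"].get(neighbor_type, []))

def neighbour_check(node, neighbor_type):
    """NeighbourCheck[Node, neighbor_type]: list neighbors of the given type."""
    return GRAPH[node]["neighbors"].get(neighbor_type, [])
```

With this toy graph, a multi-step trace reduces to `retrieve_node("KRT39")`, a `neighbour_check` on the returned ID, and `node_feature` lookups for the anatomy names.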
Automatic Graph Exploration
Search Thought: Given the previous thoughts, generate the next thought to answer the provided question. Your end goal is to answer the question step by step. For context, you are also provided with some knowledge triples from a knowledge base. Follow the format of the examples to generate the next thought. {examples} Graph Definition: {graph_definition} Question: {question} Knowledge Triples: {triples} Previous thoughts: {thoughts} Related Entity Attributes: {attributes} Next Thought:

Search End?: You are provided with an original question, the associated subquestion thoughts, and their corresponding knowledge graph triples (head_entity -> relation -> tail_entity). Your task is to answer whether it's sufficient for you to answer the original question (Yes or No). You are provided with examples. You should follow the same format as in the examples, writing "Yes" or "No" within brackets at the beginning of the answer. (Examples) Task: Question: {question} Thoughts: {thoughts} Knowledge Triples: {triples} Entity Attributes: {attributes} Answer:

Entity Extraction: Given the provided text, extract the relevant entities that may appear in a knowledge base. Return the relevant entities in brackets at the end, as shown in the following examples. If there are several entities, separate them with commas. (Examples) Task: Text: {text} Relevant Entities:

Prune Relations: From the given entity and relations, select only the relevant relations to answer the question. Provide the answer in brackets at the end, as shown in the following example. (Examples) Question: {question} Head Entity: {entity} Relations: {relations} Answer:

Prune Entities: You are provided with a question, a head entity, a relation, and tail entity or entities from a knowledge base. Select the tail entity or entities to answer the question. Return the relevant tail entity or entities in brackets at the end, as shown in the following examples. (Examples) Question: {question} Head Entity: {head_entity} Relation: {relation} Tail Entities: {tail_entities} Relevant Entities:

Search Attributes: Is any of the attributes relevant to answer the question? Return the answer in brackets at the end, as shown in the following examples. (Examples) Question: {question} Entity: {entity} Attributes: {attributes} Relevant Attributes:
State Evaluators
Selection Vote: Given a question, you need to select the possible chain of thought that may lead to the correct answer with higher probability. You are provided with several choices with thoughts and related triples from a knowledge base. Decide which choice is most promising to complete the task. Analyze each choice in detail, then conclude in the last line: "The best choice is s", where s is the integer id of the choice. (Examples) Question: {question} Choices: {choices} (+ attributes/triples) Answer:

Score Vote: Generate a score for the given reasoning chain. The score represents the probability that the chain will lead to the correct answer. The chains contain interleaved thoughts and related triples from a knowledge base. Some chains may not be complete, but you need to judge the steps that are provided. The score can be any floating-point number between 0 and 1. (Examples) Question: {question} Thought Chain: {thoughts} (+ attributes/triples) Score:
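Mechanically, both evaluators reduce to scoring candidate chains and keeping the most promising ones for the next expansion level. A minimal sketch, with a toy heuristic standing in for the LLM's scoring call (all names here are illustrative):

```python
def select_top_chains(chains, score_fn, t=2):
    """Score each candidate reasoning chain with score_fn (a stand-in for
    the LLM evaluator, which returns a float in [0, 1]) and retain the
    t most promising chains."""
    return sorted(chains, key=score_fn, reverse=True)[:t]

def toy_score(chain):
    """Illustrative heuristic: chains with more thoughts and more
    supporting triples score higher, capped at 1.0."""
    return min(1.0, 0.2 * len(chain["thoughts"]) + 0.1 * len(chain["triples"]))
```

In the actual pipeline, `score_fn` would issue the Score Vote prompt and parse the returned float; the selection step is otherwise identical.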
Graph of Thoughts
Graph of Thought Merge: Generate the next thought for the merged chain of thoughts. You are provided with the question, two chains of thoughts, and the corresponding merged chain of thought. Identify inconsistencies or errors from the previous chains and provide the next thought for the merged chain. You should follow the same format as in the examples. (Examples) Question: {question} Chain 1: {chain_1} (+triples/attributes) Chain 2: {chain_2} (+triples/attributes) Merged Chain: {merged_chain} Next Thought:
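The merge prompt above is applied pairwise: at depth $i$ the frontier of $k \cdot t^i$ thoughts yields $\lfloor k \cdot t^i / 2 \rfloor$ merge calls, which is where the aggregation term in Table 5 comes from. A minimal sketch of that pairing logic (illustrative, not the released implementation):

```python
def pair_for_merge(frontier):
    """Pair adjacent chains in the frontier for aggregation; an odd
    leftover chain is carried over unmerged. Produces
    floor(len(frontier) / 2) pairs, matching the floor(k * t^i / 2)
    aggregation term per depth."""
    pairs = [(frontier[j], frontier[j + 1]) for j in range(0, len(frontier) - 1, 2)]
    leftover = frontier[-1:] if len(frontier) % 2 else []
    return pairs, leftover
```

Applied to a frontier of five chains, this yields two merge pairs and one carried-over chain; each pair then becomes a single Graph of Thought Merge call.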
## Appendix G Examples
| Example 1: Multi-step Reasoning over Graph | Example 2: Direct Triple Lookup |
| --- | --- |
| Question: What anatomy can be expressed by gene KRT39? Ground Truth Answer: head, skin of body Model Answer: head, skin of body
Thought 1: The question is related to a gene node (KRT39). We need to find this node in the graph. Action 1: RetrieveNode[KRT39] Observation 1: The ID of the node is 390792. Thought 2: We need to check the "Anatomy-expresses-Gene" neighbors of this gene node. Action 2: NeighbourCheck[390792, Anatomy-expresses-Gene] Observation 2: The neighbors are ['UBERON:0000033', 'UBERON:0002097']. Thought 3: Retrieve names of the anatomy nodes. Action 3: NodeFeature[UBERON:0000033, name], NodeFeature[UBERON:0002097, name] Observation 3: UBERON:0000033 $\rightarrow$ head, UBERON:0002097 $\rightarrow$ skin of body. Thought 4: These are the anatomy terms expressed by the gene. Action 4: Finish[head, skin of body] | Question: What anatomy can be expressed by gene KRT39? Ground Truth Answer: head, skin of body Model Answer: head, skin of body
Thought 1: KRT39 is a gene that is known to be expressed in two anatomical regions. Found Triples:
• "KRT39" --> Anatomy-expresses-Gene --> head
• "KRT39" --> Anatomy-expresses-Gene --> skin of body
Action: Finish[head, skin of body] |
Table 6: Comparison of two LLM-generated reasoning traces for the same question using different strategies: multi-step graph traversal (Example 1) vs. direct triple lookup (Example 2).