# Grounding LLM Reasoning with Knowledge Graphs
**Authors**: Alfonso Amayuelas, UC Santa Barbara
> Work completed during an internship at JP Morgan AI Research
*Foundations of Reasoning in Language Models*
## Abstract
Large Language Models (LLMs) excel at generating natural language answers, yet their outputs often remain unverifiable and difficult to trace. Knowledge Graphs (KGs) offer a complementary strength by representing entities and their relationships in structured form, providing a foundation for more reliable reasoning. We propose a novel framework that integrates LLM reasoning with KGs by linking each step of the reasoning process to graph-structured data. This grounding turns intermediate “thoughts” into interpretable traces that remain consistent with external knowledge. Our approach incorporates multiple reasoning strategies, Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), and is evaluated on GRBench, a benchmark for domain-specific graph reasoning. Our experiments show state-of-the-art (SOTA) performance, with at least 26.5% improvement over CoT baselines. Beyond accuracy, we analyze how step depth, branching structure, and model size influence reasoning quality, offering insights into the conditions that support effective reasoning. Together, these contributions highlight how grounding LLMs in structured knowledge enables both higher accuracy and greater interpretability in complex reasoning tasks.
## 1 Introduction
LLMs have shown remarkable versatility in answering questions posed in natural language. This is mainly due to their ability to generate text, their broad internal knowledge, and their capacity to access external information (toolqa; rag). However, a significant area for improvement is their tendency to produce information that, while plausible-sounding, is often unverifiable and lacks traceable origins and sources (hallucination_survey). This limitation highlights a deeper issue in how LLMs organize and apply reasoning, especially when reliable and accountable outputs are required.
The generation process of LLMs relies heavily on their internal parameters, making it difficult to link their outputs to external sources (llm_internal_knowledge; llm_kb). This limitation challenges their reliability in industrial applications (llm_explainability_survey). In applied settings, where LLMs handle critical operations, integrating them with domain-specific knowledge is essential. Fine-tuning LLMs for new domains is labor-intensive, especially for companies whose proprietary data raises privacy and legal issues. As a result, there is a need for interventions that guide reasoning processes so that outputs remain accurate and transparent without requiring exhaustive re-training.
Methods such as Retrieval-Augmented Generation (RAG) (rag) and SQL-based querying (llm_eval_sql) address this gap partially. However, they often fail to capture the dynamic relationships between concepts that are necessary for comprehensive understanding. These approaches typically assume that knowledge is well-represented in discrete units, such as documents or tables, which can lead to incomplete insights when dealing with interconnected knowledge that spans multiple sources. This limits the ability to support reasoning over complex queries.
<details>
<summary>figures/general.png Details</summary>

### Visual Description
## Diagram: Agent-Based Knowledge Graph Exploration
### Overview
This diagram illustrates a multi-step process of an agent interacting with a knowledge graph to answer the question: "What is the most frequently cited artwork by Mark Brunswick?". The diagram depicts three steps (Step 1, Step 2, Step n) showing the agent's actions, observations, and internal thoughts, alongside the corresponding exploration within a knowledge graph.
### Components/Axes
The diagram is structured into three main vertical columns: "Agent", "Automatic Exploration", and "Knowledge Graph". Each column is further divided into steps (1, 2, n). Within each step, there are sections for "[Action]", "[Observation]", and "[Thought]". The "Automatic Exploration" column also displays "[Entities]" and "[Triples]". The Knowledge Graph is represented as a circular network of nodes.
### Detailed Analysis or Content Details
**Step 1:**
* **Question:** "What is the most frequently cited artwork by Mark Brunswick?" (Top of diagram, in a curved box)
* **Agent - Action:** `RetrieveNode[Mark Brunswick]`
* **Agent - Observation:** "The node ID is 83029"
* **Agent - Thought 1:** "1. The question is asking for information about Mark Brunswick"
* **Automatic Exploration - Entities:** "Mark Brunswick"
* **Automatic Exploration - Triples:** `(Mark Brunswick, authorOf, Nocturne and Rondo) (Mark Brunswick, authorOf, Symphony in B b)`
* **Knowledge Graph:** A circular network of nodes, with connections between them.
**Step 2:**
* **Agent - Action:** `NeighborCheck[8309, works]`
* **Agent - Observation:** "The Neighbors are Nocturne and Rondo, Symphony in B..."
* **Agent - Thought 2:** "2. Mark Brunswick authored The Master Builder, Symphony in B..."
* **Automatic Exploration - Entities:** "The Master Builder, Symphony in B..."
* **Automatic Exploration - Triples:** `(The Master Builder, citedBy, ) (Mark Brunswick, authorOf, Symphony in B)`
* **Knowledge Graph:** A circular network of nodes, with connections between them.
**Step n:**
* **Agent - Action:** `Finish[The Master Builder]`
* **Agent - Observation:** N/A
* **Agent - Thought n:** "n. The most cited work of Mark Brunswick is The Master Builder"
* **Automatic Exploration - Is the end?:** "The Master Builder"
* **Knowledge Graph:** A circular network of nodes, with connections between them.
### Key Observations
The diagram demonstrates a sequential reasoning process. The agent starts by retrieving a node representing "Mark Brunswick" and then explores its neighbors to identify his works. The process continues until the most frequently cited work is identified as "The Master Builder". The knowledge graph visually represents the relationships between entities and their attributes. The diagram highlights the interplay between the agent's actions, observations, and internal reasoning.
### Interpretation
This diagram illustrates a simplified example of how an agent can leverage a knowledge graph to answer a complex question. The agent uses a series of actions (node retrieval, neighbor checking) to navigate the graph and extract relevant information. The "thoughts" represent the agent's internal reasoning process, which helps it to interpret the observations and refine its search. The diagram suggests a potential approach to building intelligent systems that can reason about knowledge and provide answers to natural language queries. The use of node IDs (e.g., 83029) indicates a structured and formalized representation of knowledge within the graph. The iterative nature of the process (Step 1, Step 2, Step n) suggests that the agent can handle more complex queries that require multiple steps of reasoning. The diagram is a conceptual illustration and does not provide specific details about the underlying knowledge graph or the agent's implementation.
</details>
Figure 1: Methods for Question-Answering in KGs (Section 4). Left: Agent. LLM decides to take one of the predefined actions to connect with the graph. Right: Automatic Graph Exploration. Entities are extracted in each reasoning step, triggering a search for each identified entity.
KGs capture such relationships by organizing entities and their connections in a structured representation. Recent work has begun to explore how KGs can guide reasoning in LLMs (rog). Building on this direction, we propose a framework that integrates reasoning strategies with domain-specific KGs from the GRBench dataset. Each reasoning step is explicitly connected to the graph, producing answers that are both accurate and traceable. We evaluate three strategies, CoT, ToT, and GoT, which provide different ways of organizing reasoning traces (Figure 2). Our framework achieves SOTA performance on GRBench, improving by at least 26.5% over CoT baselines. Beyond accuracy, we analyze how reasoning behaviors vary with step depth, branching, and model size (Section 7), offering insights into how reasoning evolves during inference and how it can be shaped by structured knowledge. Our contributions can be summarized as follows:
- We present a versatile framework that links reasoning steps with graph search, providing an intervention that can be applied across domains and knowledge settings.
- We show substantial gains over existing reasoning strategies, with state-of-the-art performance on GRBench and an improvement of at least 26.5% compared to CoT baselines.
- We conduct a detailed study of how reasoning quality is influenced by step depth, branching structure, and model size, offering insights into the conditions under which different interventions succeed (Section 7).
- By systematically grounding each step in a knowledge graph, we improve both the accuracy and the traceability of outputs, creating reasoning traces that are interpretable and verifiable.
## 2 Related Work
LLMs require substantial data and resources for training (villalobos2024will). Retrieval-Augmented Generation (RAG) enables models to incorporate external evidence at inference time (rag; demonstrate-search-predict). Recent work further combines RAG with structured knowledge, such as ontologies and KGs, to improve factuality and reasoning (li2024structrag), underscoring the growing importance of structured data for robust and domain-adaptable LLMs.
**Structured Knowledge**
Structured knowledge, such as databases or KGs, provides organizations with reliable sources of information that can be systematically maintained and automatically updated. KGs in particular offer an adaptable knowledge model that captures complex relationships between interconnected concepts. Research has explored models that can interact with multiple types of structured knowledge, such as StructLM (structLM), and approaches that incorporate structured knowledge during pretraining to improve model performance (skill_llm_structured_knowledge_infusion).
**Integrating KGs with LLMs**
The integration of KGs with LLMs has emerged as a promising direction to strengthen reasoning capabilities and reliability (peng2024graph). In general, four primary methods can be distinguished: (1) learning graph representations (let_graph_do_thinking; graphllm), though latent representations often underperform compared to text-based methods on Knowledge Graph Question Answering (KGQA) tasks; (2) using Graph Neural Network (GNN) retrievers to extract relevant entities and feeding them as text-based input to the model (g-retriever; gnn-rag); (3) generating code, such as SPARQL queries, to directly retrieve information from graphs (kb-binder); and (4) step-by-step interaction methods that allow iterative reasoning over graphs (rog; decaf; chatkbqa), which currently achieve the strongest results on KGQA benchmarks.
**LLM Reasoning with Graphs**
Beyond retrieval, KGs have also been studied as a means to structure and analyze the reasoning processes of LLMs themselves (wang2024understanding). This integration enables more coherent and contextually relevant outputs while also supporting the tracing and verification of reasoning steps. The most effective methods typically rely on interactive, step-by-step engagement between LLMs and graphs, as discussed above. Examples of this approach demonstrate improved reasoning performance through graph-based scaffolding (think-on-graph; mindmap; rog; kg-gpt; li2025cot). More recent work has further investigated the integration of traditional reasoning strategies, such as CoT and tree-structured reasoning, into KG-based interaction (graphCoT; tree-of-traversals).
Building on these advances, our framework integrates established reasoning strategies directly with domain-specific KGs. Unlike previous methods that treat KGs as retrieval tools or rely on latent representations, our approach systematically links each reasoning step to graph entities and relations.
## 3 Background
In this section, we formalize the prerequisite knowledge relevant to this paper. We use $p_{\theta}$ to denote a pre-trained language model with parameters $\theta$, and letters $x,y,z$ to refer to language sequences. A sequence $x=(x_{1},x_{2},\ldots,x_{n})$ is generated token by token, such that $p_{\theta}(x)=\prod_{i=1}^{n}p_{\theta}(x_{i}\mid x_{1\ldots i-1})$.
**Knowledge Graphs (KGs)**
A KG is a heterogeneous directed graph that contains factual knowledge to model structured information. Nodes represent entities, events, or concepts, while edges represent the connection and types of relations between them. Formally, a KG is represented as $\mathcal{G}$ , defined by a set of triples $\mathcal{G}=\{(h,r,t)\mid h,t\in\mathcal{E},r\in\mathcal{R}\}$ , where $\mathcal{E}$ , $\mathcal{R}$ denote the set of entities and relations, respectively. KGs provide a structured framework that can guide reasoning processes by explicitly representing the relationships between concepts.
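As a minimal illustration, a KG of this form can be stored as a set of $(h, r, t)$ triples with an index for neighbor lookups; the entities and relations below are illustrative, not drawn from any particular benchmark:

```python
from collections import defaultdict

# A toy KG as a set of (head, relation, tail) triples.
triples = {
    ("Mark Brunswick", "authorOf", "Symphony in B flat"),
    ("Mark Brunswick", "authorOf", "The Master Builder"),
    ("The Master Builder", "citedBy", "Survey A"),
}

# Index outgoing edges for fast lookup: entity -> relation -> {tails}.
out_edges = defaultdict(lambda: defaultdict(set))
for h, r, t in triples:
    out_edges[h][r].add(t)

def neighbors(entity, relation):
    """Return tail entities reachable from `entity` via `relation`."""
    return sorted(out_edges[entity][relation])
```

Traversal then reduces to repeated `neighbors` calls, which is the primitive the interaction methods in Section 4.2 build on.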
**Knowledge Graph Question-Answering (KGQA)**
KGQA is a reasoning task that leverages KGs. Given a natural language question, $q$, and an associated KG, $\mathcal{G}$, the goal is to develop a method that retrieves the correct answer, $a$, based on the knowledge extracted from the KG: $a=f(q,\mathcal{G})$. Beyond retrieving facts, KGQA often requires integrating multiple reasoning steps that traverse the graph to connect related concepts.
**Step-by-step Reasoning with LLMs**
To improve the reasoning capabilities of LLMs at inference time, a common approach is to generate intermediate reasoning steps. The key idea is to introduce intermediate steps, $Z_{p_{\theta}}=z_{1},\ldots,z_{n}$, that bridge the question $q$ and the answer $a$. This decomposition allows models to tackle complex, multi-step problems incrementally, focusing computational effort on the parts of the reasoning chain that require deeper analysis. Stepwise reasoning over KGs offers a natural mechanism to track, guide, and interpret the reasoning process.
## 4 Method
This work demonstrates how progressively conditioning LLM reasoning at each step can enhance performance on domain-specific question answering over knowledge graphs. By structuring reasoning into incremental steps that interact with graph data, the model can manage complex dependencies and dynamically refine its conclusions. Our method combines three LLM reasoning strategies (CoT, ToT, and GoT) with two graph-interaction methods: (1) Agent, an agent that navigates the graph; and (2) Automatic Graph Exploration, an automatic graph-traversal mechanism driven by the generated text.
<details>
<summary>figures/cot.jpg Details</summary>

### Visual Description
## Diagram: Conceptual Flow of Thought Process
### Overview
The image depicts a simplified diagram illustrating a conceptual flow of a thought process. It shows a sequence of stages: Input, three stages labeled "Thought" each accompanied by a network-like graphic, and finally, "Answer". The diagram uses arrows to indicate the direction of flow from one stage to the next.
### Components/Axes
The diagram consists of the following components:
* **Input:** An oval shape at the top.
* **Thought (x3):** Three rectangular shapes, each labeled "Thought" and containing a network graphic.
* **Answer:** A rectangular shape at the bottom.
* **Arrows:** Downward-pointing arrows connecting each stage in sequence.
* **Network Graphic:** Each "Thought" stage contains a network of nodes (circles) connected by lines. The nodes are colored in three colors: green, blue, and yellow. The yellow nodes are highlighted with a small square around them.
### Detailed Analysis or Content Details
The diagram shows a linear progression of thought. Each "Thought" stage appears to represent a level of processing or refinement of the initial "Input". The network graphic within each "Thought" stage seems to represent connections and relationships between ideas or concepts.
* **Input:** No specific data or details are provided within the "Input" stage.
* **Thought 1:** The network graphic contains approximately 15-20 nodes. The distribution of colors is roughly equal between green, blue, and yellow.
* **Thought 2:** The network graphic appears similar to "Thought 1" in terms of node count and color distribution.
* **Thought 3:** The network graphic is also similar to the previous two "Thought" stages.
* **Answer:** No specific data or details are provided within the "Answer" stage.
### Key Observations
The diagram emphasizes a sequential process. The repeated "Thought" stages suggest iterative refinement or processing of information. The network graphics within each "Thought" stage imply complex relationships and connections between ideas. The highlighting of yellow nodes within the network graphics could indicate a focus on specific concepts or ideas during each stage of thought.
### Interpretation
This diagram likely represents a simplified model of cognitive processing. The "Input" represents initial information received, and the "Thought" stages represent the internal processing and manipulation of that information. The network graphics symbolize the interconnectedness of ideas and concepts within the mind. The "Answer" represents the final output or conclusion reached after processing.
The iterative nature of the "Thought" stages suggests that the thought process is not always linear but may involve revisiting and refining ideas multiple times. The highlighting of yellow nodes could indicate that certain concepts are more salient or important during the thought process.
The diagram is conceptual and does not provide specific data or quantitative information. It serves as a visual metaphor for the complexity and dynamism of human thought. It could be used to illustrate concepts in psychology, artificial intelligence, or cognitive science. The diagram is a high-level representation and lacks specific details about the nature of the input, the processing steps within each "Thought" stage, or the criteria for arriving at the "Answer".
</details>
(a) Chain of Thought (CoT)
<details>
<summary>figures/tot.jpg Details</summary>

### Visual Description
## Diagram: Neural Network/Decision Tree Flow
### Overview
The image depicts a diagram representing a computational process, likely a neural network or a decision tree. It shows a flow of information from an "Input" node through multiple layers of processing nodes to an "Answer" node. Each processing node appears to perform a test ("T") and has associated feedback mechanisms indicated by checkmarks (green) and crosses (red). The diagram is structured in a hierarchical, tree-like fashion.
### Components/Axes
The diagram consists of the following key components:
* **Input:** A single oval-shaped node at the top center, labeled "Input".
* **Processing Nodes:** Rectangular nodes arranged in three layers. Each node contains the letter "T" and is connected to other nodes via arrows.
* **Feedback Indicators:** Green checkmarks and red crosses associated with some processing nodes, indicating success or failure of a test.
* **Arrows:** Green arrows indicate a positive flow or successful path, while the absence of an arrow indicates a blocked path.
* **Answer:** A single oval-shaped node at the bottom center, labeled "Answer".
* **Network Structure:** Each processing node contains a network of interconnected circles (nodes) and lines (connections).
### Detailed Analysis or Content Details
The diagram can be broken down into layers:
**Layer 1 (Top Layer):**
* Four processing nodes are present.
* Each node has the letter "T" inside.
* Each node is connected to the "Input" node via a green arrow.
* Two nodes have a green checkmark, indicating a successful test.
* Two nodes have a red cross, indicating a failed test.
* Each node contains a network of approximately 20-25 interconnected circles.
**Layer 2 (Middle Layer):**
* Five processing nodes are present.
* Each node has the letter "T" inside.
* The first node is connected to the first node in Layer 1 (with a red cross).
* The second node is connected to the second node in Layer 1 (with a green checkmark).
* The third node is connected to the third node in Layer 1 (with a red cross).
* The fourth node is connected to the fourth node in Layer 1 (with a green checkmark).
* The fifth node is connected to the first and second nodes in Layer 1.
* One node has a green checkmark, indicating a successful test.
* Four nodes have a red cross, indicating a failed test.
* Each node contains a network of approximately 20-25 interconnected circles.
**Layer 3 (Bottom Layer):**
* Five processing nodes are present.
* Each node has the letter "T" inside.
* The first node is connected to the first node in Layer 2 (with a red cross).
* The second node is connected to the second node in Layer 2 (with a green checkmark).
* The third node is connected to the third node in Layer 2 (with a red cross).
* The fourth node is connected to the fourth node in Layer 2 (with a red cross).
* The fifth node is connected to the fifth node in Layer 2 (with a green checkmark).
* One node has a green checkmark, indicating a successful test.
* Four nodes have a red cross, indicating a failed test.
* Each node contains a network of approximately 20-25 interconnected circles.
* The node with the green checkmark is connected to the "Answer" node via a green arrow.
### Key Observations
* The diagram demonstrates a filtering or selection process. Nodes with red crosses appear to block the flow of information.
* The "T" inside each node likely represents a test or condition.
* The checkmarks and crosses provide feedback on the outcome of the test.
* The network structure within each node suggests a complex internal computation.
* The flow of information is not uniform; some paths are blocked while others are successful.
### Interpretation
The diagram likely represents a simplified model of a neural network or a decision tree used for classification or prediction. The "Input" represents the initial data, and the processing nodes perform a series of tests to determine the appropriate "Answer". The checkmarks and crosses indicate whether the tests pass or fail, guiding the flow of information towards the final output. The internal network structure within each node suggests that each test involves a complex computation. The diagram highlights the importance of filtering and selection in the decision-making process. The presence of both successful and failed paths suggests that the system is capable of handling uncertainty and making decisions based on incomplete or noisy data. The diagram is a visual representation of a computational process, and the specific meaning of the "T" and the internal network structure would require further context.
</details>
(b) Tree of Thought (ToT)
<details>
<summary>figures/got.jpg Details</summary>

### Visual Description
## Diagram: Multi-Layered Processing Network
### Overview
The image depicts a multi-layered processing network, resembling a neural network or a similar computational model. Data flows from an "Input" node through several layers of interconnected nodes labeled "T", ultimately leading to an "Answer" node. Each node within the layers appears to have internal connections represented by a network of circles and lines. Some connections are marked with a green checkmark, while others are marked with a red "X". The diagram illustrates a process of filtering or selection, where some connections are accepted (checkmark) and others are rejected (X).
### Components/Axes
The diagram consists of the following key components:
* **Input:** Top-most node, serving as the starting point for data flow.
* **Layers:** Three distinct layers of nodes labeled "T". Each layer contains multiple nodes.
* **Nodes (T):** Rectangular boxes labeled with the letter "T". Each node contains a complex network of smaller circles connected by lines.
* **Connections:** Lines connecting nodes between layers. These lines represent the flow of data or signals.
* **Selection Markers:** Green checkmarks (✓) and red "X" marks (✗) placed on the connections, indicating acceptance or rejection of the signal.
* **Answer:** Bottom-most node, representing the final output of the processing network.
There are no explicit axes or scales in this diagram.
### Detailed Analysis or Content Details
The diagram shows a flow of information from the "Input" to the "Answer" through three layers.
* **Layer 1:** The "Input" splits into six connections, each leading to a node labeled "T". Two of these connections are marked with a red "X", indicating rejection. Four connections are marked with a green checkmark, indicating acceptance.
* **Layer 2:** Each of the four accepted connections from Layer 1 leads to a node labeled "T". One of these connections is marked with a red "X", while three are marked with a green checkmark.
* **Layer 3:** The three accepted connections from Layer 2 lead to nodes labeled "T". One of these connections is marked with a red "X", while two are marked with a green checkmark.
* **Output:** The two accepted connections from Layer 3 converge into the "Answer" node.
The internal structure of each "T" node appears to be a network of approximately 15-20 small circles connected by lines. The arrangement of these circles seems consistent across all "T" nodes.
### Key Observations
* The network progressively filters the input signal, reducing the number of active connections at each layer.
* The rejection rate appears to be approximately 33-50% at each layer.
* The internal structure of the "T" nodes suggests a complex processing unit, potentially performing some form of transformation or computation on the input signal.
* The diagram does not provide any quantitative data about the strength or nature of the signals being processed.
### Interpretation
This diagram likely represents a simplified model of a decision-making process or a filtering mechanism. The "Input" represents the initial data, and the layers of "T" nodes represent stages of processing or evaluation. The green checkmarks and red "X" marks indicate which signals are allowed to pass through the network and which are blocked. The "Answer" node represents the final outcome of this filtering process.
The consistent internal structure of the "T" nodes suggests that each processing unit performs a similar function. The progressive filtering of the signal indicates that the network is designed to select only the most relevant or important information.
The diagram could be interpreted as a representation of:
* **A neural network:** Where the "T" nodes are neurons and the connections are synapses.
* **A decision tree:** Where the "T" nodes represent decision points and the checkmarks/X marks represent the outcomes of those decisions.
* **A signal processing system:** Where the "T" nodes are filters and the checkmarks/X marks represent the pass/fail criteria.
The diagram is abstract and does not provide enough information to determine the specific application or purpose of the network. However, it clearly illustrates a process of selective filtering and information reduction.
</details>
(c) Graph of Thought (GoT)
Figure 2: Reasoning Strategies: This figure illustrates different LLM reasoning strategies to navigate the potential answer space: CoT, ToT, GoT. Each strategy consists of "thoughts" connected to the KG through search methods (Section 4.2) illustrating stepwise reasoning over structured knowledge.
### 4.1 Reasoning Strategies
**Chain-of-Thought (CoT)**
CoT is a well-known reasoning method that generates a sequence of logical steps, where each step builds upon the previous ones, ultimately leading to a conclusion. Formally, it generates a sequence of reasoning steps $Z_{p_{\theta}}(q)=\{z_{1},z_{2},\ldots,z_{n}\}$, where each step $z_{i}$ is sampled sequentially given the input query $q$, all previous steps, and the graph information accumulated across steps, $\mathcal{G}^{\prime}$: $z_{i}\sim p_{\theta}^{\text{CoT}}(z_{i}\mid q,\mathcal{G}^{\prime},z_{1\ldots i-1})$. The final answer $a$ is derived from all generated thoughts: $a\sim p_{\theta}^{\text{CoT}}(a\mid q,\mathcal{G}^{\prime},z_{1\ldots n})$. In practice, the chain is sampled as a continuous language sequence. Figure 2(a) illustrates this method, where each step is linked to the KG.
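The graph-grounded CoT loop can be sketched as follows; `llm` and `retrieve_facts` are placeholder stubs standing in for the model $p_{\theta}$ and the KG lookup, not the paper's implementation:

```python
def llm(prompt: str) -> str:
    """Stand-in for p_theta: in the real system, an LLM call."""
    return "thought about: " + prompt[-40:]

def retrieve_facts(thought: str) -> str:
    """Stand-in for retrieving KG triples relevant to a thought."""
    return "(fact for: " + thought[:20] + ")"

def chain_of_thought(question: str, n_steps: int = 3) -> str:
    thoughts, graph_info = [], []
    for _ in range(n_steps):
        # z_i ~ p_theta(z_i | q, G', z_{1..i-1})
        context = question + " | " + " ".join(graph_info) + " | " + " ".join(thoughts)
        z = llm(context)
        # Ground the new thought in the KG before the next step.
        graph_info.append(retrieve_facts(z))
        thoughts.append(z)
    # a ~ p_theta(a | q, G', z_{1..n})
    return llm("ANSWER: " + question + " | " + " ".join(thoughts))
```

The essential point is the interleaving: retrieval happens after every thought, so each subsequent step is conditioned on fresh graph evidence rather than on the model's parametric memory alone.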
**Tree-of-Thought (ToT)**
ToT generalizes CoT by modeling the reasoning process as a tree, enabling simultaneous exploration of multiple reasoning paths. Starting from an initial state $s_{0}=[q]$ , where $q$ is the input, ToT incrementally expands each state by generating multiple candidate thoughts:
$$
z_{i+1}^{(j)}\sim p_{\theta}(z_{i+1}\mid s_{i}),\quad j=1,\dots,k \tag{1}
$$
Each candidate thought represents a node in the tree, forming new states. These states are evaluated by a heuristic scoring function $V(p_{\theta},s)$ , guiding the selection and pruning of branches. Search strategies, such as breadth-first search (BFS), systematically explore this tree:
$$
S_{t}=\text{argmax}_{S^{\prime}\subseteq\hat{S}_{t},|S^{\prime}|=b}\sum_{s\in S^{\prime}}V(p_{\theta},s) \tag{2}
$$
where $\hat{S}_{t}$ denotes the candidate states at step $t$, and $b$ limits the breadth. We implement two versions of the heuristic function $V$ to select the top $b$ states:
1. Selection: The LLM directly chooses the top $b$ states to proceed, discarding the others.
2. Score: The states are ranked by a heuristic voting mechanism, $V(p_{\theta},S)(s)=\mathbb{P}[s=s^{*}]$, where the LLM is prompted to estimate the probability that the current state solves the given input question.
This structured search and pruning strategy allows the model to evaluate multiple candidate reasoning paths, enabling more deliberate and interpretable reasoning.
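Under stubbed proposal and value functions (placeholders for the LLM calls), the breadth-first search of Eq. (2) reduces to an expand-score-prune loop:

```python
def propose(state, k):
    """Stand-in for sampling k candidate thoughts from p_theta (Eq. 1)."""
    return [state + [f"t{len(state)}.{j}"] for j in range(k)]

def score(state):
    """Stand-in for the heuristic value V(p_theta, s); dummy heuristic here."""
    return -len(state[-1])

def tot_bfs(question, depth=2, k=3, b=2):
    frontier = [[question]]  # s_0 = [q]
    for _ in range(depth):
        # Expand every frontier state into k candidates.
        candidates = [s for state in frontier for s in propose(state, k)]
        # Keep the b highest-valued states (greedy form of Eq. 2).
        frontier = sorted(candidates, key=score, reverse=True)[:b]
    return frontier
```

Note that keeping the top-$b$ states individually is a greedy approximation of the subset-argmax in Eq. (2); since the value $V$ is additive over states, the two coincide here.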
**Graph-of-Thought (GoT)**
GoT extends ToT by organizing reasoning into a directed graph structure $G=(V,E)$, where each node represents a thought and edges reflect dependencies. Starting from an initial thought, new thoughts are generated as in ToT and added to the graph. Each new thought is connected to its parent, and additional reasoning chains can be formed through merging operations:
$$
z_{i+1}=A(z_{i}^{(a)},z_{i}^{(b)}) \tag{3}
$$
where $A$ denotes a merge operation that integrates two thought chains into a single coherent reasoning step. The merged thought is added as a new node with edges from both parents. In our implementation, thoughts are evaluated using either the Selection- or the Score-based strategy, as in ToT. Merged thoughts inherit information from both parents, which can enhance robustness. At each depth, a fixed number of thoughts is retained using breadth-first traversal and evaluated for progression. The graph-based organization captures dependencies and merges information from multiple reasoning chains, supporting dynamic refinement and structured exploration of the reasoning space.
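A minimal sketch of the merge operation in Eq. (3), with the LLM-based merge function $A$ stubbed out and the graph kept as a node set plus a child-to-parents edge map:

```python
def merge(thought_a, thought_b):
    """Stand-in for A(z^(a), z^(b)): the real system prompts the LLM."""
    return f"merged({thought_a}, {thought_b})"

def got_merge(graph, parents, thought_a, thought_b):
    """Add A(z_a, z_b) as a new node with edges from both parents."""
    z_new = merge(thought_a, thought_b)
    graph.add(z_new)
    parents[z_new] = [thought_a, thought_b]
    return z_new

graph = {"z1", "z2"}   # existing thought nodes
parents = {}           # child -> [parent_a, parent_b] dependency edges
z3 = got_merge(graph, parents, "z1", "z2")
```

Because the merged node records both parents, the resulting trace remains a DAG over thoughts, which is what distinguishes GoT from the strictly branching tree of ToT.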
### 4.2 LLM + KG Interaction Methods
We implement methods that connect reasoning strategies with KGs. The LLM interacts with the KG at every step, retrieving new information that conditions the model for subsequent steps. We present two methods to achieve this interaction, both illustrated in Appendix B.
#### 4.2.1 Agent
This approach creates an agent that interacts with the graph, following the methodology initially described in ReACT (react). After generating a thought, the LLM selects from a set of actions based on that thought, so each step in the reasoning chain consists of an interleaved sequence: thought $\rightarrow$ action $\rightarrow$ retrieved data. This method implements four actions, as described in GraphCoT (graphCoT):
- RetrieveNode (Text): identifies the related node in the graph using semantic search.
- NodeFeature (NodeID, FeatureName): retrieves textual information for a specific node from the graph.
- NeighborCheck (NodeID, EdgeType): retrieves neighbor information for a specific node.
- NodeDegree (NodeID, EdgeType): returns the degree (number of neighbors) for a given node and edge type.

These actions collectively enable the agent to navigate and extract meaningful information from the graph, enhancing the reasoning capabilities of the LLM by grounding its thoughts in structured, retrievable data.
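A toy sketch of this action interface, with a dict-backed graph and exact-match lookup standing in for real semantic search (node IDs, features, and edges are illustrative):

```python
# Toy graph store: node ID -> name, textual features, and typed edges.
NODES = {
    83029: {"name": "Mark Brunswick",
            "features": {"occupation": "composer"},
            "edges": {"authorOf": [101, 102]}},
    101: {"name": "Symphony in B flat", "features": {}, "edges": {}},
    102: {"name": "The Master Builder", "features": {}, "edges": {}},
}

def retrieve_node(text):
    """RetrieveNode(Text): exact match here; semantic search in practice."""
    return next(i for i, n in NODES.items() if n["name"] == text)

def node_feature(node_id, feature):
    """NodeFeature(NodeID, FeatureName): textual attribute of a node."""
    return NODES[node_id]["features"][feature]

def neighbor_check(node_id, edge_type):
    """NeighborCheck(NodeID, EdgeType): names of typed neighbors."""
    return [NODES[i]["name"] for i in NODES[node_id]["edges"].get(edge_type, [])]

def node_degree(node_id, edge_type):
    """NodeDegree(NodeID, EdgeType): number of typed neighbors."""
    return len(NODES[node_id]["edges"].get(edge_type, []))
```

In the full system, the LLM emits one of these action names with arguments at each step, and the returned value becomes the observation conditioning the next thought.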
#### 4.2.2 Automatic Graph Exploration
This method incrementally searches the graph by interleaving language generation with structured retrieval. At each step, the LLM generates a new "thought" based on previous thoughts and retrieved triples. Entities mentioned in the generated text are automatically extracted using LLM prompts and serve as anchors for further graph exploration.
<details>
<summary>figures/graph_search.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Construction Pipeline
### Overview
The image depicts a simplified pipeline for constructing a knowledge graph from text data. The pipeline consists of three main stages: Entity Extraction, Relation Selection, and Neighbor Selection, culminating in the formation of Triples representing the knowledge graph.
### Components/Axes
The diagram is structured horizontally, with a clear flow from left to right. The stages are labeled as follows:
* **Entity Extraction:** Represented by a document icon and a robot icon.
* **Relation Selection:** A network graph with connections between entities.
* **Neighbor Selection:** A filtering stage.
* **Triples:** The final output, represented as pairs of connected rectangles.
The entities are labeled:
* Entity1 (blue rectangle)
* Entity2 (blue rectangle)
* Entity3 (blue rectangle)
* EntityA (green rectangle)
The relations are labeled:
* Rel1 (orange arrow)
* Rel2 (red arrow)
The final output is labeled "Triples".
### Detailed Analysis or Content Details
1. **Entity Extraction:** A document icon on the far left feeds into a robot icon, which then outputs three entities: Entity1, Entity2, and Entity3. These entities are represented as blue rectangles.
2. **Relation Selection:** The entities are then fed into a network graph. This graph shows connections between the entities and EntityA. The connections are represented by arrows.
* Rel1 (orange arrow) connects to EntityA and is marked with a green checkmark, indicating a valid relation.
* Rel2 (red arrow) connects to EntityA and is marked with a red 'X', indicating an invalid relation.
3. **Neighbor Selection:** EntityA is shown three times, each connected to a different relation. Two of these connections are marked with green checkmarks, and one is marked with a red 'X'. This suggests a filtering process based on relation validity.
4. **Triples:** The final output consists of three pairs of connected rectangles, representing the triples.
* The first triple consists of a blue rectangle and a green rectangle, connected by an orange line.
* The second triple consists of a blue rectangle and a green rectangle, connected by an orange line.
* The third triple consists of a blue rectangle and a green rectangle, connected by a red line.
### Key Observations
* The pipeline demonstrates a process of filtering relations to ensure the quality of the knowledge graph.
* The red 'X' symbols indicate that not all relations are considered valid or relevant.
* The final triples represent the extracted knowledge in a structured format.
* The diagram does not provide any quantitative data, but illustrates a conceptual process.
### Interpretation
The diagram illustrates a common approach to knowledge graph construction. It begins with identifying entities from text, then determining the relationships between those entities. A crucial step is filtering these relationships to ensure accuracy and relevance. The final output, the triples, represents the core building blocks of the knowledge graph – subject, predicate (relation), and object. The use of checkmarks and 'X' symbols highlights the importance of relation validation in building a high-quality knowledge graph. The diagram suggests a process where the system attempts to link entities, but not all links are deemed valid, and only the valid ones are incorporated into the final knowledge representation. The diagram is a conceptual illustration and does not contain specific data points or numerical values. It focuses on the flow and logic of the process.
</details>
Figure 3: Automatic Graph Exploration. Entities are extracted from the text (query/thought); the LLM then selects relevant relations and neighbors. The resulting entity-relation-entity combinations form triples that expand the reasoning chain.
Graph exploration proceeds through a multi-step Search + Prune pipeline, inspired by the process described in think-on-graph. For each unvisited entity, the system first retrieves and prunes relation types using LLM guidance. Then, for each selected relation, neighboring entities are discovered and filtered in a second round of pruning. The model selects only the most relevant neighbors based on their contextual fit with the question and previous reasoning steps. This hierarchical pruning – first on relations, then on entities – keeps the method computationally tractable while preserving interpretability. The overall traversal follows a breadth-first search (BFS) pattern, with pruning decisions at each level directed by the LLM. This process is shown in Figure 3. The iterative reasoning-and-retrieval loop allows the model to condition future steps on progressively relevant subgraphs, shaping the reasoning trajectory. Unlike agentic methods that rely on predefined actions, the automatic approach operates in the graph space guided by natural language, providing more freedom in generation. The mechanism is designed to maximize information gain at each step while avoiding graph overgrowth. More details are provided in Algorithm 1.
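The Search + Prune traversal can be sketched as follows. This is a simplified view of the procedure described above; the `rank_relations`/`rank_entities` LLM hooks and the graph accessors are hypothetical names standing in for prompted pruning calls:

```python
def explore(graph, llm, anchors, depth=3, top_rel=3, top_ent=3):
    """BFS over the KG with two-stage LLM pruning: relations first, then entities."""
    triples, frontier, visited = [], list(anchors), set(anchors)
    for _ in range(depth):
        next_frontier = []
        for entity in frontier:
            # Stage 1: prune relation types for this entity (LLM-guided).
            relations = llm.rank_relations(entity, graph.relations(entity))[:top_rel]
            for rel in relations:
                # Stage 2: prune neighboring entities under each kept relation.
                candidates = graph.neighbors(entity, rel)
                for nb in llm.rank_entities(entity, rel, candidates)[:top_ent]:
                    triples.append((entity, rel, nb))  # grows the retrieved subgraph
                    if nb not in visited:
                        visited.add(nb)
                        next_frontier.append(nb)
        frontier = next_frontier
    return triples
```

The two truncations (`top_rel`, `top_ent`) are what bound the growth of the frontier at each BFS level.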
## 5 Experiments
**Benchmark**
We use the GRBench dataset to evaluate our methods. This dataset is specifically designed to evaluate how effectively LLMs can perform stepwise reasoning over domain-specific graphs, and it includes several graphs spanning a range of domains. For our evaluation, we selected 7 graphs across multiple domains, excluding those whose RAM requirements exceed our available resources. Comprehensive graph statistics are provided in Appendix A.
**Baselines**
The proposed methods, Agent and Automatic Graph Exploration, applied to CoT, ToT, and GoT, are compared against the following baselines: (1) Zero-Shot: directly querying the model to answer the question without additional context. (2) Text RAG (rag_survey): a text-retrieval method that uses the text representation of nodes as the query input, with retrieved data serving as context for the model. (3) Graph RAG: includes node neighbors (1-hop) as additional context beyond Text RAG. (4) Graph CoT (Agent): implements Graph CoT as an agent for CoT reasoning, using the actions described in Section 4.2. These baselines allow us to measure the impact of stepwise, knowledge-grounded reasoning versus simple retrieval-augmented or zero-shot approaches.
**Experimental methods**
We implement the methods described in Section 4, extending (1) Agent and (2) Automatic Graph Exploration with various reasoning strategies during inference: (1) CoT, (2) ToT, and (3) GoT. For ToT and GoT, we evaluate the impact of stepwise decision-making using the State Evaluation methods: (1) Selection and (2) Score. In the results presented in Table 1, we set $n=10$ steps for all methods. ToT and GoT use a branching factor and selection size of $k=t=3$. Our experiments focus on the effect of structured reasoning interventions on performance and stepwise refinement of answers. We use only the open-access Llama 3.1 (Instruct) models (llama3models) as backends, which enhances reproducibility and allows unlimited free calls. Specifically, we employ the 8B, 70B, and 405B versions, using the FP8 variant for the 405B model.
**Evaluation**
We use rule-based and model-based metrics to evaluate the models, following the GRBench paper (graphCoT). For the rule-based metric, we use Rouge-L (R-L) (rouge_metric), which measures the longest sequence of words appearing in the same order in both the generated text and the ground-truth answer. For the model-based metric, we prompt GPT-4o to assess whether the model's output matches the ground-truth answer; GPT4Score is the percentage of answers that GPT-4o identifies as correct. These evaluation methods capture not only final-answer accuracy but also the fidelity of the reasoning steps, reflecting the effectiveness of our interventions in guiding LLM reasoning over structured knowledge.
**Implementation Details**
The experiments are run on NVIDIA TITAN RTX or NVIDIA A100 GPUs using Python 3.8. The models are deployed with vLLM (vllm), a memory-efficient library for LLM inference and serving. For the baselines, Mpnet-v2 is used as the retriever, and FAISS (faiss) is employed for indexing.
## 6 Results
The main results from both the baselines and experimental methods, evaluated using R-L, are presented in Table 1. For brevity, additional results using GPT4Score can be found in Appendix D. Together, these findings allow us to compare different forms of reasoning interventions (agentic action selection, automatic graph exploration, and structured multi-path search) on their ability to guide LLMs toward accurate answers. We highlight three key insights: (1) The agentic method generally outperformed automatic graph exploration, indicating that targeted interventions on knowledge-graph traversal enhance answer accuracy. (2) The ToT strategy performed best by effectively exploring multiple reasoning paths, showcasing the benefits of inference-time interventions that diversify reasoning trajectories. (3) Although the GoT strategy showed potential, it did not significantly outperform ToT, suggesting that merging divergent reasoning paths remains a challenging intervention-design problem. These results underscore the importance of reasoning strategies in enabling LLMs to navigate multiple paths in the graph, while also illustrating the limits of current intervention techniques.
| Method | | Eval. | Model | Healthcare | Goodreads | Biology | Chemistry | Materials Science | Medicine | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | Base | | Llama 3.1 8B-Ins | 7.32 | 6.18 | 10.68 | 11.69 | 8.95 | 8.77 | 11.52 |
| | | | Llama 3.1 70B-Ins | 9.74 | 9.79 | 11.49 | 12.58 | 10.40 | 12.21 | 12.61 |
| | | | Llama 3.1 405B-Ins | 8.66 | 12.49 | 10.52 | 13.51 | 11.73 | 11.82 | 11.63 |
| | Text-RAG | | Llama 3.1 8B-Ins | 8.24 | 14.69 | 12.43 | 11.42 | 9.46 | 10.75 | 11.29 |
| | | | Llama 3.1 70B-Ins | 10.32 | 18.81 | 11.87 | 16.35 | 12.25 | 12.77 | 12.54 |
| | | | Llama 3.1 405B-Ins | 11.61 | 16.23 | 16.11 | 13.82 | 14.23 | 15.16 | 16.32 |
| | Graph-RAG | | Llama 3.1 8B-Ins | 12.94 | 22.30 | 30.72 | 34.46 | 30.20 | 25.81 | 33.49 |
| | | | Llama 3.1 70B-Ins | 17.95 | 25.36 | 38.88 | 40.90 | 41.09 | 31.43 | 39.75 |
| | | | Llama 3.1 405B-Ins | 16.12 | 23.13 | 37.57 | 42.58 | 37.74 | 33.34 | 40.98 |
| Graph CoT | Agent | | Llama 3.1 8B-Ins | 16.83 | 30.91 | 20.15 | 18.43 | 26.29 | 14.95 | 21.41 |
| | | | Llama 3.1 70B-Ins | 33.48 | 40.98 | 50.00 | 51.53 | 49.60 | 48.27 | 44.35 |
| | | | Llama 3.1 405B-Ins | 28.41 | 36.56 | 41.35 | 48.36 | 47.81 | 42.54 | 35.24 |
| | Graph Explore | | Llama 3.1 8B-Ins | 25.58 | 32.34 | 36.65 | 35.33 | 31.06 | 31.05 | 35.96 |
| | | | Llama 3.1 70B-Ins | 29.41 | 29.60 | 44.63 | 49.49 | 39.23 | 38.87 | 45.52 |
| | | | Llama 3.1 405B-Ins | 28.45 | 43.06 | 36.93 | 38.71 | 47.49 | 55.66 | 32.73 |
| Graph ToT | Agent | Score | Llama 3.1 8B-Ins | 28.91 | 52.25 | 43.81 | 44.18 | 43.49 | 36.07 | 39.56 |
| | | | Llama 3.1 70B-Ins | 38.51 | 51.58 | 64.44 | 61.13 | 55.19 | 63.00 | 55.33 |
| | | | Llama 3.1 405B-Ins | 47.51 | 50.73 | 70.34 | 64.90 | 49.02 | 65.40 | 44.63 |
| | | Select | Llama 3.1 8B-Ins | 28.67 | 50.59 | 42.33 | 37.07 | 40.81 | 33.17 | 36.50 |
| | | | Llama 3.1 70B-Ins | 40.26 | 52.59 | 64.53 | 66.84 | 61.42 | 61.21 | 55.89 |
| | | | Llama 3.1 405B-Ins | 46.90 | 51.68 | 70.27 | 67.95 | 63.74 | 64.23 | 59.56 |
| | Graph Explore | Score | Llama 3.1 8B-Ins | 24.49 | 36.80 | 35.81 | 36.41 | 34.28 | 34.49 | 37.69 |
| | | | Llama 3.1 70B-Ins | 32.79 | 38.19 | 53.83 | 58.25 | 48.55 | 52.18 | 48.07 |
| | | | Llama 3.1 405B-Ins | 33.90 | 42.68 | 46.87 | 57.43 | 50.46 | 55.56 | 48.73 |
| | | Select | Llama 3.1 8B-Ins | 25.04 | 37.80 | 36.34 | 38.50 | 32.44 | 33.31 | 34.85 |
| | | | Llama 3.1 70B-Ins | 33.40 | 39.13 | 54.78 | 58.53 | 47.19 | 51.13 | 47.51 |
| | | | Llama 3.1 405B-Ins | 33.82 | 43.63 | 44.47 | 59.06 | 48.52 | 55.62 | 46.07 |
Table 1: Rouge-L (R-L) performance results on GRBench, comparing standard LLMs, Text-RAG, Graph-RAG, Graph-CoT, and Graph-ToT. Experiments are described in Section 5, using Llama 3.1 Instruct backbone models with sizes 8B, 70B, and 405B-FP8.
**Agent vs Graph Search**
In our experiments, the agentic method outperformed the graph exploration approach across most datasets and reasoning strategies. The agent-based method, in which the LLM selects specific actions to interact with the KG, consistently improves as the number of reasoning steps increases, as shown in Section 7. This highlights that explicit, model-driven interventions are more effective than passive expansion strategies, as they promote iterative refinement and selective focus. While graph exploration can quickly provide broad coverage, the agentic method's targeted, stepwise interactions yield more accurate and comprehensive answers over longer reasoning sequences.
**Tree of Thought (ToT)**
The ToT reasoning strategy showed superior performance across its interaction methods and state evaluators, as summarized in Table 1. Relative to the CoT baseline, ToT improved performance by 54.74% in agent mode and 11.74% in exploration mode. However, this improvement comes at the cost of increased inference time, highlighting the price of inference-time reasoning interventions. The success of ToT illustrates how branching interventions that explore multiple candidate paths can substantially enhance reasoning accuracy, especially when coupled with evaluators that prune unpromising trajectories. We also compared the two State Evaluation methods (Selection and Score), finding complementary benefits depending on dataset and scale.
| Method | Eval. | Model | Healthcare | Biology |
| --- | --- | --- | --- | --- |
| Agent | Score | Llama 3.1 8B-Ins | 29.11 | 33.25 |
| | | Llama 3.1 70B-Ins | 30.88 | 56.64 |
| | | Llama 3.1 405B-Ins | 43.53 | 48.10 |
| | Select | Llama 3.1 8B-Ins | 29.05 | 40.37 |
| | | Llama 3.1 70B-Ins | 40.74 | 65.59 |
| | | Llama 3.1 405B-Ins | 47.63 | 71.49 |
| Graph Explore | Score | Llama 3.1 8B-Ins | 24.96 | 21.72 |
| | | Llama 3.1 70B-Ins | 31.24 | 50.70 |
| | | Llama 3.1 405B-Ins | 35.00 | 39.10 |
| | Select | Llama 3.1 8B-Ins | 25.06 | 21.84 |
| | | Llama 3.1 70B-Ins | 36.95 | 52.32 |
| | | Llama 3.1 405B-Ins | 33.74 | 54.64 |
Table 2: Graph-GoT results on GRBench using Rouge-L.
**Graph of Thought (GoT)**
The results for the GoT strategy are summarized in Table 2. Due to the additional computational cost, we report results for two datasets only. GoT did not outperform ToT. Our initial hypothesis was that LLMs could integrate divergent results from multiple branches, but in practice the models struggled to merge them effectively. In particular, in the graph exploration setting, models often failed to combine triples found in separate branches. This reveals a current limitation of aggregation-based reasoning interventions: while branching helps discover diverse facts, robust mechanisms for synthesis and reconciliation remain underdeveloped. This finding opens a direction for future research into more advanced strategies for merging partial reasoning outcomes.
## 7 Analysis & Ablation studies
In this section, we analyze the nuances of our methods for grounding LLMs in KGs. We conduct the analysis on the Academic datasets from the benchmark, as they all contain the same number of samples and feature questions generated from similar templates, ensuring a controlled comparison.
<details>
<summary>figures/analysis_steps_academic_rl.png Details</summary>

### Visual Description
## Line Chart: Rouge-L Score vs. Steps
### Overview
This image presents a line chart comparing the Rouge-L scores of two methods, "Agent" and "Explore", across 10 steps. The chart visualizes the performance of each method as the number of steps increases. Shaded areas around each line represent the standard deviation or confidence interval.
### Components/Axes
* **X-axis:** "Steps", ranging from 1 to 10, with tick marks at each integer value.
* **Y-axis:** "Rouge-L", ranging from 0 to 60, with tick marks at intervals of 10.
* **Data Series 1:** "Agent" - Represented by a blue line with circular data points.
* **Data Series 2:** "Explore" - Represented by an orange line with triangular data points.
* **Legend:** Located in the bottom-right corner, identifying the lines as "Agent" (blue) and "Explore" (orange).
### Detailed Analysis
**Agent (Blue Line):**
The "Agent" line starts at approximately 1 on Step 1 and exhibits a generally upward trend.
* Step 1: ~1
* Step 2: ~8
* Step 3: ~15
* Step 4: ~38
* Step 5: ~45
* Step 6: ~48
* Step 7: ~50
* Step 8: ~52
* Step 9: ~55
* Step 10: ~57
**Explore (Orange Line):**
The "Explore" line starts at approximately 43 on Step 1 and shows a more gradual upward trend, with some fluctuations.
* Step 1: ~43
* Step 2: ~46
* Step 3: ~49
* Step 4: ~50
* Step 5: ~51
* Step 6: ~51
* Step 7: ~52
* Step 8: ~52
* Step 9: ~52
* Step 10: ~52
The shaded areas around each line indicate variability. The "Agent" line has a wider shaded area in the earlier steps (1-4), suggesting greater variance in performance. The "Explore" line has a relatively consistent shaded area throughout.
### Key Observations
* The "Agent" method demonstrates a significant increase in Rouge-L score between Steps 3 and 4, indicating a rapid improvement in performance.
* The "Explore" method shows a more stable, but less dramatic, increase in Rouge-L score.
* By Step 10, the "Agent" method has a Rouge-L score of approximately 57, while the "Explore" method has a score of approximately 52.
* The "Explore" method consistently outperforms the "Agent" method in the initial steps (1-3).
### Interpretation
The chart suggests that the "Agent" method, while initially slower to improve, eventually surpasses the "Explore" method in terms of Rouge-L score as the number of steps increases. This could indicate that the "Agent" method requires a certain number of steps to "learn" or adapt, after which it achieves higher performance. The "Explore" method provides a more consistent, but ultimately lower, level of performance. The shaded areas suggest that the "Agent" method's performance is more variable, particularly in the early stages, potentially due to a more exploratory or unstable learning process. The Rouge-L metric is commonly used to evaluate the quality of text summarization or generation, so this chart likely represents the performance of two different algorithms or strategies in a natural language processing task.
</details>
Figure 4: Effect of the number of steps in the LLM-KG interaction methods. The Agent requires more steps to match the performance of Graph Exploration, which needs only the anchor entities to perform its search within the graph.
<details>
<summary>figures/search_depth_academic_rl.png Details</summary>

### Visual Description
## Line Chart: Rouge-L Score vs. Search Depth
### Overview
This image presents a line chart illustrating the relationship between Search Depth and Rouge-L score. The chart also displays the standard deviation (SD) around the "Explore" line. The data suggests an initial increase in Rouge-L score with increasing search depth, followed by a plateau and slight decline.
### Components/Axes
* **X-axis:** "Search Depth" ranging from 1 to 5, with tick marks at each integer value.
* **Y-axis:** "Rouge-L" ranging from 30 to 70, with tick marks every 5 units.
* **Data Series 1:** "Explore" - Represented by a solid blue line with circular data points.
* **Data Series 2:** "SD (σ)" - Represented by a light blue shaded area, indicating the standard deviation.
* **Legend:** Located in the top-left corner, labeling the two data series.
### Detailed Analysis
The "Explore" line (blue) shows the following approximate values:
* Search Depth 1: Rouge-L ≈ 36
* Search Depth 2: Rouge-L ≈ 41
* Search Depth 3: Rouge-L ≈ 43
* Search Depth 4: Rouge-L ≈ 43.5
* Search Depth 5: Rouge-L ≈ 42.5
The "SD (σ)" shaded area represents the standard deviation around the "Explore" line. The standard deviation appears to be largest at Search Depth 1, and decreases slightly as Search Depth increases. The approximate range of the standard deviation is:
* Search Depth 1: SD ranges from approximately 32 to 40.
* Search Depth 2: SD ranges from approximately 37 to 45.
* Search Depth 3: SD ranges from approximately 40 to 47.
* Search Depth 4: SD ranges from approximately 41 to 46.
* Search Depth 5: SD ranges from approximately 40 to 45.
The "Explore" line initially slopes upward from Search Depth 1 to Search Depth 3, indicating an increase in Rouge-L score. From Search Depth 3 to Search Depth 5, the line plateaus and then slightly declines, suggesting diminishing returns or potential overfitting with increased search depth.
### Key Observations
* The Rouge-L score increases rapidly between Search Depth 1 and 2.
* The Rouge-L score plateaus between Search Depth 3 and 4.
* The standard deviation is relatively consistent across search depths, with a slight decrease as search depth increases.
* The standard deviation is significant, indicating variability in the Rouge-L score for a given search depth.
### Interpretation
The chart suggests that increasing the search depth initially improves the Rouge-L score, likely by allowing the model to consider more relevant information. However, beyond a certain point (around Search Depth 3), further increasing the search depth does not lead to significant improvements and may even result in a slight decrease in performance. This could be due to the model retrieving less relevant information or encountering noise as the search depth increases. The standard deviation indicates that the performance is not consistent across different runs or datasets, and there is inherent variability in the Rouge-L score for a given search depth. The data suggests an optimal search depth around 3, where the Rouge-L score is maximized while maintaining a reasonable level of stability.
</details>
Figure 5: Effect of search depth in the Graph Exploration interaction method for a fixed number of steps. The method achieves relatively good performance with the anchor entities extracted from the question.
**How does the number of steps affect the results?**
Figure 4 shows the effect of varying the number of steps in the KG interaction methods (Agent, Explore) across all academic datasets. The plots indicate that graph exploration performs better with fewer steps, as it automatically traverses the graph from the identified anchor entities. Conversely, the agentic method improves as the number of steps increases, eventually achieving better performance. This validates our framework's design choice to support both exploration and agentic strategies, each of which excels in a complementary regime.
<details>
<summary>figures/tree_width_academic_rl.png Details</summary>

### Visual Description
## Line Chart: Rouge-L Score vs. Tree Width
### Overview
This image presents a line chart illustrating the relationship between Tree Width (on the x-axis) and Rouge-L score (on the y-axis) for different models: ToT-Explore, CoT, and the standard deviation (SD) around ToT-Explore. The chart aims to demonstrate how performance, as measured by Rouge-L, changes with varying tree widths.
### Components/Axes
* **X-axis:** "Tree Width" with values ranging from 1 to 5.
* **Y-axis:** "Rouge-L" with values ranging from 30 to 70.
* **Data Series:**
* "ToT-Explore" (represented by a solid orange line with circular markers).
* "SD (σ)" (represented by a shaded peach-colored area).
* "CoT" (represented by black 'x' markers).
* **Legend:** Located in the top-right corner of the chart.
### Detailed Analysis
* **ToT-Explore:** The line representing ToT-Explore starts at approximately 42 when Tree Width is 1. It then increases sharply to around 58 at Tree Width 2, continues to rise more gradually to approximately 62 at Tree Width 3, plateaus around 61 at Tree Width 4, and slightly decreases to approximately 61 at Tree Width 5.
* **SD (σ):** The shaded area representing the standard deviation is centered around the ToT-Explore line. It starts at approximately 58 at Tree Width 1, rises to around 64 at Tree Width 2, remains relatively stable between 63 and 65 from Tree Width 2 to 5.
* **CoT:** The CoT data points are consistently around 63, regardless of the Tree Width. Specifically:
* Tree Width 1: Approximately 63
* Tree Width 2: Approximately 63
* Tree Width 3: Approximately 63
* Tree Width 4: Approximately 63
* Tree Width 5: Approximately 63
### Key Observations
* ToT-Explore shows significant improvement in Rouge-L score as Tree Width increases from 1 to 2, but the gains diminish with further increases in Tree Width.
* The standard deviation (SD) around ToT-Explore remains relatively consistent across all Tree Widths, indicating stable performance.
* CoT maintains a constant Rouge-L score of approximately 63 across all Tree Widths.
* CoT outperforms ToT-Explore at small Tree Widths, with the performance gap narrowing as Tree Width increases.
### Interpretation
The data suggests that increasing the Tree Width initially leads to substantial improvements in the performance of the ToT-Explore model, as measured by the Rouge-L score. However, beyond a certain point (around Tree Width 2), the benefits of increasing Tree Width diminish. The consistent performance of CoT indicates that its performance is not significantly affected by changes in Tree Width. The relatively stable standard deviation around ToT-Explore suggests that the model's performance is reliable and consistent across different Tree Widths. The initial rapid increase in ToT-Explore's score could be due to the model's ability to explore a wider range of possibilities with larger Tree Widths, but the diminishing returns may indicate that the model reaches a point of saturation where further increases in Tree Width do not lead to significant improvements. The fact that CoT remains constant suggests it may not benefit from the same exploration capabilities as ToT-Explore.
</details>
Figure 6: Impact of tree width on Agentic ToT performance. It shows a general trend of performance improvement with increasing tree width.
<details>
<summary>figures/state_evaluators.png Details</summary>

### Visual Description
## Bar Chart: Rouge-L Scores for Agent and Explore Models
### Overview
This bar chart compares the Rouge-L scores of two models, "Agent" and "Explore", across three different model sizes: 8B, 70B, and 405B. Each model size has two bars representing "Score" and "Select", with error bars indicating variance. The y-axis represents the Rouge-L score, and the x-axis represents the model size.
### Components/Axes
* **Title:** The chart is divided into two sections labeled "Agent" and "Explore".
* **Y-axis:** "Rouge-L" with a scale ranging from 30 to 70, incrementing by 10.
* **X-axis:** Model sizes: "8B", "70B", and "405B".
* **Legend:** Located at the bottom-center of the chart.
* "Score" - represented by a white fill and a solid border.
* "Select" - represented by a diagonal striped fill.
* **Error Bars:** Black vertical lines extending above and below each bar, indicating the standard deviation or confidence interval.
### Detailed Analysis
The chart consists of six groups of bars, three for "Agent" and three for "Explore", each with two bars representing "Score" and "Select".
**Agent:**
* **8B:**
* Score: Approximately 41, with an error bar ranging from roughly 37 to 45.
* Select: Approximately 38, with an error bar ranging from roughly 34 to 42.
* **70B:**
* Score: Approximately 63, with an error bar ranging from roughly 59 to 67.
* Select: Approximately 60, with an error bar ranging from roughly 56 to 64.
* **405B:**
* Score: Approximately 65, with an error bar ranging from roughly 61 to 69.
* Select: Approximately 63, with an error bar ranging from roughly 59 to 67.
**Explore:**
* **8B:**
* Score: Approximately 34, with an error bar ranging from roughly 30 to 38.
* Select: Approximately 32, with an error bar ranging from roughly 28 to 36.
* **70B:**
* Score: Approximately 55, with an error bar ranging from roughly 51 to 59.
* Select: Approximately 52, with an error bar ranging from roughly 48 to 56.
* **405B:**
* Score: Approximately 57, with an error bar ranging from roughly 53 to 61.
* Select: Approximately 54, with an error bar ranging from roughly 50 to 58.
### Key Observations
* For both "Agent" and "Explore", the Rouge-L scores generally increase with model size.
* The "Agent" model consistently achieves higher Rouge-L scores than the "Explore" model across all model sizes.
* The difference in scores between "Score" and "Select" is relatively small for each model size and for both models.
* The error bars indicate a significant amount of variance in the scores, particularly for the smaller model sizes (8B).
### Interpretation
The data suggests that increasing model size improves Rouge-L scores for both the "Agent" and "Explore" models. The "Agent" model consistently outperforms the "Explore" model, indicating that it is a more effective model for the task being evaluated (likely text generation or summarization, given the use of Rouge-L). The error bars suggest that the results may not be statistically significant for the smaller model sizes due to the high variance. The relatively small difference between "Score" and "Select" suggests that the method used to select the best output does not significantly impact the Rouge-L score. The Rouge-L metric is a recall-focused metric for evaluating text summarization or machine translation by counting overlapping n-grams between the generated text and the reference text. Higher Rouge-L scores indicate better overlap and, therefore, better quality. The chart demonstrates a clear positive correlation between model size and performance, as measured by Rouge-L.
</details>
Figure 7: Influence of the state evaluators in ToT. The Select method obtains better results than the Score method.
**What is the effect of search depth in Automatic Graph Exploration?**
Figure 5 presents performance results across various search depths, with the number of steps fixed at one. The results show that performance plateaus at a depth of 3, highlighting the relevance of search exploration with respect to the given query. Beyond this point, deeper traversal yields no significant gains, likely due to the diminishing relevance of distant nodes. This shows why shallow, targeted exploration is sufficient in our framework, keeping the search efficient without sacrificing accuracy.
**What is the effect of tree width in the reasoning strategy (ToT)?**
Across all academic datasets, we observe performance variations among the different methods. To gain further insight, Figure 6 shows the effect of tree width on results. We notice a slight upward trend in performance as the tree width increases, although the difference is most pronounced between CoT and ToT itself, i.e., moving from one branch to two. The added computational time and resources likely contribute to this performance gain.
**What is the influence of the state evaluator?**
Figure 7 shows the impact of the state evaluators, Score and Select, within the ToT framework. While there is no significant difference between the two methods, the Select evaluator generally yields slightly better results. This trend is especially evident for the Agent, though the advantage is less pronounced in automatic graph exploration.
How are errors different for each strategy?
<details>
<summary>figures/pie_cot.png Details</summary>

### Visual Description
## Pie Chart: Percentage Distribution
### Overview
The image is a pie chart illustrating the percentage distribution of three categories. The chart is visually divided into three segments, each representing a different proportion of the whole. The percentages are directly labeled on the chart.
### Components/Axes
The chart consists of three segments, each with a distinct color:
* **Blue:** Represents 34.1%
* **Green:** Represents 60.4%
* **Orange:** Represents 5.5%
There are no explicit axes or a legend beyond the color-coding and percentage labels. The chart is a simple representation of proportions.
### Detailed Analysis
The pie chart displays the following data:
* **Category 1 (Blue):** 34.1%
* **Category 2 (Green):** 60.4%
* **Category 3 (Orange):** 5.5%
The largest segment is green, occupying the majority of the pie chart. The blue segment is significantly smaller, and the orange segment is the smallest.
### Key Observations
* The green category dominates the distribution, representing over 60% of the total.
* The orange category is a minor component, contributing less than 6% to the total.
* The sum of the percentages is 34.1% + 60.4% + 5.5% = 100%, confirming that the chart represents a complete distribution.
### Interpretation
The data suggests a highly skewed distribution, where one category (green) is substantially more prevalent than the others. This could represent a variety of scenarios, such as market share, demographic breakdown, or resource allocation. The small percentage represented by the orange category indicates it is a relatively insignificant factor in the overall context. Without further context, it is difficult to determine the specific meaning of these percentages, but the chart clearly highlights the dominance of the green category. The chart is a simple visual aid to demonstrate the relative proportions of three different categories.
</details>
(a) CoT
<details>
<summary>figures/pie_tot.png Details</summary>

### Visual Description
## Pie Chart: Percentage Distribution
### Overview
The image is a pie chart illustrating the distribution of percentages across three categories. The chart is visually divided into three segments, each representing a different proportion of the whole. The percentages are directly labeled on each segment.
### Components/Axes
The chart lacks traditional axes. It consists of three segments, each representing a category with an associated percentage. There is no explicit legend, as the percentages are directly embedded within each segment.
### Detailed Analysis
The pie chart displays the following data:
* **Segment 1 (Green):** 67.8% - This segment occupies the largest portion of the pie chart, positioned at the bottom.
* **Segment 2 (Orange):** 21.3% - This segment is positioned to the left of the green segment.
* **Segment 3 (Blue):** 10.9% - This segment is positioned at the top of the pie chart.
The sum of the percentages is 100% (67.8 + 21.3 + 10.9 = 100.0%).
### Key Observations
The distribution is heavily skewed towards the green segment, which represents the majority (67.8%) of the total. The orange segment represents a significant minority (21.3%), while the blue segment represents a small portion (10.9%).
### Interpretation
The data suggests a strong dominance of one category (represented by the green segment) over the other two. This could represent a situation where a particular option, characteristic, or group is significantly more prevalent than others. The relatively small size of the blue segment indicates that the corresponding category is less common or less important in the context of the data. Without further context, it is difficult to determine the specific meaning of these percentages, but the chart clearly demonstrates a significant imbalance in the distribution. The chart is a simple visual representation of proportions, and its effectiveness relies on the viewer understanding what each segment represents.
</details>
(b) ToT
<details>
<summary>figures/pie_got.png Details</summary>

### Visual Description
## Pie Chart: Percentage Distribution
### Overview
The image presents a pie chart illustrating the distribution of percentages across three categories. The chart is visually divided into three segments, each representing a different percentage of the whole.
### Components/Axes
The chart lacks explicit axis labels or a formal legend. The percentages are directly labeled on each segment. The segments are differentiated by color: teal, blue, and orange.
### Detailed Analysis
The pie chart displays the following data:
* **Teal Segment:** Represents 67.8% of the total. This segment occupies the largest portion of the pie chart, positioned at the bottom.
* **Blue Segment:** Represents 20.0% of the total. This segment is positioned at the top.
* **Orange Segment:** Represents 12.2% of the total. This segment is positioned to the top-right.
The sum of the percentages is 67.8% + 20.0% + 12.2% = 100.0%, confirming that the chart represents a complete distribution.
### Key Observations
The teal segment dominates the distribution, representing a significant majority (67.8%) of the whole. The blue segment represents a substantial minority (20.0%), while the orange segment represents the smallest portion (12.2%).
### Interpretation
The data suggests a highly skewed distribution, where one category (represented by teal) is significantly more prevalent than the other two. This could represent a variety of scenarios, such as market share, demographic breakdown, or resource allocation. Without further context, it is difficult to determine the specific meaning of the data. The large difference between the teal segment and the other two suggests a strong dominance of the category it represents. The chart effectively visualizes the relative proportions of each category, highlighting the disparity in their contributions to the whole.
</details>
(c) GoT
<details>
<summary>figures/pie_legend.png Details</summary>

### Visual Description
## Legend: Status Indicators
### Overview
The image presents a legend defining the meaning of three colored squares. Each square is associated with a textual label indicating a specific status or outcome. This is a simple key for interpreting information presented elsewhere, likely in a report or dashboard.
### Components/Axes
The image consists of three horizontally aligned elements:
* **Blue Square:** Labeled "Reached limit"
* **Teal Square:** Labeled "Found answer but not returned"
* **Orange Square:** Labeled "Wrong reasoning"
There are no axes or scales present, as this is a key/legend rather than a chart or graph.
### Detailed Analysis or Content Details
The image provides a direct mapping between color and status:
* **Blue:** Indicates a situation where a limit has been reached.
* **Teal:** Indicates a situation where a correct answer was found, but was not successfully returned or communicated.
* **Orange:** Indicates a situation where the reasoning process was flawed or incorrect.
### Key Observations
The legend uses distinct colors to represent different types of failures or outcomes. The labels are concise and descriptive. The arrangement is horizontal, suggesting a simple, linear relationship between the statuses.
### Interpretation
This legend is likely used in a system that evaluates the performance of an automated process, such as a question-answering system or a problem-solving algorithm. The statuses indicate different points of failure: reaching a resource limit, finding a solution but failing to deliver it, and arriving at an incorrect conclusion due to faulty logic. The presence of these distinct statuses suggests a detailed monitoring and error-tracking system is in place. The legend is a crucial component for understanding the results of this system. It allows users to quickly identify the nature of any problems encountered.
</details>
Figure 8: Error distribution across strategies. ToT and GoT reduce unanswered cases but increase logical errors due to more complex reasoning.
To understand failure patterns, we define three error types: (1) Reached limit — the reasoning hit the step limit; (2) Answer found but not returned — the correct answer appeared but was not output; (3) Wrong reasoning step — the model followed an illogical step. Using GPT-4o, we labeled a larger set of answers and traces. We observe in Figure 8 that ToT and GoT show more “answer found but not returned” cases than CoT, suggesting better retrieval but occasional failures in synthesis. This comes with a slight rise in logical errors, likely due to the complexity of multiple reasoning paths.
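As an illustration, the labeling step can be sketched as follows. The prompt wording, the `ERROR_TYPES` label strings, and the `parse_label` fallback are our own simplifications rather than the exact setup used with GPT-4o:

```python
# Sketch of the trace-labeling step; in our experiments the judge model is
# GPT-4o behind a chat API (the API call itself is omitted here).

ERROR_TYPES = [
    "reached_limit",        # (1) the reasoning hit the step limit
    "answer_not_returned",  # (2) the correct answer appeared but was not output
    "wrong_reasoning",      # (3) the model followed an illogical step
]

def build_judge_prompt(question: str, gold: str, trace: str) -> str:
    """Compose the classification request sent to the judge model."""
    return (
        "Classify the failure in this reasoning trace.\n"
        f"Question: {question}\nGold answer: {gold}\nTrace:\n{trace}\n"
        f"Reply with exactly one label from: {', '.join(ERROR_TYPES)}."
    )

def parse_label(judge_reply: str) -> str:
    """Map a free-form judge reply onto one of the known error types."""
    reply = judge_reply.strip().lower()
    for label in ERROR_TYPES:
        if label in reply:
            return label
    return "wrong_reasoning"  # conservative default for unparsable replies
```

Forcing the judge to answer with one closed-set label keeps the aggregation over a large set of traces trivial to automate.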
## 8 Limitations
In this work, we demonstrate how LLMs can be used to explore a graph while conditioning the next steps based on the graph’s results. We show that the two approaches presented achieve superior results in graph exploration. Integrating KGs with LLMs can provide complex relational knowledge for LLMs to leverage. However, the overall effectiveness depends heavily on both the coverage and quality of the underlying graph, as well as the capabilities of the language model.
Extending inference-time reasoning methods for LLMs is significantly constrained by computational resources and the time available to the user. We analyze the computational complexity of the methods in Appendix E, where we show the exponential growth of ToT and GoT due to their branching structure. GoT further compounds this by allowing merges between reasoning paths, which increases the total number of evaluations. Additionally, loading large graphs into memory introduces substantial RAM overhead, limiting applicability to resource-rich environments.
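As a rough sketch of this growth (assuming each reasoning step expands up to $b$ candidate thoughts and the search runs to depth $d$; Appendix E gives the precise analysis), the number of LLM calls scales as

$$N_{\mathrm{CoT}} = O(d), \qquad N_{\mathrm{ToT}} = O\Big(\textstyle\sum_{i=1}^{d} b^{i}\Big) = O\big(b^{d}\big),$$

with GoT incurring additional evaluations on top of the ToT count for each merge between reasoning paths.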
While LLMs conditioned on external knowledge can generate outputs based on accessed content, their generated output is not strictly limited to that information. Thus, they may still hallucinate. Our framework helps mitigate this risk by grounding reasoning in explicit graph structure, but does not eliminate it.
## 9 Conclusion
We present a framework for grounding LLM reasoning in KGs by integrating each step with structured graph retrieval. By combining strategies like CoT, ToT, and GoT with adaptive graph search, our method achieves state-of-the-art performance on GRBench. Beyond performance, we find that explicitly linking reasoning steps to graph structure offers a more interpretable view of how LLMs navigate knowledge. The approach enables inference-time reasoning, offering flexibility across domains, and suggests a path toward reasoning interventions that are both systematic and transparent. Future work includes extending this framework to larger and more heterogeneous graphs, and exploring how structured retrieval can guide reasoning in domains where accuracy and verifiability are critical.
## Disclaimers
This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorganChase and its affiliates ("JP Morgan") and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
©2025 JPMorganChase. All rights reserved.
## Appendix A GRBench Statistics
Detailed statistics of the graphs in GRBench graphCoT are shown in Table 3. Academic Graphs contain 3 types of nodes: paper, author, and venue. Literature Graphs contain 4 types of nodes: book, author, publisher, and series. The Healthcare Graph contains 11 types of nodes: anatomy, biological process, cellular component, compound, disease, gene, molecular function, pathway, pharmacologic class, side effect, and symptom. Questions are created from multiple templates labeled as easy, medium, or hard, depending on the number of nodes required to answer them.
| Domain | Topic | # Nodes | # Edges | # Templates | # Questions |
| --- | --- | --- | --- | --- | --- |
| Academic | Biology | $\sim$ 4M | $\sim$ 39M | 14 | 140 |
| Academic | Chemistry | $\sim$ 4M | $\sim$ 30M | 14 | 140 |
| Academic | Material Science | $\sim$ 3M | $\sim$ 22M | 14 | 140 |
| Academic | Medicine | $\sim$ 6M | $\sim$ 30M | 14 | 140 |
| Academic | Physics | $\sim$ 2M | $\sim$ 33M | 14 | 140 |
| Literature | Goodreads | $\sim$ 3M | $\sim$ 22M | 24 | 240 |
| Healthcare | Disease | $\sim$ 47K | $\sim$ 4M | 27 | 270 |
| Sum | - | - | - | 121 | 1210 |
Table 3: Detailed statistics of the GRBench graphCoT.
## Appendix B LLM ↔ KG Interaction Pipelines
Description of the two LLM + KG Interaction Pipelines in their CoT form:
1. Agent – Figure 9. A pipeline where the LLM alternates between generating a reasoning step, selecting an explicit action (e.g., retrieving a node, checking neighbors), and observing results from the KG until termination.
2. Automatic Graph Exploration – Figure 10. A pipeline where entities are automatically extracted from the LLM’s generated text and used to guide iterative graph traversal with pruning, progressively expanding the reasoning chain.
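The Agent loop (pipeline 1) can be sketched in a few lines of Python. The toy `GRAPH` schema and the scripted `policy` are illustrative stand-ins for the KG and the LLM; the action names follow Figure 9 (RetrieveNode is omitted for brevity):

```python
# Minimal sketch of the Agent pipeline: Query -> Thought -> Action ->
# Observation, repeated until Finish or the step limit.

GRAPH = {
    "p1": {"features": {"title": "Graph Reasoning"}, "neighbors": ["a1"]},
    "a1": {"features": {"name": "Alice"}, "neighbors": ["p1"]},
}

def run_agent(policy, max_steps=5):
    """Alternate Thought -> Action -> Observation until Finish or the limit."""
    observation = None
    for _ in range(max_steps):
        thought, action, arg = policy(observation)   # Thought + Action selection
        if action == "Finish":                       # Finish[answer]
            return arg
        if action == "NodeFeature":                  # NodeFeature[N, F]
            node, feature = arg
            observation = GRAPH[node]["features"].get(feature)
        elif action == "NeighborCheck":              # NeighborCheck[N, E]
            observation = GRAPH[arg]["neighbors"]
        elif action == "NodeDegree":                 # NodeDegree[N, E]
            observation = len(GRAPH[arg]["neighbors"])
    return None  # step limit reached without a Finish action

# A scripted stand-in for the LLM policy: look up p1's neighbors, then answer.
def policy(observation):
    if observation is None:
        return "Find who wrote p1.", "NeighborCheck", "p1"
    return "The neighbor list is the answer.", "Finish", observation

print(run_agent(policy))  # → ['a1']
```

In the real pipeline the `policy` call is an LLM prompt that emits the thought and the next action in text form, and the observation is serialized back into the prompt context.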
<details>
<summary>figures/agent_pipeline.png Details</summary>

### Visual Description
## Diagram: Agent-Knowledge Graph Interaction Loop
### Overview
The image depicts a diagram illustrating an agent's interaction loop with a knowledge graph. The loop consists of four stages: Query, Thought, Action, and Observation, with the Action stage branching into several specific actions that interact with the Knowledge Graph. The diagram visually represents a process where an agent receives a query, processes it into a thought, translates that thought into an action, observes the result from a knowledge graph, and repeats the cycle.
### Components/Axes
The diagram consists of four main rectangular blocks labeled "Query", "Thought", "Action", and "Observation", arranged horizontally from left to right and connected by arrows indicating the flow of the process. Below the "Action" block is a list of actions: "RetrieveNode[Text]", "NodeFeature[N,F]", "NeighborCheck [N,E]", "NodeDegree[N,E]", and "Finish[answer]". To the bottom-right of the diagram is a visual representation of a "Knowledge Graph" consisting of nodes and edges.
### Detailed Analysis or Content Details
The diagram shows a sequential process:
1. **Query:** The starting point, represented by a rectangular box labeled "Query".
2. **Thought:** An agent icon (a small robot) is placed inside a rectangular box labeled "Thought".
3. **Action:** Another agent icon is placed inside a rectangular box labeled "Action". Below this box, a list of actions is provided:
* **RetrieveNode[Text]:** Retrieves a node based on text input.
* **NodeFeature[N,F]:** Extracts features (F) from a node (N).
* **NeighborCheck [N,E]:** Checks the neighbors (E) of a node (N).
* **NodeDegree[N,E]:** Determines the degree of a node (N) based on its edges (E).
* **Finish[answer]:** Finalizes the process and provides an answer.
4. **Observation:** A rectangular box labeled "Observation".
5. **Knowledge Graph:** A visual representation of a graph with green nodes and blue edges. The graph appears to be a network of interconnected nodes.
The arrows connecting the blocks indicate the flow of information. An arrow points from "Action" to the "Knowledge Graph", indicating that the actions are performed on the knowledge graph. An arrow points from the "Knowledge Graph" to "Observation", indicating that the observation is based on the results from the knowledge graph.
### Key Observations
The diagram highlights the iterative nature of the agent's interaction with the knowledge graph. The agent doesn't simply query the graph once but engages in a loop of thought, action, and observation to refine its understanding and ultimately arrive at an answer. The specific actions listed under "Action" suggest that the agent can perform various operations on the knowledge graph, such as retrieving nodes, extracting features, checking neighbors, and determining node degrees.
### Interpretation
This diagram illustrates a common architecture in knowledge-based AI systems, particularly those employing agents that interact with knowledge graphs. The loop represents a reasoning process where the agent uses the knowledge graph as its source of information and iteratively refines its understanding through a series of actions and observations. The actions listed suggest a focus on graph traversal and feature extraction, which are essential for tasks such as knowledge discovery, question answering, and reasoning. The diagram suggests a system designed for complex problem-solving where the agent needs to explore and analyze a knowledge graph to arrive at a solution. The use of agent icons suggests an autonomous or semi-autonomous system capable of independent reasoning and action. The diagram does not provide any specific data or numerical values, but rather a conceptual framework for how an agent interacts with a knowledge graph.
</details>
Figure 9: Agent Pipeline: (1) Input Query, (2) Thought Generation, (3) Action Selection, (4) Environment Observation from the Knowledge Graph. The process repeats until a termination action is generated or the step limit is reached.
<details>
<summary>figures/graph_explore_pipeline.png Details</summary>

### Visual Description
## Diagram: AI Reasoning Process Flow
### Overview
The image depicts a diagram illustrating a sequential process flow for an AI reasoning system. The process begins with a "Query" and proceeds through stages of "Thought", "Entity Extraction", "Graph Search", and a decision point "End?". The "Graph Search" stage interacts with a "Knowledge Graph". The diagram uses rectangular boxes to represent process steps and arrows to indicate the flow of information. Small robot-like icons are placed within the "Thought", "Entity Extraction", and "Graph Search" boxes.
### Components/Axes
The diagram consists of the following components:
* **Query:** The initial input to the system.
* **Thought:** A processing stage involving reasoning.
* **Entity Extraction:** A stage focused on identifying entities.
* **Graph Search:** A stage that utilizes a knowledge graph.
* **End?:** A decision point to determine if the process should terminate.
* **Knowledge Graph:** A data structure representing relationships between entities.
* **Arrows:** Indicate the sequential flow of the process.
* **Robot Icons:** Represent the AI agent performing the tasks.
There are no axes or scales present in this diagram.
### Detailed Analysis or Content Details
The diagram shows a linear process flow:
1. **Query** initiates the process.
2. The process moves to **Thought**, represented by a robot icon with a thinking bubble.
3. From **Thought**, the process proceeds to **Entity Extraction**, represented by a robot icon with a magnifying glass.
4. **Entity Extraction** leads to **Graph Search**, represented by a robot icon with a network symbol.
5. **Graph Search** interacts with the **Knowledge Graph**, depicted as a network of nodes (green) and connections (blue). An arrow points *from* the Knowledge Graph *to* the Graph Search box, indicating data retrieval.
6. The process then reaches the **End?** decision point.
7. If the answer to "End?" is no, the process loops back to the beginning, starting with a new **Query**.
The Knowledge Graph is positioned at the bottom-center of the diagram. The robot icons are consistently styled and positioned within their respective boxes.
### Key Observations
The diagram highlights a cyclical reasoning process. The AI system iteratively refines its understanding through a series of steps, leveraging a knowledge graph for information retrieval. The loop back from "End?" to "Query" suggests an iterative refinement process. The diagram does not provide any quantitative data or specific details about the algorithms or techniques used in each stage.
### Interpretation
This diagram illustrates a common architecture for AI systems that rely on knowledge representation and reasoning. The process demonstrates how an AI agent can take a query, reason about it, extract relevant entities, search a knowledge graph for related information, and then potentially iterate on the process to refine its understanding or generate a more accurate response. The Knowledge Graph serves as the system's memory and source of factual information. The cyclical nature of the process suggests a continuous learning or refinement loop. The diagram is a high-level conceptual overview and does not delve into the specifics of implementation. It is a visual representation of a common AI reasoning pipeline.
</details>
Figure 10: Automatic Graph Exploration Pipeline: (1) Input Query, (2) Thought Generation, (3) Entity Extraction (from the query or thought with the LLM), (4) Automatic Graph Search as described in Algorithm 1, (5) Query the LLM for an answer or continue.
Algorithm 1 presents the pseudocode for the Automatic Graph Exploration described in Section 4.2.2.
Algorithm 1 Graph Exploration Algorithm
1: procedure GraphExplore ( $LLM,seen\_entities,search\_depth$ )
2: $relevant\_attributes$ , $found\_triples\leftarrow\emptyset$
3: for $depth$ in $search\_depth$ do
4: for $entity$ in $seen\_entities$ do
5: if $seen\_entities[entity]$ == Visited then
6: Continue
7: else
8: $seen\_entities[entity]\leftarrow$ Visited
9: end if
10: $head\_entity\_name$ , $entity\_attributes$ , $neighbors\leftarrow$ Graph[ $entity$ ]
11: $pruned\_neighbors\leftarrow$ prune_relations( $LLM,neighbors$ )
12: $pruned\_neighbors\leftarrow$ prune_entities( $LLM,pruned\_neighbors$ )
13: $found\_triples\leftarrow$ generate_triples( $entity,pruned\_neighbors$ )
14: end for
15: $seen\_entities\leftarrow$ Update( $seen\_entities$ , neighbors)
16: if End?( $LLM,found\_triples,relevant\_attributes$ ) == True then
17: break
18: end if
19: end for
20: return $found\_triples,relevant\_attributes,seen\_entities$
21: end procedure
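For concreteness, the pseudocode above translates roughly into the following Python sketch. The `prune_relations`, `prune_entities`, and `should_end` callables stand in for the LLM-guided pruning and the End? check in Algorithm 1, and the graph encoding is our own simplification:

```python
# Python sketch of Algorithm 1 (Automatic Graph Exploration).
# graph: entity id -> (name, attributes, [(relation, neighbor_id), ...])
# seen_entities: entity id -> visited flag

def graph_explore(graph, seen_entities, search_depth,
                  prune_relations, prune_entities, should_end):
    relevant_attributes, found_triples = [], []
    for _ in range(search_depth):
        frontier = []
        for entity in list(seen_entities):
            if seen_entities[entity]:          # already expanded: skip
                continue
            seen_entities[entity] = True       # mark as visited
            name, attributes, neighbors = graph[entity]
            relevant_attributes.append((entity, attributes))
            # LLM-guided pruning of relations, then of tail entities:
            pruned = prune_entities(prune_relations(neighbors))
            found_triples += [(name, rel, tail) for rel, tail in pruned]
            frontier += [tail for _, tail in pruned]
        for neighbor in frontier:              # expand the frontier
            seen_entities.setdefault(neighbor, False)
        if should_end(found_triples, relevant_attributes):
            break
    return found_triples, relevant_attributes, seen_entities
```

In the actual framework the two pruning calls and the stopping check each query the LLM, conditioning the next expansion on what the graph has returned so far.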
## Appendix C Performance results in plots
Figures 11 and 12 illustrate the performance results using the Rouge-L and GPT4Score metrics, respectively, on the healthcare graph for all methods. The experiments were run on the Llama 3.1 Instruct models (8B, 70B, and 405B-FP8) and demonstrate the improved performance achieved through more complex reasoning and search strategies during inference.
<details>
<summary>figures/healthcare_rl_all.png Details</summary>

### Visual Description
## Bar Chart: Rouge-L Score vs. Models and State Evaluators
### Overview
This bar chart compares the Rouge-L scores of different language models (8B, 70B, and 405B) across various state evaluators (Base, Text-RAG, Graph-RAG, Graph-CoT Agent, Graph-CoT Explore, Graph-ToT Agent, Graph-ToT Explore, Graph-GoT Agent, Graph-GoT Explore). The Rouge-L score is plotted on the y-axis, and the state evaluators are displayed on the x-axis. Each state evaluator has three bars representing the performance of the three models.
### Components/Axes
* **Y-axis:** Rouge-L Score (Scale: 0 to 50, increments of 10)
* **X-axis:** State Evaluators (Categories: Base, Text-RAG, Graph-RAG, Graph-CoT Agent, Graph-CoT Explore, Graph-ToT Agent, Graph-ToT Explore, Graph-GoT Agent, Graph-GoT Explore)
* **Legend:**
* Models:
* 8B (Blue)
* 70B (Orange)
* 405B (Green)
* State Evaluators:
* Score (Solid bars)
* Select (Hatched bars)
### Detailed Analysis
The chart consists of nine groups of three bars, one for each state evaluator. Within each group, the bars represent the Rouge-L scores for the 8B, 70B, and 405B models, respectively.
* **Base:**
* 8B: Approximately 7.
* 70B: Approximately 9.
* 405B: Approximately 8.
* **Text-RAG:**
* 8B: Approximately 10.
* 70B: Approximately 12.
* 405B: Approximately 11.
* **Graph-RAG:**
* 8B: Approximately 17.
* 70B: Approximately 20.
* 405B: Approximately 16.
* **Graph-CoT Agent:**
* 8B: Approximately 26.
* 70B: Approximately 34.
* 405B: Approximately 30.
* **Graph-CoT Explore:**
* 8B: Approximately 28.
* 70B: Approximately 30.
* 405B: Approximately 29.
* **Graph-ToT Agent:**
* 8B: Approximately 32.
* 70B: Approximately 35.
* 405B: Approximately 33.
* **Graph-ToT Explore:**
* 8B: Approximately 33.
* 70B: Approximately 34.
* 405B: Approximately 35.
* **Graph-GoT Agent:**
* 8B: Approximately 28.
* 70B: Approximately 32.
* 405B: Approximately 44.
* **Graph-GoT Explore:**
* 8B: Approximately 31.
* 70B: Approximately 33.
* 405B: Approximately 36.
Generally, the 405B model consistently outperforms the 8B and 70B models, especially in the Graph-GoT Agent state evaluator. The 70B model generally outperforms the 8B model.
### Key Observations
* The largest performance difference between models is observed with the "Graph-GoT Agent" state evaluator, where the 405B model achieves a significantly higher Rouge-L score (approximately 44) compared to the 8B (approximately 28) and 70B (approximately 32) models.
* The "Base" state evaluator shows the lowest Rouge-L scores across all models.
* The performance gap between the 8B and 70B models is relatively consistent across most state evaluators.
* The "Graph-ToT Explore" state evaluator shows the highest scores for the 405B model.
### Interpretation
The data suggests that increasing model size (from 8B to 70B to 405B) generally improves performance, as measured by the Rouge-L score. The choice of state evaluator significantly impacts performance, with "Graph-GoT Agent" yielding the most substantial gains from larger models. This indicates that the Graph-GoT Agent approach is particularly effective at leveraging the capabilities of larger language models. The low scores for the "Base" evaluator suggest that more sophisticated evaluation methods (like those involving graphs and CoT reasoning) are necessary to accurately assess the performance of these models. The consistent trend of larger models performing better supports the hypothesis that model capacity is a key factor in achieving higher Rouge-L scores. The differences between the "Agent" and "Explore" variants within the Graph-based evaluators suggest that the exploration strategy also plays a role in performance.
</details>
Figure 11: Performance results using the Rouge-L metric on the healthcare graph of GRBench graphCoT, comparing all methods with Llama 3.1 Instruct models of various sizes (8B, 70B, 405B-FP8). Experimental details are included in Section 5.
<details>
<summary>figures/healthcare_gpt4score_all.png Details</summary>

### Visual Description
## Bar Chart: GPT-4 Score vs. Models & State Evaluators
### Overview
This bar chart compares the GPT-4 scores of different models (8B, 70B, and 405B) across various state evaluators: Base, Text-RAG, Graph-RAG, Graph-CoT Agent, Graph-CoT Search, Graph-ToT Agent, and Graph-GoT Search. Each state evaluator has three bars representing the performance of each model. The chart uses a grouped bar format to facilitate comparison.
### Components/Axes
* **X-axis:** State Evaluators - Base, Text-RAG, Graph-RAG, Graph-CoT Agent, Graph-CoT Search, Graph-ToT Agent, Graph-GoT Search.
* **Y-axis:** GPT-4 Score (Scale from 0 to 40, increments of 10).
* **Legend:**
* Models:
* 8B (Blue)
* 70B (Orange)
* 405B (Green)
* State Evaluators:
* Score (Solid bars)
* Select (Hatched bars) - *Note: The "Select" bars are not consistently present across all state evaluators.*
### Detailed Analysis
The chart presents GPT-4 scores for each model and state evaluator combination. The following data points are extracted, with approximate values due to the visual nature of the chart:
* **Base:**
* 8B: ~13
* 70B: ~15
* 405B: ~17
* **Text-RAG:**
* 8B: ~17
* 70B: ~22
* 405B: ~26
* **Graph-RAG:**
* 8B: ~16
* 70B: ~21
* 405B: ~23
* **Graph-CoT Agent:**
* 8B: ~24
* 70B: ~32
* 405B: ~31
* **Graph-CoT Search:**
* 8B: ~26
* 70B: ~28
* 405B: ~32
* **Graph-ToT Agent:**
* 8B: ~28
* 70B: ~33
* 405B: ~35
* **Graph-GoT Search:**
* 8B: ~27
* 70B: ~30
* 405B: ~43
**Trends:**
* **Model Performance:** Generally, the 405B model consistently outperforms the 70B and 8B models across all state evaluators. The 70B model typically performs better than the 8B model.
* **State Evaluator Impact:** The GPT-4 scores vary significantly depending on the state evaluator used. The "Base" evaluator yields the lowest scores, while "Graph-GoT Search" generally produces the highest scores.
* **Select Bars:** The "Select" bars are present for Graph-CoT Agent, Graph-ToT Agent, and Graph-GoT Search. These bars are generally slightly lower than the corresponding "Score" bars.
### Key Observations
* The largest performance jump appears to occur when moving from the 8B to the 70B model, particularly in the Text-RAG and Graph-CoT Agent evaluators.
* The 405B model shows the most consistent high performance, with a particularly strong result in the Graph-GoT Search evaluator (~43).
* The "Select" bars suggest a potential difference in evaluation methodology or a different metric being measured, but the impact is relatively small.
### Interpretation
The data suggests that model size is a significant factor in achieving higher GPT-4 scores, with the 405B model demonstrating superior performance. The choice of state evaluator also plays a crucial role, with more sophisticated evaluators (those involving graph-based approaches like Graph-CoT, Graph-ToT, and Graph-GoT) leading to higher scores. This indicates that leveraging graph structures and reasoning capabilities can improve performance. The presence of "Select" bars implies an alternative evaluation criterion, potentially focusing on a specific aspect of model behavior, but its impact is less pronounced than the model size or evaluator type. The consistent upward trend in scores as the evaluators become more complex suggests that these evaluators are better at capturing the nuances of model performance. The outlier is the 405B model's performance on Graph-GoT Search, which is significantly higher than any other combination, indicating a particularly effective synergy between the model and this specific evaluation setup.
</details>
Figure 12: Performance results using the GPT4Score metric on the healthcare graph of GRBench graphCoT, comparing all methods with Llama 3.1 Instruct models of various sizes (8B, 70B, 405B-FP8). Experimental details are included in Section 5.
## Appendix D Results on GPT4Score
In this section, we present the results of the experiments described in Section 5 for all methods, using the GPT4Score metric. This metric calculates the percentage of "correct" answers as judged by GPT-4 when presented with both the correct and the generated answer. The tables in this section present the same data as in Tables 1 and 2, but evaluated using GPT4Score.
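The metric itself reduces to a simple aggregation. In the sketch below, the `judge` callable abstracts the GPT-4 judgment call (its prompt is not shown), so this illustrates the bookkeeping rather than the exact evaluation harness:

```python
# Sketch of GPT4Score aggregation: the percentage of answers judged correct.

def gpt4score(examples, judge):
    """examples: iterable of (question, gold_answer, generated_answer);
    judge returns True when GPT-4 deems the generated answer correct."""
    examples = list(examples)
    correct = sum(1 for q, gold, gen in examples if judge(q, gold, gen))
    return 100.0 * correct / len(examples)

# Dummy exact-match judge, for illustration only:
print(gpt4score([("q1", "a", "a"), ("q2", "b", "c")],
                lambda q, gold, gen: gold == gen))  # → 50.0
```

Using an LLM judge rather than exact match lets paraphrased but correct answers count toward the score.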
| Method | Variant | Evaluator | Model | Healthcare | Goodreads | Biology | Chemistry | Materials Science | Medicine | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | Base | - | Llama 3.1 8B-Ins | 11.85 | 13.33 | 10.71 | 11.43 | 7.86 | 7.87 | 9.29 |
| Baselines | Base | - | Llama 3.1 70B-Ins | 12.96 | 19.17 | 10.00 | 12.14 | 11.43 | 11.43 | 12.86 |
| Baselines | Base | - | Llama 3.1 405B-Ins | 15.55 | 26.67 | 12.86 | 12.14 | 12.14 | 13.57 | 12.14 |
| Baselines | Text-RAG | - | Llama 3.1 8B-Ins | 11.85 | 21.67 | 12.86 | 10.00 | 10.00 | 8.57 | 7.86 |
| Baselines | Text-RAG | - | Llama 3.1 70B-Ins | 12.22 | 27.50 | 12.14 | 13.57 | 13.57 | 13.57 | 12.86 |
| Baselines | Text-RAG | - | Llama 3.1 405B-Ins | 12.96 | 26.67 | 15.00 | 13.57 | 12.86 | 14.29 | 13.57 |
| Baselines | Graph-RAG | - | Llama 3.1 8B-Ins | 14.81 | 32.50 | 29.29 | 29.28 | 27.86 | 25.71 | 29.29 |
| Baselines | Graph-RAG | - | Llama 3.1 70B-Ins | 17.04 | 32.92 | 39.29 | 40.71 | 43.57 | 34.29 | 40.00 |
| Baselines | Graph-RAG | - | Llama 3.1 405B-Ins | 18.15 | 31.67 | 37.14 | 42.86 | 40.00 | 36.43 | 41.43 |
| Graph CoT | Agent | - | Llama 3.1 8B-Ins | 18.15 | 32.50 | 20.71 | 19.28 | 25.00 | 14.29 | 21.43 |
| Graph CoT | Agent | - | Llama 3.1 70B-Ins | 32.59 | 43.75 | 50.00 | 51.43 | 50.00 | 48.57 | 46.43 |
| Graph CoT | Agent | - | Llama 3.1 405B-Ins | 28.89 | 48.33 | 38.57 | 38.57 | 47.86 | 56.43 | 34.29 |
| Graph CoT | Graph Explore | - | Llama 3.1 8B-Ins | 22.22 | 36.67 | 35.00 | 30.71 | 29.29 | 29.29 | 32.86 |
| Graph CoT | Graph Explore | - | Llama 3.1 70B-Ins | 27.78 | 32.92 | 45.71 | 49.29 | 40.00 | 40.00 | 44.29 |
| Graph CoT | Graph Explore | - | Llama 3.1 405B-Ins | 28.89 | 48.33 | 38.57 | 38.57 | 47.86 | 56.43 | 34.29 |
| Graph ToT | Agent | Score | Llama 3.1 8B-Ins | 30.49 | 55.14 | 43.33 | 41.67 | 44.05 | 36.43 | 39.52 |
| Graph ToT | Agent | Score | Llama 3.1 70B-Ins | 30.49 | 54.48 | 65.48 | 62.14 | 55.95 | 63.57 | 56.19 |
| Graph ToT | Agent | Score | Llama 3.1 405B-Ins | 45.55 | 56.53 | 71.67 | 65.71 | 52.62 | 68.81 | 44.76 |
| Graph ToT | Agent | Select | Llama 3.1 8B-Ins | 30.00 | 54.17 | 40.71 | 37.14 | 40.00 | 32.86 | 36.43 |
| Graph ToT | Agent | Select | Llama 3.1 70B-Ins | 39.63 | 56.67 | 65.00 | 67.14 | 62.86 | 60.71 | 55.55 |
| Graph ToT | Agent | Select | Llama 3.1 405B-Ins | 44.07 | 58.75 | 71.43 | 69.29 | 65.00 | 68.81 | 60.00 |
| Graph ToT | Graph Explore | Score | Llama 3.1 8B-Ins | 21.48 | 41.10 | 32.86 | 31.67 | 31.43 | 32.14 | 35.24 |
| Graph ToT | Graph Explore | Score | Llama 3.1 70B-Ins | 24.94 | 40.97 | 52.38 | 57.86 | 49.29 | 54.29 | 47.86 |
| Graph ToT | Graph Explore | Score | Llama 3.1 405B-Ins | 30.86 | 48.33 | 47.86 | 57.14 | 50.71 | 56.67 | 47.14 |
| Graph ToT | Graph Explore | Select | Llama 3.1 8B-Ins | 21.85 | 41.67 | 32.86 | 31.67 | 31.43 | 32.14 | 35.24 |
| Graph ToT | Graph Explore | Select | Llama 3.1 70B-Ins | 30.37 | 42.08 | 54.29 | 57.14 | 47.86 | 52.14 | 46.43 |
| Graph ToT | Graph Explore | Select | Llama 3.1 405B-Ins | 31.48 | 48.75 | 45.00 | 57.86 | 48.86 | 57.14 | 45.71 |
Table 4: GPT4Score performance results on GRBench (graphCoT), comparing standard LLMs, Text-RAG, Graph-RAG, Graph-CoT, and Graph-ToT. Experiments are described in Section 5, using Llama 3.1 Instruct backbone models with sizes 8B, 70B, and 405B.
| Method | Evaluator | Model | Healthcare | Biology |
| --- | --- | --- | --- | --- |
| Agent | Score | Llama 3.1 8B-Ins | 29.88 | 32.86 |
| | | Llama 3.1 70B-Ins | 29.51 | 61.69 |
| | | Llama 3.1 405B-Ins | 41.81 | 48.33 |
| | Select | Llama 3.1 8B-Ins | 30.00 | 40.71 |
| | | Llama 3.1 70B-Ins | 39.63 | 69.83 |
| | | Llama 3.1 405B-Ins | 44.81 | 72.86 |
| Graph Explore | Score | Llama 3.1 8B-Ins | 22.72 | 21.19 |
| | | Llama 3.1 70B-Ins | 24.20 | 48.57 |
| | | Llama 3.1 405B-Ins | 32.22 | 41.67 |
| | Select | Llama 3.1 8B-Ins | 22.59 | 19.28 |
| | | Llama 3.1 70B-Ins | 32.96 | 52.86 |
| | | Llama 3.1 405B-Ins | 31.48 | 57.86 |
Figure 13: Graph-GoT results (GPT4Score) on GRBench with Llama 3.1 Instruct sizes 8B, 70B, and 405B.
<details>
<summary>figures/radar_levels.png Details</summary>

### Visual Description
## Radar Chart: Difficulty Assessment
### Overview
The image presents a radar chart (also known as a spider chart or star chart) comparing the difficulty levels of three different tests: CoT (black line), ToT (blue line), and GoT (red line). The chart uses a triangular coordinate system with three axes labeled "Easy", "Medium", and "Hard". The values on the axes range from 0.2 to 0.6. The chart visually represents the relative difficulty of each test across these dimensions.
### Components/Axes
* **Axes:** The chart has three axes, labeled "Easy", "Hard", and "Medium". These are positioned at approximately 120-degree intervals around the center of the chart.
* **Scale:** The axes are scaled from 0.2 to 0.6, with tick marks at 0.2, 0.4, and 0.6.
* **Legend:** Located in the top-right corner, the legend identifies the three data series:
* Black line: CoT
* Blue line: ToT
* Red line: GoT
* **Data Series:** Three lines representing the difficulty profiles of CoT, ToT, and GoT. The areas enclosed by these lines are shaded in corresponding colors (black, blue, and red).
### Detailed Analysis
The chart displays the difficulty profiles of the three tests. The lines are not smooth curves but rather consist of straight line segments connecting points.
* **CoT (Black Line):** The CoT line forms a roughly triangular shape. It starts at approximately 0.2 on the "Hard" axis, extends to around 0.4 on the "Medium" axis, and reaches approximately 0.4 on the "Easy" axis before returning to the starting point.
* **ToT (Blue Line):** The ToT line also forms a triangular shape, but it is shifted slightly towards the "Easy" axis compared to CoT. It starts at approximately 0.2 on the "Hard" axis, extends to around 0.4 on the "Medium" axis, and reaches approximately 0.5 on the "Easy" axis before returning to the starting point.
* **GoT (Red Line):** The GoT line is positioned between CoT and ToT. It starts at approximately 0.2 on the "Hard" axis, extends to around 0.4 on the "Medium" axis, and reaches approximately 0.45 on the "Easy" axis before returning to the starting point.
### Key Observations
* All three tests (CoT, ToT, and GoT) are relatively similar in difficulty, with values generally ranging between 0.2 and 0.5 across the three dimensions.
* ToT appears to be the easiest test overall, as its line extends furthest along the "Easy" axis (reaching approximately 0.5).
* CoT appears to be the hardest test overall, as its line is closest to the "Hard" axis.
* The shapes of the lines are similar, suggesting that the tests assess similar types of difficulty.
### Interpretation
The radar chart suggests that the three tests (CoT, ToT, and GoT) have comparable difficulty profiles, but with subtle differences. ToT is perceived as slightly easier than the other two, while CoT is perceived as slightly harder. The triangular shape of the profiles indicates that the tests assess difficulty across a spectrum of "Easy", "Medium", and "Hard" dimensions. The fact that the lines are not smooth suggests that the difficulty levels are not continuous but rather discrete or categorical. The chart provides a visual representation of the relative difficulty of the tests, which can be useful for selecting appropriate tests for different individuals or purposes. The data does not provide any information about the nature of the tests themselves or the specific skills or knowledge they assess.
</details>
Figure 14: Results decomposed into easy/medium/hard questions according to GPT4Score.
## Appendix E Computational Analysis
<details>
<summary>figures/computational_analysis.png Details</summary>

### Visual Description
## Line Chart: Number of Operations vs. Reasoning Steps
### Overview
This image presents a line chart illustrating the relationship between the number of reasoning steps and the number of operations performed by three different methods: CoT (Chain of Thought), ToT (Tree of Thoughts), and GoT (Graph of Thoughts). The chart shows how the computational cost (measured by the number of operations) scales with the complexity of the reasoning process (measured by the number of steps).
### Components/Axes
* **X-axis:** "Reasoning Steps" - Scale ranges from 1 to 10, with integer markers.
* **Y-axis:** "Number of Operations" - Scale ranges from 0 to 200,000, with markers at 0, 50,000, 100,000, 150,000, and 200,000.
* **Legend:** Located in the top-left corner.
* CoT (Chain of Thought) - Black line with circle markers.
* ToT (Tree of Thoughts) - Blue line with triangle markers.
* GoT (Graph of Thoughts) - Red line with square markers.
### Detailed Analysis
* **CoT (Black Line):** The line is relatively flat from step 1 to step 8, remaining below approximately 10,000 operations. It then shows a slight upward trend from step 8 to step 9, reaching around 12,000 operations, and then plateaus at approximately 12,000 operations at step 10.
* Step 1: ~2,000 operations
* Step 2: ~2,000 operations
* Step 3: ~3,000 operations
* Step 4: ~3,000 operations
* Step 5: ~3,000 operations
* Step 6: ~5,000 operations
* Step 7: ~8,000 operations
* Step 8: ~10,000 operations
* Step 9: ~12,000 operations
* Step 10: ~12,000 operations
* **ToT (Blue Line):** The line remains close to zero from step 1 to step 7. It then exhibits a steep upward trend from step 7 to step 10.
* Step 1: ~1,000 operations
* Step 2: ~1,000 operations
* Step 3: ~1,000 operations
* Step 4: ~1,000 operations
* Step 5: ~1,000 operations
* Step 6: ~2,000 operations
* Step 7: ~3,000 operations
* Step 8: ~15,000 operations
* Step 9: ~75,000 operations
* Step 10: ~90,000 operations
* **GoT (Red Line):** The line is nearly flat from step 1 to step 7, staying below approximately 5,000 operations. From step 7 to step 10, it experiences an extremely rapid and exponential increase.
* Step 1: ~1,000 operations
* Step 2: ~1,000 operations
* Step 3: ~2,000 operations
* Step 4: ~2,000 operations
* Step 5: ~2,000 operations
* Step 6: ~3,000 operations
* Step 7: ~4,000 operations
* Step 8: ~10,000 operations
* Step 9: ~60,000 operations
* Step 10: ~200,000 operations
### Key Observations
* CoT exhibits the lowest computational cost and the most stable performance across all reasoning steps.
* ToT and GoT have similar computational costs for the first 7 steps.
* GoT experiences the most dramatic increase in computational cost as the number of reasoning steps increases, exhibiting exponential growth.
* ToT also shows a significant increase in computational cost with more reasoning steps, but it is less pronounced than that of GoT.
* The difference in computational cost between the methods becomes substantial at 8 or more reasoning steps.
### Interpretation
The chart demonstrates a clear trade-off between reasoning complexity and computational cost. CoT is the most efficient method for simple reasoning tasks (few steps), but its performance may be limited for more complex problems. ToT and GoT offer the potential for solving more complex problems, but at the cost of significantly increased computational resources. The exponential growth of GoT suggests that it may become impractical for very complex reasoning tasks due to its high computational demands. The data suggests that the choice of method depends on the specific application and the available computational resources. The rapid increase in operations for GoT and ToT indicates that these methods explore a much larger search space as the number of reasoning steps increases, leading to a combinatorial explosion in computational cost. This highlights the importance of efficient search strategies and pruning techniques for these methods.
</details>
Figure 15: Comparison of computational costs across different reasoning strategies — CoT, ToT, GoT — as a function of reasoning steps. GoT exhibits the highest cost due to merge operations and graph traversal.
| Strategy | Method | Key Parameters | Approx. # LLM Calls | Approx. # KG Operations | Primary Growth Driver(s) |
| --- | --- | --- | --- | --- | --- |
| CoT | Agent | $n$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $n$ (linear) |
| | Graph Explore | $n$, $d$ | $\mathcal{O}(n)$ | $\mathcal{O}(n\cdot\text{Cost}_{\text{Explore}}(d))$ | $n$, $d$ |
| ToT | Agent | $D_{\text{max}}$, $k$, $t$ | $\mathcal{O}\left(k\cdot\frac{t^{D_{\text{max}}}-1}{t-1}\right)$ | Same as LLM Calls | $D_{\text{max}}$, $k$, $t$ (exponential in $D_{\text{max}}$) |
| | Graph Explore | $D_{\text{max}}$, $k$, $t$, $d$ | $\mathcal{O}\left(k\cdot\frac{t^{D_{\text{max}}}-1}{t-1}\right)$ | $\mathcal{O}\left(k\cdot\frac{t^{D_{\text{max}}}-1}{t-1}\cdot\text{Cost}_{\text{Explore}}(d)\right)$ | $D_{\text{max}}$, $k$, $t$, $d$ |
| GoT | Agent | $D_{\text{max}}$, $k$, $t$ | $\mathcal{O}\left(k\cdot\frac{t^{D_{\text{max}}}-1}{t-1}+\sum_{i=1}^{D_{\text{max}}}\left\lfloor\frac{k\cdot t^{i}}{2}\right\rfloor\right)$ | Same as LLM Calls | $D_{\text{max}}$, $k$, $t$ (aggregation adds extra cost) |
| | Graph Explore | $D_{\text{max}}$, $k$, $t$, $d$ | Same as Agent | $\mathcal{O}\left(\text{LLM Calls}\cdot\text{Cost}_{\text{Explore}}(d)\right)$ | $D_{\text{max}}$, $k$, $t$, $d$ |
Table 5: Theoretical computational complexity comparison of reasoning methods. Parameters: $n$ (reasoning steps), $D_{\text{max}}$ (tree depth), $k$ (branching factor), $t$ (paths retained), $d$ (KG search depth), and $\text{Cost}_{\text{Explore}}(d)$ (cost per KG search). GoT includes pairwise aggregation of thoughts at each depth.
The computational analysis summarized in Table 5 highlights the trade-offs between reasoning power and computational cost when grounding LLMs with Knowledge Graphs (KGs). The Agent-based methods scale linearly with the number of reasoning steps or tree nodes, with CoT representing the lowest-cost baseline. In contrast, Automatic Graph Exploration methods introduce additional overhead via entity extraction, multi-hop traversal up to a fixed max_depth, and LLM-based pruning at each step.
Among reasoning strategies, ToT introduces exponential growth in cost with respect to depth due to its exploration of $k$ branches and selection of $t$ continuations per level. GoT further amplifies this by incorporating aggregation transformations that attempt to merge every pair of thoughts at each depth, leading to an additional cost proportional to $\sum_{i=1}^{D_{\text{max}}}\left\lfloor\frac{k\cdot t^{i}}{2}\right\rfloor$ .
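These call counts can be sanity-checked numerically. The sketch below simply evaluates the closed-form expressions from Table 5 (the function names are ours; $k$ is the branching factor, $t$ the paths retained per level, and $D_{\text{max}}$ the tree depth):

```python
from math import floor

def tot_llm_calls(k, t, d_max):
    # ToT: k branches expanded per retained path, t paths kept per level;
    # total calls follow the geometric series k * (t^d_max - 1) / (t - 1).
    return k * (t ** d_max - 1) // (t - 1)

def got_llm_calls(k, t, d_max):
    # GoT adds pairwise aggregation: floor(k * t^i / 2) merge calls at depth i.
    merges = sum(floor(k * t ** i / 2) for i in range(1, d_max + 1))
    return tot_llm_calls(k, t, d_max) + merges
```

For example, with $k=3$, $t=2$, $D_{\text{max}}=3$, ToT issues 21 LLM calls while GoT issues 42: in this configuration the aggregation step alone doubles the budget.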
Importantly, our experiments reveal that the higher complexity of GoT does not consistently translate to improved accuracy compared to ToT, suggesting diminishing returns. While the model size (e.g., 8B, 70B, 405B) influences the latency and memory footprint of each LLM call, it does not affect the algorithmic complexity classes shown. Hence, selecting a strategy requires balancing reasoning depth with feasible compute budgets.
## Appendix F Prompts Archive
In this section, we gather the prompts required to implement the proposed methodologies.
Agent
Agent Step: Solve a question answering task with interleaving Thought, Interaction with Graph, and Feedback from Graph steps. In the Thought step, you can think about what further information is needed, and in the Interaction step, you can get feedback from graphs with four functions: (1) RetrieveNode[keyword], which retrieves the related node from the graph according to the corresponding query. (2) NodeFeature[Node, feature], which returns the detailed attribute information of Node regarding the given "feature" key. (3) NodeDegree[Node, neighbor_type], which calculates the number of "neighbor_type" neighbors of the node Node in the graph. (4) NeighbourCheck[Node, neighbor_type], which lists the "neighbor_type" neighbours of the node Node in the graph and returns them. You may take as many steps as necessary. Here are some examples: {examples} Please answer by providing the node's main feature (e.g., names) rather than node IDs. Generate the next step. Definition of the graph: {graph_definition} Question: {question} {scratchpad}
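For concreteness, the four interaction functions exposed by this prompt can be mocked over a tiny in-memory graph. This is only an illustrative sketch (the dictionary schema is ours, not the paper's implementation), populated with the KRT39 example from Appendix G:

```python
# Hypothetical in-memory stand-ins for the Agent's four graph functions;
# a real deployment would back these with a graph database or index.
GRAPH = {
    "390792": {"name": "KRT39", "neighbors": {
        "Anatomy-expresses-Gene": ["UBERON:0000033", "UBERON:0002097"]}},
    "UBERON:0000033": {"name": "head", "neighbors": {}},
    "UBERON:0002097": {"name": "skin of body", "neighbors": {}},
}

def retrieve_node(keyword):
    # RetrieveNode[keyword]: exact name match here; a real system
    # would use a retriever over node text instead.
    return next((nid for nid, n in GRAPH.items()
                 if n["name"].lower() == keyword.lower()), None)

def node_feature(node, feature):
    # NodeFeature[Node, feature]: look up one attribute of a node.
    return GRAPH[node].get(feature)

def neighbour_check(node, neighbor_type):
    # NeighbourCheck[Node, neighbor_type]: list typed neighbors.
    return GRAPH[node]["neighbors"].get(neighbor_type, [])

def node_degree(node, neighbor_type):
    # NodeDegree[Node, neighbor_type]: count typed neighbors.
    return len(neighbour_check(node, neighbor_type))
```

With this stub graph, `retrieve_node("KRT39")` resolves to node `390792`, whose `Anatomy-expresses-Gene` neighbors map back to "head" and "skin of body", mirroring the trace in Table 6.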
Automatic Graph Exploration
Search Thought: Given the previous thoughts, generate the next thought to answer the provided question. Your end goal is to answer the question step by step. For context, you are also provided with some knowledge triples from a knowledge base. Follow the format of the examples to generate the next thought. {examples} Graph Definition: {graph_definition} Question: {question} Knowledge Triples: {triples} Previous thoughts: {thoughts} Related Entity Attributes: {attributes} Next Thought:

Search End?: You are provided with an original question, the associated subquestion thoughts, and their corresponding knowledge graph triples (head_entity -> relation -> tail_entity). Your task is to answer whether this is sufficient for you to answer the original question (Yes or No). You are provided with examples. You should follow the same format as in the examples, writing 'Yes' or 'No' within brackets at the beginning of the answer. (Examples) Task: Question: {question} Thoughts: {thoughts} Knowledge Triples: {triples} Entity Attributes: {attributes} Answer:

Entity Extraction: Given the provided text, extract the relevant entities that may appear in a knowledge base. Return the relevant entities at the end within brackets, as shown in the following examples. If there are several entities, separate them with commas. (Examples) Task: Text: {text} Relevant Entities:

Prune Relations: From the given entity and relations, select only the relations relevant to answering the question. Provide the answer at the end within brackets, as shown in the following example. (Examples) Question: {question} Head Entity: {entity} Relations: {relations} Answer:

Prune Entities: You are provided with a question, a head entity, a relation, and tail entity or entities from a knowledge base. Select the tail entity or entities needed to answer the question. Return the tail entity or entities at the end within brackets, as shown in the following examples. (Examples) Question: {question} Head Entity: {head_entity} Relation: {relation} Tail Entities: {tail_entities} Relevant Entities:

Search Attributes: Are any of the attributes relevant to answering the question? Return the answer at the end within brackets, as shown in the following examples. (Examples) Question: {question} Entity: {entity} Attributes: {attributes} Relevant Attributes:
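The exploration prompts above chain into a loop: extract entities, expand their relations in the KG, prune, and repeat up to a fixed depth. A minimal self-contained sketch of one such exploration, with the LLM pruning calls replaced by pass-through stubs (all helper names and the dict-based KG are ours, for illustration only):

```python
# KG as a dict of head -> {relation: [tails]}; the two prune_* stubs stand in
# for the "Prune Relations" and "Prune Entities" LLM prompts above.
KG = {
    "KRT39": {"Anatomy-expresses-Gene": ["head", "skin of body"]},
    "head": {},
    "skin of body": {},
}

def prune_relations(question, head, relations):
    # Stub: a real system prompts the LLM; here we keep every relation.
    return relations

def prune_entities(question, head, relation, tails):
    # Stub: a real system prompts the LLM; here we keep every tail.
    return tails

def explore(question, entities, max_depth=2):
    # Expand linked entities hop by hop, collecting pruned triples.
    triples, frontier = [], [e for e in entities if e in KG]
    for _ in range(max_depth):
        next_frontier = []
        for head in frontier:
            for rel in prune_relations(question, head, list(KG[head])):
                tails = prune_entities(question, head, rel, KG[head][rel])
                triples += [(head, rel, t) for t in tails]
                next_frontier += [t for t in tails if t in KG]
        frontier = next_frontier
    return triples
```

On the toy KG, exploring from "KRT39" surfaces exactly the two triples used in the direct-lookup trace of Table 6.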
State Evaluators
Selection Vote: Given a question, you need to select the possible chain of thought that may lead to the correct answer with higher probability. You are provided with several choices with thoughts and related triples from a knowledge base. Decide which choice is most promising to complete the task. Analyze each choice in detail, then conclude in the last line: "The best choice is s", where s is the integer id of the choice. (Examples) Question: {question} Choices: {choices} (+ attributes/triples) Answer:

Score Vote: Generate a score for the given reasoning chain. The score represents the probability that the chain will lead to the correct answer. The chains contain interleaved thoughts and related triples from a knowledge base. Some chains may not be complete, but you need to judge the steps that are provided. The score can be any floating number between 0 and 1. (Examples) Question: {question} Thought Chain: {thoughts} (+ attributes/triples) Score:
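Mechanically, the two evaluators gate the search in different ways: Selection Vote picks a single chain by id, while Score Vote assigns each chain a value used to retain a beam of the top $t$ candidates. A minimal sketch (function names are ours; `judge` and `score_fn` stand in for the prompted LLM):

```python
def select_vote(chains, judge):
    # Selection Vote: the judge returns the integer id of the best chain,
    # parsed from "The best choice is s" in the LLM output.
    return chains[judge(chains)]

def score_prune(chains, score_fn, t):
    # Score Vote: score each partial chain in [0, 1] and keep the top t,
    # i.e., the beam retained at each ToT/GoT level.
    ranked = sorted(chains, key=score_fn, reverse=True)
    return ranked[:t]
```

In the experiments, these correspond to the "Select" and "Score" rows of Tables 4 and 5's call budgets: selection makes one judgment over all candidates, while scoring makes one call per candidate.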
Graph of Thoughts
Graph of Thought Merge: Generate the next thought for the merged chain of thoughts. You are provided with the question, two chains of thoughts, and the corresponding merged chain of thought. Identify inconsistencies or errors from the previous chains and provide the next thought for the merged chain. You should follow the same format as in the examples. (Examples) Question: {question} Chain 1: {chain_1} (+triples/attributes) Chain 2: {chain_2} (+triples/attributes) Merged Chain: {merged_chain} Next Thought:
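The merge prompt is applied pairwise to the surviving chains at each depth. A small sketch of that aggregation step, with `merge_fn` standing in for the prompted LLM call (the helper name is ours):

```python
def got_aggregate(chains, merge_fn):
    # GoT aggregation at one depth: merge chains two at a time; with n chains
    # this issues floor(n / 2) merge calls, matching the extra cost term
    # in the complexity analysis of Appendix E.
    merged = [merge_fn(chains[i], chains[i + 1])
              for i in range(0, len(chains) - 1, 2)]
    if len(chains) % 2:          # an odd leftover chain passes through unmerged
        merged.append(chains[-1])
    return merged
```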
## Appendix G Examples
**Example 1: Multi-step Reasoning over Graph**

Question: What anatomy can be expressed by gene KRT39?
Ground Truth Answer: head, skin of body
Model Answer: head, skin of body

- Thought 1: The question is related to a gene node (KRT39). We need to find this node in the graph.
- Action 1: RetrieveNode[KRT39]
- Observation 1: The ID of the node is 390792.
- Thought 2: We need to check the 'Anatomy-expresses-Gene' neighbors of this gene node.
- Action 2: NeighbourCheck[390792, Anatomy-expresses-Gene]
- Observation 2: The neighbors are ['UBERON:0000033', 'UBERON:0002097'].
- Thought 3: Retrieve names of the anatomy nodes.
- Action 3: NodeFeature[UBERON:0000033, name], NodeFeature[UBERON:0002097, name]
- Observation 3: UBERON:0000033 $\rightarrow$ head, UBERON:0002097 $\rightarrow$ skin of body.
- Thought 4: These are the anatomy terms expressed by the gene.
- Action 4: Finish[head, skin of body]

**Example 2: Direct Triple Lookup**

Question: What anatomy can be expressed by gene KRT39?
Ground Truth Answer: head, skin of body
Model Answer: head, skin of body

- Thought 1: KRT39 is a gene that is known to be expressed in two anatomical regions.
- Found Triples:
  - "KRT39" --> Anatomy-expresses-Gene --> head
  - "KRT39" --> Anatomy-expresses-Gene --> skin of body
- Action: Finish[head, skin of body]
Table 6: Comparison of two LLM-generated reasoning traces for the same question using different strategies: multi-step graph traversal (Example 1) vs. direct triple lookup (Example 2).