2504.02670
# Affordable AI Assistants with Knowledge Graph of Thoughts
**Authors**: Maciej Besta (ETH Zurich), Lorenzo Paleari (ETH Zurich), Jia Hao Andrea Jiang (ETH Zurich), Robert Gerstenberger (ETH Zurich), You Wu (ETH Zurich), Jón Gunnar Hannesson (ETH Zurich), Patrick Iff (ETH Zurich), Ales Kubicek (ETH Zurich), Piotr Nyczyk, Diana Khimey (ETH Zurich), Nils Blach (ETH Zurich), Haiqiang Zhang (ETH Zurich), Tao Zhang (ETH Zurich), Peiran Ma (ETH Zurich), Grzegorz Kwaśniewski (ETH Zurich), Marcin Copik (ETH Zurich), Hubert Niewiadomski, Torsten Hoefler (ETH Zurich)
Abstract
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36× compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Website & code: https://github.com/spcl/knowledge-graph-of-thoughts
1 Introduction
Large Language Models (LLMs) are transforming the world. However, training LLMs is expensive, time-consuming, and resource-intensive. In order to democratize the access to generative AI, the landscape of agent systems has massively evolved during the last two years (LangChain Inc., 2025a; Rush, 2023; Kim et al., 2024; Sumers et al., 2024; Hong et al., 2024; Guo et al., 2024; Edge et al., 2025; Besta et al., 2025c; Zhuge et al., 2024; Beurer-Kellner et al., 2024; Shinn et al., 2023; Kagaya et al., 2024; Zhao et al., 2024a; Stengel-Eskin et al., 2024; Wu et al., 2024). These schemes have been applied to numerous tasks in reasoning (Creswell et al., 2023; Bhattacharjya et al., 2024; Besta et al., 2025c), planning (Wang et al., 2023c; Prasad et al., 2024; Shen et al., 2023; Huang et al., 2023), software development (Tang et al., 2024), and many others (Xie et al., 2024; Li & Vasarhelyi, 2024; Schick et al., 2023; Beurer-Kellner et al., 2023).
Among the most impactful applications of LLM agents is the development of AI assistants capable of helping with a wide variety of tasks. These assistants promise to serve as versatile tools, enhancing productivity and decision-making across domains. From aiding researchers with complex problem-solving to managing day-to-day tasks for individuals, AI assistants are becoming an indispensable part of modern life. Developing such systems is highly relevant, but remains challenging, particularly in designing solutions that are both effective and economically viable.
The GAIA benchmark (Mialon et al., 2024) has become a key standard for evaluating LLM-based agent systems across diverse tasks, including web navigation, code execution, image reasoning, scientific QA, and multimodal challenges. Despite its introduction nearly two years ago, top-performing solutions still struggle with many tasks. Moreover, operational costs remain high: running all validation tasks with Hugging Face Agents (Roucher & Petrov, 2025) and GPT-4o costs ≈$200, underscoring the need for more affordable alternatives. Smaller models like GPT-4o mini significantly reduce expenses but suffer from steep drops in task success, making them insufficient. Open large models also pose challenges due to demanding infrastructure needs, while smaller open models, though cheaper to run, lack sufficient capabilities.
To address these challenges, we propose Knowledge Graph of Thoughts (KGoT), a novel AI assistant architecture that significantly reduces task execution costs while maintaining a high success rate (contribution #1). The central innovation of KGoT lies in its use of a knowledge graph (KG) (Singhal, 2012; Besta et al., 2024b) to represent knowledge relevant to a given task. A KG organizes information into triples, providing a structured representation of knowledge that small, cost-effective models can efficiently process. Hence, KGoT "turns the unstructured into the structured", i.e., KGoT turns often unstructured data such as website contents or PDF files into structured KG triples. This approach enhances the comprehension of task requirements, enabling even smaller models to achieve performance levels comparable to much larger counterparts, but at a fraction of the cost.
The KGoT architecture (contribution #2) implements this concept by iteratively constructing a KG from the task statement, incorporating tools as needed to gather relevant information. The constructed KG is kept in a graph store, serving as a repository of structured knowledge. Once sufficient information is gathered, the LLM attempts to solve the task by either directly embedding the KG in its context or querying the graph store for specific insights. This approach ensures that the LLM operates with a rich and structured knowledge base, improving its task-solving ability without incurring the high costs typically associated with large models. The architecture is modular and extensible towards different types of graph query languages and tools.
Our evaluation against top GAIA leaderboard baselines demonstrates its effectiveness and efficiency (contribution #3). KGoT with GPT-4o mini solves over 2× more tasks from the validation set than Hugging Face Agents with GPT-4o or GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs: from $187 with GPT-4o to roughly $5 with GPT-4o mini. KGoT's benefits generalize to other models, baselines, and benchmarks such as SimpleQA (Wei et al., 2024).
On top of that, KGoT reduces noise and simultaneously minimizes bias and improves fairness by externalizing reasoning into an explicit knowledge graph rather than relying solely on the LLM's internal generation (contribution #4). This ensures that key steps when resolving tasks are grounded in transparent, explainable, and auditable information.
2 Knowledge Graph of Thoughts
We first illustrate the key idea, namely, using a knowledge graph to structurally encode the task contents. Figure 1 shows an example task and its corresponding evolving KG.
2.1 What is a Knowledge Graph?
A knowledge graph (KG) is a structured representation of information that organizes knowledge into a graph-based format, allowing for efficient querying, reasoning, and retrieval. Formally, a KG consists of a set of triples, where each triple $(s,p,o)$ represents a relationship between two entities $s$ (subject) and $o$ (object) through a predicate $p$. For example, the triple $(\text{``Earth''},\text{``orbits''},\text{``Sun''})$ captures the fact that Earth orbits the Sun. Mathematically, a knowledge graph can be defined as a directed labeled graph $G=(V,E,L)$, where $V$ is the set of vertices (entities), $E \subseteq V \times V$ is the set of edges (relationships), and $L$ is the set of labels (predicates) assigned to the edges. Each entity or predicate may further include properties or attributes, enabling richer representation. Knowledge graphs are widely used in various domains, including search engines, recommendation systems, and AI reasoning, as they facilitate both efficient storage and complex queries.
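The triple formalism above can be sketched in a few lines of Python; the class below is a toy illustration of the $(s,p,o)$ pattern with wildcard matching, not KGoT's actual graph store:

```python
# Minimal triple store illustrating the (subject, predicate, object) formalism.
class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # set of (s, p, o) tuples

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def vertices(self):
        # V: every entity appearing as a subject or an object
        return {s for s, _, _ in self.triples} | {o for _, _, o in self.triples}

    def query(self, s=None, p=None, o=None):
        # Pattern matching with None as a wildcard, e.g. query(p="orbits")
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

kg = KnowledgeGraph()
kg.add("Earth", "orbits", "Sun")
kg.add("Moon", "orbits", "Earth")
print(kg.query(o="Sun"))  # [('Earth', 'orbits', 'Sun')]
```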
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Construction for Question Answering
### Overview
The image illustrates the process of building a knowledge graph to answer a question posed about a YouTube 360 VR video. The process starts with an input task statement, builds a basic knowledge graph, enhances it with additional data from the web and a YouTube transcriber, and finally extracts information to generate a response.
### Components/Axes
The diagram is divided into five main stages, arranged horizontally from left to right:
1. **Input task statement:** Contains the question to be answered.
2. **Knowledge Graph:** Initial knowledge graph built from the input statement.
3. **Knowledge Graph (enhanced):** Knowledge graph enhanced with additional data.
4. **Knowledge Graph (enhanced):** Further enhanced knowledge graph.
5. **Response:** The final answer generated.
Each stage is marked with a title and separated by arrows indicating the flow of information. The top of the diagram includes labels indicating the actions performed at each stage: "start building the knowledge graph (KG)", "query web for additional data", "invoke text inspector (YouTube transcriber)", and "extract info from graph and generate response".
### Detailed Analysis or Content Details
**1. Input task statement:**
* Text: "Input task statement (e.g., level 3 question from the GAIA Benchmark)"
* Question: "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?"
**2. Knowledge Graph:**
* Nodes: "Gollum (LotR)" and "Andy Serkis"
* Edge: "interpreted by" connecting Gollum to Andy Serkis.
**3. Knowledge Graph (enhanced):**
* Nodes: "Gollum (LotR)", "Andy Serkis", and "The Silmarillion", "We Are Stars"
* Edges: "interpreted by" connecting Andy Serkis to "The Silmarillion" and "We Are Stars", "interpreted by" connecting Gollum to Andy Serkis.
* "The Silmarillion" details: Type: JP4, Date: Jul, 2023, ID: d6xAaRv-UI
* "We Are Stars" details: Type: VR 360, Date: Mar, 2018, ID: tSHGAGEo
**4. Knowledge Graph (enhanced):**
* Nodes: "Gollum (LotR)", "Andy Serkis", "The Silmarillion", "We Are Stars"
* Edges: "interpreted by" connecting Andy Serkis to "The Silmarillion" and "We Are Stars", "interpreted by" connecting Gollum to Andy Serkis.
* "The Silmarillion" details: Type: JP4, Date: Jul, 2023, ID: d6xAaRv-UI
* "We Are Stars" details: Type: VR 360, Date: Mar, 2018, ID: tSHGAGEo
* Text: "...Dinosaurs dominated the earth for over a hundred million years..."
**5. Response:**
* Text: "In the YouTube 360 VR video 'We Are Stars', narrated by Andy Serkis, the number mentioned after the dinosaurs' first appearance is 100,000,000"
### Key Observations
* The knowledge graph evolves from a simple relationship between "Gollum" and "Andy Serkis" to a more complex structure including "The Silmarillion" and "We Are Stars".
* The "enhanced" knowledge graphs incorporate information about the type, date, and ID of the related media.
* The final response directly answers the question posed in the input statement.
### Interpretation
The diagram illustrates a question-answering system that leverages knowledge graphs. The system starts with a user's question and constructs a knowledge graph to represent the entities and relationships involved. It then enhances this graph by querying external sources (the web and a YouTube transcriber) to gather additional information. Finally, it extracts the relevant information from the enhanced graph to generate a concise and accurate answer to the user's question. The example demonstrates how knowledge graphs can be used to reason about complex information and provide meaningful answers to natural language queries.
</details>
Figure 1: The key idea behind Knowledge Graph of Thoughts (KGoT): transforming the representation of a task for an AI assistant from a textual form into a knowledge graph (KG). As an example, we use a Level-3 (i.e., highest difficulty) task from the GAIA benchmark. In order to solve the task, KGoT evolves this KG by adding relevant information that brings the task closer to completion. This is achieved by iteratively running various tools. Finally, the task is solved by extracting the relevant information from the KG, using, for example, a graph query or an LLM's inference process with the KG provided as part of the input prompt. More examples of KGs are in Appendix A.
2.2 Harnessing Knowledge Graphs for Effective AI Assistant Task Resolution
At the heart of KGoT is the process of transforming a task solution state into an evolving KG. The KG representation of the task is built from "thoughts" generated by the LLM. These "thoughts" are intermediate insights identified by the LLM as it works through the problem. Each thought contributes to expanding or refining the KG by adding vertices or edges that represent new information.
For example, consider the following Level 3 (i.e., highest difficulty) task from the GAIA benchmark: "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?" (see Figure 1 for an overview; more examples of constructed KGs are in Appendix A). Here, the KG representation of the task solution state has a vertex "Gollum (LotR)". Then, the thought "Gollum from Lord of the Rings is interpreted by Andy Serkis" results in adding a vertex for "Andy Serkis", and linking "Gollum (LotR)" to "Andy Serkis" with the predicate "interpreted by". Such integration of thought generation and KG construction creates a feedback loop where the KG continuously evolves as the task progresses, aligning the representation with problem requirements.
In order to evolve the KG task representation, KGoT iteratively interacts with tools and retrieves more information. For instance, the system might query the internet to identify videos narrated by Andy Serkis (e.g., "The Silmarillion" and "We Are Stars"). It can also use a YouTube transcriber tool to find their publication date. This iterative refinement allows the KG to model the current "state" of a task at each step, creating a more complete and structured representation of this task and bringing it closer to completion. Once the KG has been sufficiently populated with task-specific knowledge, it serves as a robust resource for solving the problem.
In addition to adding new graph elements, KGoT also supports other graph operations. This includes removing nodes and edges, used as a part of noise elimination strategies.
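A minimal sketch of this evolution, using the entities from the Figure 1 example (the predicate names and the removal step are illustrative, not the system's actual output):

```python
# Sketch of KGoT-style graph evolution: LLM thoughts and tool outputs become
# triples; triples deemed noisy or irrelevant can later be removed.
triples = set()

# Thought: "Gollum from Lord of the Rings is interpreted by Andy Serkis"
triples.add(("Gollum (LotR)", "interpreted by", "Andy Serkis"))

# Tool output (web query): videos narrated by Andy Serkis
triples.add(("Andy Serkis", "narrated", "We Are Stars"))
triples.add(("Andy Serkis", "narrated", "The Silmarillion"))

# Noise elimination: drop a triple that fails the March-2018 date constraint
triples.discard(("Andy Serkis", "narrated", "The Silmarillion"))

print(sorted(triples))
```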
2.3 Extracting Information from the KG
To accommodate different tasks, KGoT supports different ways to extract the information from the KG. Currently, we offer graph query languages or general-purpose languages; each of them can be combined with the so-called Direct Retrieval. First, one can use a graph query, prepared by the LLM in a language such as Cypher (Francis et al., 2018) or SPARQL (Pérez et al., 2009), to extract the answer to the task from the graph. This works particularly well for tasks that require retrieving specific patterns within the KG. Second, we also support general scripts prepared by the LLM in a general-purpose programming language such as Python. This approach, while not as effective as query languages for pattern matching, offers greater flexibility and may outperform the latter when a task requires, for example, traversing a long path in the graph. Third, in certain cases, once enough information is gathered into the KG, it may be more effective to directly paste the KG into the LLM context and ask the LLM to solve the task, instead of preparing a dedicated query or script. We refer to this approach as Direct Retrieval.
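The three extraction modes can be contrasted on a toy graph. The Cypher string, traversal helper, and prompt below are illustrative sketches only; the node names come from the Figure 1 example and the Cypher schema is an assumption:

```python
# Toy triple list standing in for the evolved KG.
triples = [
    ("Gollum (LotR)", "interpreted by", "Andy Serkis"),
    ("Andy Serkis", "narrated", "We Are Stars"),
]

# (1) Graph query: an LLM-generated Cypher string, sent to a backend such as
# Neo4j (schema assumed for illustration).
cypher = (
    "MATCH (c {name: 'Gollum (LotR)'})-[:INTERPRETED_BY]->(a)"
    "-[:NARRATED]->(v) RETURN v.name"
)

# (2) General-purpose script: a Python traversal following a path from a seed.
def follow(src, pred):
    return [o for s, p, o in triples if s == src and p == pred]

actor = follow("Gollum (LotR)", "interpreted by")[0]
video = follow(actor, "narrated")[0]

# (3) Direct Retrieval: serialize the KG into the LLM prompt and let the model
# answer from context.
prompt = "Task: ...\nKnowledge graph:\n" + "\n".join(map(str, triples))

print(video)  # We Are Stars
```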
The above schemes offer a tradeoff between accuracy, cost, and runtime. For example, when low latency is a priority, general-purpose languages should be used, as they provide an efficient, lightweight representation of the KG and offer rapid access and modification of graph data. When token cost is most important, one should avoid Direct Retrieval (which consumes many tokens, as it embeds the KG directly into the LLM context) and instead use either query or general-purpose languages, with a certain preference for the former, because generated queries tend to be shorter than scripts. Finally, when aiming to solve as many tasks as possible, one should experiment with all three schemes. As shown in the Evaluation section, these methods have complementary strengths: Direct Retrieval is effective for broad contextual understanding, while graph queries and scripts are better suited for structured reasoning.
2.4 Representing the KG
KGoT can construct three interoperable KG representations: Property graphs (used with graph query languages such as Cypher and systems such as Neo4j (Robinson et al., 2015)), RDF graphs (used with graph query languages such as SPARQL and systems such as RDF4J (Ben Mahria et al., 2021)), and adjacency-list graphs (Besta et al., 2018) (used with general-purpose languages such as Python and systems such as NetworkX (NetworkX Developers, 2025)).
Each representation supports a different class of analysis. The Property graph view facilitates analytics such as pattern matching, filtering, or motif queries directly on the evolving task-state graph. The RDF graph view facilitates reasoning over ontology constraints, schema validation, and SPARQL-based inference for missing links. The adjacency-list representation with NetworkX facilitates Python-based graph analytics, for example centrality measures, connected components, and clustering coefficients, all on the same KG snapshots.
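As an illustration of what the adjacency-list view enables, the plain-Python sketch below computes a degree and a reachability set; NetworkX provides these (plus centrality, components, clustering, and more) out of the box:

```python
# Adjacency-list view of the toy task-state graph from Figure 1.
adj = {
    "Gollum (LotR)": ["Andy Serkis"],
    "Andy Serkis": ["We Are Stars", "The Silmarillion"],
    "We Are Stars": [],
    "The Silmarillion": [],
}

def out_degree(v):
    return len(adj[v])

def reachable(src):
    # Iterative DFS: which entities can be reached from a seed vertex?
    seen, stack = set(), [src]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj.get(v, []))
    return seen

print(out_degree("Andy Serkis"))           # 2
print(sorted(reachable("Gollum (LotR)")))
```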
Appendix A contains examples of task-specific KGs, illustrating how their topology varies with the task domain (e.g., tree-like procedural chains vs. dense relational subgraphs in multi-entity reasoning).
2.5 Bias, Fairness, and Noise Mitigation through KG-Based Representation
KGoT externalizes and structures the reasoning process, which reduces noise, mitigates model bias, and improves fairness, because in each iteration both the outputs from tools and LLM thoughts are converted into triples and stored explicitly. Unlike opaque monolithic LLM generations, this fosters transparency and facilitates identifying biased inference steps. It also facilitates noise mitigation: new triples can be explicitly checked for the quality of their information content before being integrated into the KG, and existing triples can also be removed if they are deemed redundant (examples of such triples that have been found and removed are in Appendix B.6).
3 System Architecture
The modular and flexible KGoT architecture, pictured in Figure 2, consists of three main components: the Graph Store Module, the Controller, and the Integrated Tools, each playing a critical role in the task-solving process. Below, we provide a detailed description of each component and its role in the system. Additional details are in Appendix B (architecture) and in Appendix C (prompts).
3.1 Maintaining the Knowledge Graph with the Graph Store Module
A key component of the KGoT system is the Graph Store Module, which manages the storage and retrieval of the dynamically evolving knowledge graph that represents the task state. In order to harness graph queries, we use a graph database backend; in the current KGoT implementation, we test Cypher together with Neo4j (Robinson et al., 2015), an established graph database (Besta et al., 2023b; c), as well as SPARQL together with the RDF4J backend (Ben Mahria et al., 2021). Then, in order to support graph accesses using a general-purpose language, KGoT harnesses the NetworkX library (NetworkX Developers, 2025) and Python. Note that the extensible design of KGoT enables seamless integration of other backends and languages.
3.2 Managing the Workflow with the Controller Module
The Controller orchestrates the interactions between the KG and the tools. Upon receiving a user query, it iteratively interprets the task, determines the appropriate tools to invoke based on the KG state and task needs, and integrates tool outputs back into the KG. The Controller uses a dual-LLM architecture with a clear separation of roles: the LLM Graph Executor constructs and evolves the KG, while the LLM Tool Executor manages tool selection and execution.
The LLM Graph Executor determines the next steps after each iteration that constructs and evolves the KG. It identifies any missing information necessary to solve the task, formulates appropriate queries for the graph store interaction (retrieve/insert operations), and parses intermediate or final results for integration into the KG. It also prepares the final response to the user based on the KG.
The LLM Tool Executor operates as the executor of the plan devised by the LLM Graph Executor. It identifies the most suitable tools for retrieving missing information, considering factors such as tool availability, relevance, and the outcome of previous tool invocation attempts. For example, if a web crawler fails to retrieve certain data, the LLM Tool Executor might prioritize a different retrieval mechanism or adjust its queries. The LLM Tool Executor manages the tool execution process, including interacting with APIs, performing calculations, or extracting information, and returns the results to the LLM Graph Executor for further reasoning and integration into the KG.
3.3 Ensuring Versatile and Extensible Set of Integrated Tools
KGoT offers a hierarchical suite of tools tailored to diverse task needs. The Python Code Tool enables dynamic script generation and execution for complex computations. The LLM Tool supplements the controller's reasoning by integrating an auxiliary language model, enhancing knowledge access while minimizing hallucination risk. For multimodal inputs, the Image Tool supports image processing and extraction. Web-based tasks are handled by the Surfer Agent (based on the design by Hugging Face Agents (Roucher & Petrov, 2025)), which leverages tools like the Wikipedia Tool, granular navigation tools (PageUp, PageDown, Find), and SerpApi (SerpApi LLM, 2025) for search. Additional tools include the ExtractZip Tool for compressed files and the Text Inspector Tool for converting content from sources like MP3s and YouTube transcripts into Markdown. Finally, the user can seamlessly add a new tool by initializing it, passing in the logger object used for tool-use statistics, and appending the tool to the tool list via a Tool Manager object. We require all implemented tools to adhere to LangChain's BaseTool interface class. This way, the list of tools managed by the Tool Manager can be bound directly to the LLM Tool Executor via LangChain's bind_tools, further facilitating the addition of new tools.
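The registration flow can be sketched as follows. Note that `BaseTool` and `ToolManager` below are simplified stand-ins that only mirror the interface described above (LangChain's real base class lives in `langchain_core.tools`), and the ZIP-listing behavior is a toy version of the ExtractZip Tool:

```python
import os
import tempfile
import zipfile

# Schematic stand-in for LangChain's BaseTool interface.
class BaseTool:
    name: str = ""
    description: str = ""

    def run(self, *args, **kwargs):
        return self._run(*args, **kwargs)

class ExtractZipTool(BaseTool):
    name = "extract_zip"
    description = "Lists the members of a ZIP archive."

    def _run(self, path: str):
        with zipfile.ZipFile(path) as z:
            return z.namelist()

# Minimal Tool Manager: registered tools can then be bound to the LLM Tool
# Executor (KGoT does this step via LangChain's bind_tools).
class ToolManager:
    def __init__(self, logger=None):
        self.logger = logger   # collects tool-use statistics in KGoT
        self.tools = []

    def register(self, tool):
        self.tools.append(tool)
        return tool

manager = ToolManager()
tool = manager.register(ExtractZipTool())

# Demo: list the contents of a freshly created archive.
path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(path, "w") as z:
    z.writestr("notes.txt", "hello")
print(tool.run(path))  # ['notes.txt']
```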
3.4 Ensuring High-Performance & Scalability
The scalability optimizations used include (1) asynchronous execution using asyncio (Python Software Foundation, 2025b) to parallelize LLM tool invocations, mitigating I/O bottlenecks and reducing idle time, (2) graph operation parallelism by reformulating LLM-generated Cypher queries to enable concurrent execution of independent operations in a graph database, and (3) MPI-based distributed processing, which decomposes workloads into atomic tasks distributed across ranks using a work-stealing algorithm to ensure balanced computational load and scalability.
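Optimization (1) can be sketched with `asyncio.gather`; the tool names and sleep delays below are placeholders for real I/O-bound tool calls:

```python
import asyncio

# Independent tool calls dispatched concurrently: I/O-bound invocations
# (web requests, API calls) overlap instead of running serially.
async def call_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)           # stands in for an I/O-bound tool call
    return f"{name}: done"

async def run_tools():
    calls = [call_tool("web_search", 0.02),
             call_tool("transcriber", 0.01),
             call_tool("image_tool", 0.03)]
    return await asyncio.gather(*calls)  # results keep the submission order

results = asyncio.run(run_tools())
print(results)
```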
3.5 Ensuring System Robustness
Robustness is ensured with two established mechanisms, Self-Consistency (Wang et al., 2023b) (via majority voting) and LLM-as-a-Judge (Gu et al., 2025) (other strategies such as embedding-based stability are also applicable (Besta et al., 2025d)). With Self-Consistency, we query the LLM multiple times when deciding whether to insert more data into the KG or retrieve existing data, when deciding which tool to use, and when parsing the final solution. This approach reduces the impact of single-instance errors or inconsistencies in various parts of the KGoT architecture. LLM-as-a-Judge further reinforces the robustness, by directly employing the LLM agent to make these decisions based on generated reasoning chains.
Overall, both Self-Consistency and LLM-as-a-Judge have been shown to significantly enhance the robustness of prompting. For example, MT-Bench and Chatbot Arena show that strong judges (e.g., GPT-4 class) match human preferences with 80% agreement or more, on par with human-human agreement (Zheng et al., 2023). Prometheus and Prometheus-2 further demonstrate open evaluator LMs achieving the highest correlations with both humans and proprietary judges across direct-assessment and pairwise settings, and AlpacaEval has been validated against approximately 20K human annotations, addressing earlier concerns about reproducibility at scale. Similarly reliable gains have been shown for Self-Consistency (Wang et al., 2023b).
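Self-Consistency reduces to a majority vote over repeated samples of the same decision, e.g. the ENHANCE-vs-SOLVE choice made by the Controller:

```python
from collections import Counter

# Sample the same decision several times and keep the majority answer.
def majority_vote(samples):
    winner, _count = Counter(samples).most_common(1)[0]
    return winner

votes = ["ENHANCE", "SOLVE", "ENHANCE", "ENHANCE", "SOLVE"]
print(majority_vote(votes))  # ENHANCE
```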
3.6 Ensuring Layered Error Containment & Management
To manage LLM-generated syntax errors, KGoT includes LangChain's JSON parsers that detect syntax issues. When a syntax error is detected, the system first attempts to correct it by adjusting the problematic syntax using different encoders, such as 'unicode_escape' (Python Software Foundation, 2025a). If the issue persists, KGoT employs a retry mechanism (three attempts by default) that uses the LLM to rephrase the query/command and attempts to regenerate its output. If the error still persists, the system logs it for further analysis, bypasses the problematic query, and continues with other iterations.
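The layered repair described above can be sketched as follows; the `regenerate` callback stands in for the LLM rephrase-and-regenerate step, and the control flow is a simplification of the system's actual parser stack:

```python
import json

# Layered containment: parse, attempt a 'unicode_escape' re-encode on
# failure, then fall back to (up to three) LLM regenerations.
def parse_with_repair(text, regenerate, max_retries=3):
    for _attempt in range(max_retries):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            try:
                fixed = text.encode().decode("unicode_escape")
                return json.loads(fixed)
            except (json.JSONDecodeError, UnicodeDecodeError):
                text = regenerate(text)  # ask the LLM to rephrase/regenerate
    return None  # give up: log the error and skip this query

# Toy 'LLM' that returns valid JSON on the retry.
out = parse_with_repair("not json", lambda t: '{"answer": 42}')
print(out)  # {'answer': 42}
```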
To handle API- and system-related errors, such as OpenAI error code 500, we employ exponential backoff, implemented using the tenacity library (Tenacity Developers, 2025a). Additionally, KGoT includes comprehensive logging systems as part of its error management framework. These systems track the errors encountered during system operation, providing valuable data that can be easily parsed and analyzed (e.g., snapshots of the knowledge graphs or responses from third-party APIs).
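The backoff policy (which tenacity provides in KGoT) boils down to the following loop; `RuntimeError` here stands in for a transient API failure such as an HTTP 500, and the delays are shortened for illustration:

```python
import time

# Exponential backoff: retry a transient failure with doubling delays.
def with_backoff(call, max_attempts=5, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:                 # stands in for, e.g., an HTTP 500
            if attempt == max_attempts - 1:
                raise                        # out of attempts: propagate
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...

attempts = {"n": 0}

def flaky():
    # Fails twice, then succeeds, mimicking a transient server error.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("server error 500")
    return "ok"

result = with_backoff(flaky)
print(result)  # ok (after two failed attempts)
```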
The Python Executor tool, a key component of the system, is containerized to ensure secure execution of LLM-generated code. This tool is designed to run code with strict timeouts and safeguards, preventing potential misuse or resource overconsumption.
3.7 Implementation Details
KGoT employs Docker (Docker Inc., 2025) and Sarus (Benedicic et al., 2019) for containerization, enabling a consistent and isolated runtime environment for all components. We containerize critical modules such as the KGoT controller, the Neo4j knowledge graph, and integrated tools (e.g., the Python Executor tool for safely running LLM-generated code with timeouts). Here, Docker provides a widely adopted containerization platform for local and cloud deployments that guarantees consistency between development and production environments. Sarus, a specialized container platform designed for high-performance computing (HPC) environments, extends KGoTâs portability to HPC settings where Docker is typically unavailable due to security constraints. This integration allows KGoT to operate efficiently in HPC environments, leveraging their computational power.
KGoT also harnesses LangChain (LangChain Inc., 2025a), an open-source framework specifically designed for creating and orchestrating LLM-driven applications. LangChain offers a comprehensive suite of tools and APIs that simplify the complexities of managing LLMs, including prompt engineering, tool integration, and the coordination of LLM outputs.
4 System Workflow
<details>
<summary>x2.png Details</summary>

### Visual Description
## Flow Diagram: Knowledge Graph of Thoughts
### Overview
The image presents two flow diagrams illustrating the Knowledge Graph of Thoughts (KGoT) process. The first diagram provides a high-level overview, while the second offers a detailed view of the process. Both diagrams depict the flow of information and actions between components such as the Graph Store, Controller, and Integrated Tools, with a focus on the role of Large Language Models (LLMs).
### Components/Axes
**High-Level Overview:**
* **Title:** Knowledge Graph of Thoughts (high-level overview)
* **Components:**
* Graph Store: Contains a "Knowledge graph" and performs "Knowledge extraction method". It connects to a "Storage backend (e.g., a graph database)".
* Controller: Contains "LLM Graph Executor" and "LLM Tool Executor".
* Integrated Tools
* **Flow:** The flow is generally left to right, starting from the Graph Store, moving to the Controller, and then to Integrated Tools.
* **Input/Output:** "user question" and "KGOT response" are shown at the top.
**Detailed View:**
* **Title:** Knowledge Graph of Thoughts (detailed view)
* **Components:**
* User question
* Graph Store: Contains a "Knowledge graph" and is connected to a "Backend".
* Backend: Includes "Graph database (e.g., Neo4j)", "Lightweight backend (e.g., NetworkX)".
* LLM Graph Executor: Contains steps for processing the graph.
* Controller: Contains "LLM Tool Executor".
* Integrated Tools: Includes various tools like "Python code & math tool", "Image tool", "ExtractZIP tool", "Text inspector tool", "MDConverter", "YouTube transcriber", "Surfer", "Browser", "Wikipedia tool", "Find tool", "Visit tool", and "Active search".
* **Flow:** The flow is more complex, involving loops and conditional branches within the LLM Graph Executor.
* **Annotations:** "LLM" indicates that a given step is conducted using an LLM or that a given tool extensively uses an LLM.
### Detailed Analysis
**High-Level Overview:**
* The diagram shows a simplified flow from knowledge graph storage and extraction to the controller, which then interacts with integrated tools.
**Detailed View:**
1. **User question:** The process starts with a user question.
2. **LLM Graph Executor:**
* Step 1: "New graph state"
* Step 2: "Max. iterations?" (Number of iterations is a user parameter). If "no", the process moves to "Determine the next step".
* Step 3: If "yes" from "Max. iterations?", the process loops back to "Determine the next step".
* "Determine the next step" leads to "SOLVE or ENHANCE? (majority vote)".
* Step 6: "Run ENHANCE"
* Step 7: "Run SOLVE (Generate solution)"
* Step 8: "Apply additional mathematical processing"
* Step 9: "Parse solution"
3. **LLM Tool Executor:**
* Step 4: "Define tool calls"
* Step 5: "Run tool calls"
4. **Backend:**
* Graph database (e.g., Neo4j): Knowledge extraction using a graph query language.
* Lightweight backend (e.g., NetworkX): Knowledge extraction using a general-purpose programming language.
5. **Integrated Tools:**
* Python code & math tool
* Image tool
* ExtractZIP tool
* Text inspector tool: MDConverter, YouTube transcriber
* Surfer
* Browser: Wikipedia tool, Find tool, Visit tool, Active search
6. **KGOT response:** The process ends with a KGOT response.
### Key Observations
* The detailed view provides a granular breakdown of the LLM Graph Executor and LLM Tool Executor processes.
* The use of LLMs is highlighted throughout the process, indicated by the "LLM" annotations.
* The diagram emphasizes the iterative nature of the LLM Graph Executor, with loops for determining the next step and maximizing iterations.
* The Integrated Tools section showcases a variety of tools used in the process, including those for code execution, image processing, text inspection, and web browsing.
### Interpretation
The diagrams illustrate the architecture and workflow of a Knowledge Graph of Thoughts system. The high-level overview provides a simplified view of the system's components and their interactions, while the detailed view offers a more in-depth understanding of the process. The system leverages LLMs to process knowledge graphs, execute tools, and generate responses to user questions. The iterative nature of the LLM Graph Executor suggests a process of refinement and optimization, where the system continuously improves its understanding and response based on the available information. The integration of various tools highlights the system's ability to leverage external resources and capabilities to enhance its performance. The choice between "SOLVE" and "ENHANCE" suggests a decision-making process where the system selects the most appropriate strategy based on the specific task and available resources.
</details>
Figure 2: Architecture overview of KGoT (top part) and the design details combined with the workflow (bottom part).
We show the workflow in the bottom part of Figure 2. The workflow begins when the user submits a problem to the system. The first step is to verify whether the maximum number of iterations allowed for solving the problem has been reached. If the iteration limit is exceeded, the system no longer tries to gather additional information and insert it into the KG, but instead returns a solution based on the existing data in the KG. Otherwise, a majority vote (over several replies from the LLM) decides whether the system should proceed with the Enhance pathway (using tools to generate new knowledge) or directly proceed to the Solve pathway (gathering the existing knowledge in the KG and using it to deliver the task solution).
The Enhance Pathway. If the majority vote indicates the Enhance pathway, the next step involves determining the tools necessary for completing the Enhance operation. The system then orchestrates the appropriate tool calls based on the KG state. Once the required data from the tools is collected, the system generates one or more Enhance queries to modify the KG appropriately. Each Enhance query is executed and its output is validated. If an error or invalid value is returned, the system attempts to fix the query, retrying a specified number of times. If the retries fail, the query is discarded and the operation moves on. After processing the Enhance operation, the system increments the iteration count and continues until the KG is sufficiently expanded or the iteration limit is reached. This path ensures that the knowledge graph is enriched with relevant and accurate information, enabling the system to progress toward a solution effectively.
The Solve Pathway. If the majority vote directs the system to the Solve pathway, the system executes multiple Solve operations iteratively. If an execution produces an invalid value or error three times in a row, the system asks the LLM to correct the issue by recreating the query, which is then re-executed. If errors persist after three such retries, the query is regenerated entirely, disregarding the faulty result, and the process restarts. After the Solve operation returns the result, final parsing is applied, which includes potential mathematical processing to resolve any remaining calculations and refining the output (e.g., formatting the results appropriately).
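The control loop described above can be sketched in Python. This is a minimal sketch: all class, function, and method names (`decide_next_step`, `plan_tool_calls`, and so on) are illustrative assumptions, not the actual KGoT API.

```python
# Minimal sketch of the KGoT Enhance/Solve control loop.
# All names on the `llm` object are illustrative assumptions.
from collections import Counter

MAX_ITERATIONS = 5  # iteration limit before forcing a Solve
NUM_VOTES = 3       # number of LLM replies in the majority vote


def majority_vote(llm, kg, task):
    """Ask the LLM several times whether to ENHANCE or SOLVE next."""
    votes = [llm.decide_next_step(kg, task) for _ in range(NUM_VOTES)]
    return Counter(votes).most_common(1)[0][0]


def run(llm, kg, task, tools):
    for _ in range(MAX_ITERATIONS):
        if majority_vote(llm, kg, task) == "SOLVE":
            break
        # Enhance pathway: call tools, then insert their outputs into the KG.
        for call in llm.plan_tool_calls(kg, task, tools):
            output = call.execute()
            for query in llm.generate_enhance_queries(kg, output):
                kg.apply(query)  # validated, retried, or discarded on error
    # Solve pathway: answer with whatever knowledge the KG holds now.
    return llm.solve(kg, task)
```

The loop mirrors the workflow: the majority vote gates each iteration, Enhance expands the KG via tools, and Solve is reached either by vote or by exhausting the iteration budget.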
5 Evaluation
We now show the advantages of KGoT over the state of the art. Additional results and full details on the evaluation setup are in Appendix D.
Comparison Baselines. We focus on the Hugging Face (HF) Agents (Roucher & Petrov, 2025), the most competitive scheme in the GAIA benchmark for the hardest level 3 tasks with the GPT-4 class of models. We also compare to two agentic frameworks, namely GPTSwarm (Zhuge et al., 2024) (a representative graph-enhanced multi-agent scheme) and Magentic-One (Fourney et al., 2024), an AI agent equipped with a central orchestrator and multiple integrated tool agents. Next, to evaluate whether database search outperforms graph-based knowledge extraction, we also consider two retrieval-augmented generation (RAG) (Lewis et al., 2020) schemes: a simple RAG scheme and GraphRAG (Edge et al., 2025). Both RAG baselines use the same tool-generated knowledge, chunking data at tool-call granularity (i.e., a chunk corresponds to the output of an individual tool call). Simple RAG constructs a vector database from these tool outputs, while GraphRAG instead models the tool outputs as a static KG of entities and relations, enabling retrieval via graph traversal. Finally, we use Zero-Shot schemes, where a model answers without any additional agent framework.
KGoT variants. First, we experiment with graph query languages vs. general-purpose languages, cf. Section 2.3. For each option, we vary how the Solve operation is executed, either by having the LLM send a request to the backend (a Python script for NetworkX and a Cypher/SPARQL query for Neo4j/RDF4J) or by directly asking the LLM to infer the answer based on the KG (Direct Retrieval (DR)). We experiment with different query languages (Cypher vs. SPARQL). We also consider "fusion" runs, which simulate the effect of KGoT runs with both graph backends available simultaneously (or with both Solve operation variants harnessed for each task). Fusion runs incur only negligible additional storage overhead because the generated KGs are small (up to several hundred nodes). Finally, we experiment with different tool sets. To focus on the differences coming from harnessing the KG, we reuse several utilities from AutoGen (Wu et al., 2024), such as Browser and MDConverter, and tools from HF Agents, such as the Surfer Agent, web browsing tools, and the Text Inspector.
Considered Metrics. We focus primarily on the number of solved tasks as well as token costs ($). Unless stated otherwise, we report single-run results due to budget reasons.
Considered Datasets. We use the GAIA benchmark (Mialon et al., 2024), focusing on the validation set (165 tasks) for budgetary reasons and also because it comes with ground-truth answers. The considered tasks are highly diverse in nature; many require parsing websites or analyzing PDF, image, and audio files. We focus on GAIA as it is currently the most comprehensive benchmark for general-purpose AI assistants, covering diverse domains such as web navigation, code execution, image reasoning, scientific QA, and multimodal tasks. We further evaluate on SimpleQA (Wei et al., 2024), a factuality benchmark of 4,326 questions, of which we sample 10% for budgetary reasons. The dataset spans diverse topics and emphasizes single, verifiable answers, making it effective for assessing factual accuracy.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Bar Charts: Performance and Cost Comparison of Different Models
### Overview
The image presents two bar charts comparing the performance and average cost of different models, including GPT-4o, GPT-4o mini, and various knowledge graph-enhanced models (KGOT and KGOT fusion) and baselines. The left chart displays the number of solved tasks (higher is better), while the right chart shows the average cost in dollars (lower is better) on a logarithmic scale.
### Components/Axes
**Left Chart (Number of Solved Tasks):**
* **Title:** Number of Solved Tasks (the higher the better)
* **Y-axis:** Number of Solved Tasks, ranging from 0 to 70.
* **X-axis:** Different models and configurations: GPT-4o, GPT-4o mini, Neo4j + Query, NetworkX + Query, Neo4j + DR, NetworkX + DR, RDF4J + Query, Neo4j + NetworkX (Query + DR), Simple RAG, GraphRAG, GPTSwarm, Magentic-One, HF GPT-4o mini, HF GPT-4o.
* **Bar Colors (Legend, located at the top-center of the chart):**
* Level 1: Light Blue
* Level 2: Blue
* Level 3: Purple
* **Sections:** Zero-Shot, KGOT, KGOT (fusion), Baselines.
* **Maximum Solved Tasks:** Indicated by an arrow pointing to the top of the GPT-4o bar in the Baselines section, labeled "Max: 71".
**Right Chart (Average Cost):**
* **Title:** Average Cost ($) (the lower the better)
* **Y-axis:** Average Cost ($) on a logarithmic scale, ranging from 10^-3 to 10^0 (0.001 to 1).
* **X-axis:** Same models and configurations as the left chart.
* **Bar Color:** Purple
* **Sections:** Zero-Shot, KGOT, KGOT (fusion), Baselines.
* **Maximum Cost:** Indicated by an arrow pointing to the top of the GPT-4o bar in the Baselines section, labeled "Max: 3.403$".
### Detailed Analysis
**Left Chart (Number of Solved Tasks):**
* **GPT-4o:**
* Zero-Shot: Level 1: 10, Level 2: 17, Level 3: 2, Total: 29
* Baselines: Level 1: 22, Level 2: 31, Total: 53
* **GPT-4o mini:**
* Zero-Shot: Level 1: 4, Level 2: 13, Total: 17
* KGOT (fusion): Level 1: 20, Level 2: 18, Level 3: 1, Total: 39
* **Neo4j + Query:**
* KGOT: Level 1: 21, Level 2: 18, Level 3: 1, Total: 40
* **NetworkX + Query:**
* KGOT: Level 1: 21, Level 2: 16, Level 3: 3, Total: 40
* **Neo4j + DR:**
* KGOT: Level 1: 21, Level 2: 21, Total: 42
* KGOT (fusion): Level 1: 34, Level 2: 33, Level 3: 4, Total: 71
* **NetworkX + DR:**
* KGOT: Level 1: 20, Level 2: 18, Level 3: 2, Total: 40
* KGOT (fusion): Level 1: 29, Level 2: 24, Level 3: 4, Total: 57
* **RDF4J + Query:**
* KGOT: Level 1: 20, Level 2: 15, Level 3: 1, Total: 36
* **Neo4j + NetworkX (Query + DR):**
* KGOT (fusion): Level 1: 27, Level 2: 28, Level 3: 2, Total: 57
* **Simple RAG:**
* Baselines: Level 1: 18, Level 2: 15, Level 3: 2, Total: 35
* **GraphRAG:**
* Baselines: Level 1: 10, Level 2: 13, Total: 23
* **GPTSwarm:**
* Baselines: Level 1: 13, Level 2: 13, Total: 26
* **Magentic-One:**
* Baselines: Level 1: 13, Level 2: 18, Level 3: 1, Total: 32
* **HF GPT-4o mini:**
* Baselines: Level 1: 14, Level 2: 20, Total: 34
**Right Chart (Average Cost):**
* **GPT-4o:**
* Zero-Shot: 0.017$
* Baselines: 3.403$
* **GPT-4o mini:**
* Zero-Shot: 0.001$
* KGOT (fusion): 0.696$
* **Neo4j + Query:**
* KGOT: 0.098$
* **NetworkX + Query:**
* KGOT: 0.135$
* **Neo4j + DR:**
* KGOT: 0.119$
* KGOT (fusion): 0.145$
* **NetworkX + DR:**
* KGOT: 0.148$
* KGOT (fusion): 0.098$
* **RDF4J + Query:**
* KGOT: 0.091$
* **Neo4j + NetworkX (Query + DR):**
* KGOT (fusion): 0.146$
* **Simple RAG:**
* Baselines: 0.232$
* **GraphRAG:**
* Baselines: 0.006$
* **GPTSwarm:**
* Baselines: 0.696$
* **Magentic-One:**
* Baselines: 0.258$
* **HF GPT-4o mini:**
* Baselines: 0.258$
### Key Observations
* **Performance:** The Neo4j + DR model under KGOT (fusion) achieves the highest number of solved tasks (71), matching the maximum possible. GPT-4o also performs well, especially in the Baselines section.
* **Cost:** GPT-4o mini in the Zero-Shot setting has the lowest average cost (0.001$). GPT-4o in the Baselines section has the highest cost (3.403$).
* **Trade-off:** There is a clear trade-off between performance and cost. Models with higher performance tend to have higher costs.
* **KGOT Fusion:** KGOT fusion models generally outperform their KGOT counterparts in terms of the number of solved tasks.
### Interpretation
The data suggests that KGOT fusion, particularly with Neo4j + DR, significantly enhances the performance of the models in terms of the number of solved tasks. However, this performance comes at a cost, as these models are not the cheapest. GPT-4o mini offers a low-cost solution, but its performance is lower than the KGOT fusion models.
The choice of model depends on the specific requirements and constraints of the application. If performance is the primary concern, KGOT fusion with Neo4j + DR is a good option. If cost is a major factor, GPT-4o mini in the Zero-Shot setting might be more suitable. GPT-4o in the Baselines section is the most expensive and does not offer a proportionally higher number of solved tasks compared to other models, indicating a less efficient trade-off.
The different levels (Level 1, Level 2, Level 3) in the left chart likely represent different difficulty levels or types of tasks. The distribution of these levels across different models provides insights into their strengths and weaknesses in handling various types of tasks.
</details>
Figure 3: Advantages of different variants of KGoT over other baselines (Hugging Face Agents using both GPT-4o-mini and GPT-4o, Magentic-One, GPTSwarm, two RAG baselines, Zero-Shot GPT-4o mini, and Zero-Shot GPT-4o) on the validation dataset of the GAIA benchmark. DR stands for Direct Retrieval. The used model is GPT-4o mini unless noted otherwise.
5.1 Advantages of KGoT
Figure 3 shows the number of solved tasks (left) as well as the average cost per solved task (right) for different KGoT variants and all comparison baselines. While we focus on GPT-4o mini, we also show the results for HF Agents and Zero-Shot with GPT-4o. Additionally, we show the Pareto front in Figure 11 for the multidimensional optimization problem of improving accuracy (i.e., reducing failed tasks) and lowering cost. All variants of KGoT solve a greater number of tasks (up to 9 more) compared to HF Agents while also being more cost-efficient (42% to 62% lower costs). The key reason for the KGoT advantages stems from harnessing the knowledge graph-based representation of the evolving task state.
The ideal fusion runs of Neo4j and NetworkX solve an even greater number of tasks (57 for both) than the single runs, have a lower average cost (up to 62% lower than HF Agents), and even outperform HF Agents with GPT-4o. The fusion of all combinations of backend and solver types solves by far the highest number of tasks (71), more than twice as many as HF Agents, while also exhibiting 44% lower cost than HF Agents. The direct Zero-Shot use of GPT-4o mini and GPT-4o has the lowest average cost per solved task (just $0.0013 and $0.0164, respectively), making it the most cost-effective; however, this approach solves only 17 and 29 tasks, respectively. GPTSwarm is cheaper than KGoT, but also comes with fewer solved tasks (only 26). While Magentic-One is a capable agent with a sophisticated architecture, its performance with GPT-4o mini is limited, solving 31 tasks correctly, while also exhibiting significantly higher costs. Simple RAG yields somewhat higher costs than KGoT and solves fewer tasks (35). GraphRAG performs even worse, solving only 23 tasks and incurring even higher cost. While neither RAG baseline can invoke new tools to gather missing information (reducing accuracy and adaptability), GraphRAG's worse performance is additionally due to the fact that it primarily targets query summarization rather than tasks as diverse as those tested by GAIA. Overall, KGoT achieves the best cost-accuracy tradeoff, being both highly affordable and very effective.
5.2 Analysis of Methods for Knowledge Extraction
We explore different methods of extracting knowledge. Overall, in many situations, these methods have complementary strengths and weaknesses.
Graph queries with Neo4j excel at tasks such as counting patterns. Yet, Cypher queries can be difficult to generate correctly, especially for graphs with more nodes and edges. Despite this, KGoT's Cypher queries are able to solve many new GAIA tasks that could not be solved without harnessing Cypher. SPARQL (Pérez et al., 2009) + RDF4J (Eclipse Foundation, 2025) performs slightly worse (36 tasks solved) than Cypher + Neo4j (existing literature also indicates that LLMs have difficulties formulating effective SPARQL queries (Emonet et al., 2024; Mecharnia & d'Aquin, 2025)).
Python with NetworkX offers certain advantages over Neo4j by eliminating the need for a separate database server, making it a lightweight choice for the KG. Moreover, NetworkX computations are fast and efficient for small to medium-sized graphs, without the overhead of database transactions. We observe that in cases where Neo4j-based implementations struggle, NetworkX-generated graphs tend to be more detailed and provide richer vertex properties and relationships. This is likely due to the greater flexibility of Python code over Cypher queries for graph insertion, enabling more fine-grained control over vertex attributes and relationships. Another reason may be that Python is more heavily represented than Cypher in the training data of the respective models.
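The contrast between the two extraction backends can be illustrated on a toy graph. The schema below (Paper nodes connected by cites edges) is a hypothetical example, not taken from KGoT; the Cypher query in the comment shows what the Neo4j backend would emit for the same extraction.

```python
# Toy KG with two paper nodes and one citation edge (illustrative schema).
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("p1", type="Paper", year=2024)
g.add_node("p2", type="Paper", year=2023)
g.add_edge("p1", "p2", relation="cites")

# Neo4j backend: the LLM would emit a Cypher query, e.g.
#   MATCH (:Paper)-[:cites]->(cited:Paper) RETURN count(cited)
# NetworkX backend: the LLM emits equivalent general-purpose Python:
num_cited = sum(
    1
    for _, v, data in g.edges(data=True)
    if data["relation"] == "cites" and g.nodes[v]["type"] == "Paper"
)
print(num_cited)  # 1
```

The Python variant trades conciseness for flexibility: arbitrary computation can be interleaved with the traversal, which matches the observation above about finer-grained control over attributes and relationships.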
Our analysis of failed tasks indicates that, in many cases, the KG contains the required data, but the graph query fails to extract it. In such scenarios, Direct Retrieval, where the entire KG is included in the model's context, performs significantly better by bypassing query composition issues. However, Direct Retrieval demonstrates lower accuracy in cases requiring structured, multi-step reasoning.
We also found that Direct Retrieval excels at extracting dispersed information but struggles with structured queries, whereas graph queries are more effective for structured reasoning but can fail when the LLM generates incorrect query formulations. Although both Cypher and general-purpose queries are occasionally erroneous, Python scripts require more frequent corrections because they are often longer and more error-prone. However, despite the higher number of corrections, the LLM is able to fix Python code more easily than Cypher queries, often succeeding after a single attempt. During retrieval, the LLM frequently embeds necessary computations directly within the Python scripts while annotating its reasoning through comments, improving transparency and interpretability.
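Direct Retrieval, as described above, amounts to serializing the whole KG into the model's context instead of composing a query. A minimal sketch, assuming a simple triple-style text serialization (the exact format KGoT uses may differ):

```python
# Sketch of Direct Retrieval: serialize the entire KG into plain text
# so it can be placed in the LLM's context. The format is an assumption.
def serialize_kg(nodes, edges):
    lines = [f"({n}) {k}={v}"
             for n, props in nodes.items()
             for k, v in props.items()]
    lines += [f"({u}) -[{r}]-> ({v})" for u, r, v in edges]
    return "\n".join(lines)

nodes = {"Paris": {"type": "City"}, "France": {"type": "Country"}}
edges = [("Paris", "capital_of", "France")]
context = serialize_kg(nodes, edges)
# The prompt would then embed `context` followed by the task question,
# letting the model read the graph directly instead of querying it.
```

This bypasses query composition entirely, which explains why it helps when queries fail, but it forces the model to do all reasoning in-context, consistent with its weaker results on structured, multi-step tasks.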
5.3 Advantages on the GAIA Test Set
Table 1: Comparison of KGoT with other current state-of-the-art open-source agents on the full GAIA test set. The baseline data, including for TapeAgent (Bahdanau et al., 2024), of the number of solved tasks is obtained through the GAIA Leaderboard (Mialon et al., 2025). We highlight the best performing scheme in a given category in bold. Model: GPT-4o mini.
| Agents | All | L1 | L2 | L3 |
| --- | --- | --- | --- | --- |
| GPTSwarm | 33 | 15 | 15 | 3 |
| Magentic-One | 43 | 22 | 18 | 3 |
| TapeAgent | 66 | 28 | 35 | 3 |
| Hugging Face Agents | 68 | 30 | 34 | 4 |
| KGoT (fusion) | 73 | 33 | 36 | 4 |
Furthermore, our approach achieves state-of-the-art performance on the GAIA test set with the GPT-4o mini model. The results are shown in Table 1, underscoring its effectiveness across all evaluation levels. The test set consists of 301 tasks (93 level 1 tasks, 159 level 2 tasks and 49 level 3 tasks).
5.4 Advantages beyond GAIA Benchmark
We also evaluate KGoT as well as HF Agents and GPTSwarm on a 10% sample (433 tasks) of the SimpleQA benchmark (detailed results are in Appendix D.1). KGoT performs best, solving 73.21% of tasks, while HF Agents and GPTSwarm exhibit reduced accuracy (66.05% and 53.81%, respectively). KGoT incurs only $0.018 per solved task, less than a third of the HF Agents cost ($0.058), while being somewhat more expensive than GPTSwarm ($0.00093).
We further evaluate KGoT on the entire SimpleQA benchmark (due to the very high costs of running all SimpleQA questions, we limit the full benchmark evaluation to KGoT). We observe no degradation in performance, with a 70.34% accuracy rate. When compared against the official F1 scores of various OpenAI and Claude models (OpenAI, 2025), KGoT outperforms all the available results. Specifically, our design achieves a 71.06% F1 score, significantly surpassing the 49.4% of the top-performing reasoning model and improving upon all mini reasoning models by at least 3.5×. Furthermore, KGoT exceeds the performance of all standard OpenAI models, from GPT-4o's 40% F1 score to the best-scoring closed-source model, GPT-4.5, with 62.5%. More detailed results are available in Appendix D.1.
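For reference, one common formulation of such a factuality F score is the harmonic mean of overall accuracy and accuracy-given-attempted, computed over correct, incorrect, and not-attempted answers. The sketch below is our reading of a SimpleQA-style metric, not the official grader:

```python
# Hedged sketch of a SimpleQA-style F score (assumed formulation):
# harmonic mean of overall accuracy (correct / total) and
# accuracy-given-attempted (correct / attempted).
def simpleqa_f1(correct: int, incorrect: int, not_attempted: int) -> float:
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    recall = correct / total if total else 0.0        # overall accuracy
    precision = correct / attempted if attempted else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g., 70 correct, 29 incorrect, 1 not attempted out of 100 questions
print(round(simpleqa_f1(70, 29, 1), 3))
```

Under this formulation, declining to answer hurts less than answering incorrectly, which matches the benchmark's emphasis on single, verifiable answers.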
5.5 Ensuring Scalability and Mitigating Bottlenecks
The primary bottleneck in KGoT arises from I/O-bound and latency-sensitive LLM tool invocations (e.g., web browsing, text parsing), which account for 72% of the runtime. KGoT mitigates this through asynchronous execution and graph operation parallelism, as discussed in Section 3.4. A detailed breakdown of the runtime is reported in Appendix D.3. Figure 10 confirms KGoT's scalability: increasing the degree of parallelism consistently reduces the runtime. Moreover, due to the effective knowledge extraction process and the nature of the tasks considered, none of the tasks require large KGs. The maximum graph size that we observed was 522 nodes, which is orders of magnitude below any scalability concerns.
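The asynchronous execution of I/O-bound tool calls can be sketched with Python's asyncio; `fetch_page` is a hypothetical stand-in for a real tool call such as web browsing, not KGoT's actual tool interface:

```python
# Sketch: issue latency-bound tool calls concurrently instead of serially,
# so total wall time approaches the slowest call rather than their sum.
import asyncio

async def fetch_page(url: str) -> str:
    await asyncio.sleep(0.1)  # stands in for network / tool latency
    return f"content of {url}"

async def run_tools(urls):
    # asyncio.gather schedules all coroutines concurrently.
    return await asyncio.gather(*(fetch_page(u) for u in urls))

results = asyncio.run(run_tools(["a", "b", "c"]))
print(len(results))  # 3
```

With serial execution the three calls above would take roughly 0.3 s; concurrently they finish in roughly 0.1 s, which is the effect the runtime reduction in Figure 10 reflects at larger scale.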
5.6 Impact from Various Design Decisions
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Number of Solved Tasks by Different Models
### Overview
The image is a bar chart comparing the performance of different language models on a set of tasks. The y-axis represents the number of solved tasks, with higher values indicating better performance. The x-axis represents different language models. The chart compares four different methods: GPTSwarm, HF Agents, KGOT (Neo4j + Query), and Zero-Shot.
### Components/Axes
* **Y-axis:** "Number of Solved Tasks (the higher the better)". The scale ranges from 0 to 50, with tick marks at intervals of 10.
* **X-axis:** Categorical axis representing different language models:
* Qwen2.5-32B
* DeepSeek-R1-70B
* GPT-4o mini
* DeepSeek-R1-32B
* QWQ-32B
* DeepSeek-R1-7B
* DeepSeek-R1-1.5B
* Qwen2.5-72B
* Qwen2.5-7B
* Qwen2.5-1.5B
* **Legend:** Located at the top-left of the chart.
* GPTSwarm (light pink)
* HF Agents (light purple)
* KGOT (Neo4j + Query) (blue)
* Zero-Shot (gray with diagonal lines)
### Detailed Analysis
Here's a breakdown of the number of solved tasks for each model and method:
* **Qwen2.5-32B:**
* GPTSwarm: 29
* HF Agents: 19
* KGOT (Neo4j + Query): 26
* Zero-Shot: 15
* **DeepSeek-R1-70B:**
* GPTSwarm: 10
* HF Agents: 16
* KGOT (Neo4j + Query): 22
* Zero-Shot: 0
* **GPT-4o mini:**
* GPTSwarm: 26
* HF Agents: 35
* KGOT (Neo4j + Query): 40
* Zero-Shot: 17
* **DeepSeek-R1-32B:**
* GPTSwarm: 6
* HF Agents: 17
* KGOT (Neo4j + Query): 21
* Zero-Shot: 14
* **QWQ-32B:**
* GPTSwarm: 0
* HF Agents: 16
* KGOT (Neo4j + Query): 20
* Zero-Shot: 0
* **DeepSeek-R1-7B:**
* GPTSwarm: 2
* HF Agents: 3
* KGOT (Neo4j + Query): 6
* Zero-Shot: 13
* **DeepSeek-R1-1.5B:**
* GPTSwarm: 0
* HF Agents: 0
* KGOT (Neo4j + Query): 2
* Zero-Shot: 0
* **Qwen2.5-72B:**
* GPTSwarm: 27
* HF Agents: 38
* KGOT (Neo4j + Query): 39
* Zero-Shot: 19
* **Qwen2.5-7B:**
* GPTSwarm: 11
* HF Agents: 12
* KGOT (Neo4j + Query): 12
* Zero-Shot: 9
* **Qwen2.5-1.5B:**
* GPTSwarm: 5
* HF Agents: 4
* KGOT (Neo4j + Query): 4
* Zero-Shot: 3
### Key Observations
* GPT-4o mini achieves the highest number of solved tasks using KGOT (Neo4j + Query) with a value of 40.
* Zero-Shot performance is generally lower than other methods across all models.
* The KGOT (Neo4j + Query) method consistently performs well across different models.
* DeepSeek-R1-1.5B performs poorly across all methods, with a maximum of 2 solved tasks.
### Interpretation
The chart provides a comparative analysis of different language models and methods for solving tasks. The KGOT (Neo4j + Query) method appears to be the most effective overall, as it consistently achieves high scores across different models. The Zero-Shot method generally underperforms compared to the other methods, suggesting that these models benefit from additional knowledge or prompting strategies. GPT-4o mini and Qwen2.5-72B show the best overall performance, indicating their effectiveness in solving the given tasks. The performance variations across different models and methods highlight the importance of selecting the appropriate model and strategy for specific tasks.
</details>
Figure 4: Performance on the GAIA validation set with KGoT (non-fusion) using various LLMs. For KGoT, we use Cypher queries for knowledge extraction from the Neo4j database.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: Number of Solved Tasks
### Overview
The image is a bar chart comparing the number of solved tasks across different configurations (Neo4j, NetworkX, Neo4j + NetworkX, and No KG) and task types (Query, Direct Retrieve, Query + DR, Single Run #1, Single Run #2, Fusion). The chart displays the number of solved tasks, broken down into three levels (Level 1, Level 2, and Level 3).
### Components/Axes
* **Title:** Number of Solved Tasks (the higher the better)
* **Y-axis:** Number of Solved Tasks (the higher the better), ranging from 0 to 80.
* **X-axis:** Task types: Query, Direct Retrieve, Query + DR, Single Run #1, Single Run #2, Fusion. These are grouped under the configurations Neo4j, NetworkX, Neo4j + NetworkX, and No KG.
* **Legend:** Located at the top of the chart.
* Level 1: Light Blue
* Level 2: Blue
* Level 3: Purple
* **Horizontal Dashed Line:** At y=71, labeled "Max: 71"
### Detailed Analysis
**Neo4j**
* **Query:**
* Level 1: 21
* Level 2: 18
* Level 3: 1
* **Direct Retrieve:**
* Level 1: 21
* Level 2: 16
* Level 3: 3
* **Query + DR:**
* Level 1: 29
* Level 2: 24
* Level 3: 4
**NetworkX**
* **Query:**
* Level 1: 20
* Level 2: 21
* Level 3: 2
* **Direct Retrieve:**
* Level 1: 20
* Level 2: 18
* Level 3: 2
* **Query + DR:**
* Level 1: 27
* Level 2: 28
* Level 3: 2
**Neo4j + NetworkX**
* **Query:**
* Level 1: 28
* Level 2: 25
* Level 3: 3
* **Direct Retrieve:**
* Level 1: 26
* Level 2: 24
* Level 3: 3
* **Query + DR:**
* Level 1: 34
* Level 2: 33
* Level 3: 4
**No KG**
* **Single Run #1:**
* Level 1: 14
* Level 2: 14
* Level 3: 2
* **Single Run #2:**
* Level 1: 17
* Level 2: 16
* Level 3: 2
* **Fusion:**
* Level 1: 18
* Level 2: 20
* Level 3: 2
### Key Observations
* The "Query + DR" task in the "Neo4j + NetworkX" configuration achieves the highest number of solved tasks.
* The "No KG" configuration generally has lower numbers of solved tasks compared to the other configurations.
* Level 1 and Level 2 contribute the most to the total number of solved tasks, with Level 3 contributing the least.
### Interpretation
The chart demonstrates the performance of different knowledge graph systems (Neo4j, NetworkX, and their combination) on various tasks. The "Neo4j + NetworkX" configuration appears to be the most effective, particularly for the "Query + DR" task. The "No KG" configuration serves as a baseline, showing the performance without a knowledge graph. The breakdown into levels could represent different difficulty levels or types of tasks, with Level 1 and Level 2 being more frequently solved than Level 3. The horizontal line at 71 indicates a maximum possible score, suggesting that the "Neo4j + NetworkX" configuration is approaching the optimal performance for the "Query + DR" task.
</details>
Figure 5: The impact coming from harnessing knowledge graphs (KGs) with different knowledge extraction methods (graph queries with Neo4j and Cypher, and general-purpose languages with Python and NetworkX), vs. using no KGs at all. DR stands for Direct Retrieval. Model: GPT-4o mini.
Figure 4 also shows the advantages of KGoT with different open models (Yang et al., 2025; Guo et al., 2025) over HF Agents and GPTSwarm for nearly all considered models. Interestingly, certain sizes of DeepSeek-R1 (Guo et al., 2025) offer high Zero-Shot performance that outperforms both KGoT and HF Agents, illustrating potential for further improvements specifically aimed at Reasoning Language Models (RLMs) (Besta et al., 2025a; c).
Finally, we investigate the impact on performance of harnessing KGs vs. using no KGs at all (the "no KG" baseline), which we illustrate in Figure 5. Harnessing KGs has clear advantages, with a nearly 2× increase in the number of solved tasks. This confirms the positive impact of structuring the task-related knowledge into a graph format, and implies that our workflow generates high-quality graphs. To further confirm this, we additionally verified these graphs manually and discovered that the generated KGs do contain the actual solution (e.g., the solution can be found across nodes/edges of a given KG by string matching). This illustrates that in the majority of the solved tasks, the automatically generated KGs correctly represent the solution and directly enable solving a given task.
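The string-matching check mentioned above can be sketched as follows, assuming a simple dict-based representation of the KG's nodes and edges; the example data and helper name are hypothetical:

```python
# Sketch: verify that the ground-truth answer already appears among the
# KG's node attributes or edges via case-insensitive string matching.
def kg_contains(answer, nodes, edges):
    answer = answer.lower()
    in_nodes = any(answer in str(v).lower()
                   for props in nodes.values()
                   for v in props.values())
    in_edges = any(answer in str(e).lower() for e in edges)
    return in_nodes or in_edges

nodes = {"n1": {"name": "Mount Everest", "height_m": 8849}}
edges = [("n1", "located_in", "Nepal")]
print(kg_contains("everest", nodes, edges))  # True
```

A positive match indicates that the constructed KG already encodes the solution, so any remaining failure lies in the extraction step rather than in graph construction.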
We offer further analyses in Appendix D, including studying the impact on performance from different tool sets, prompt formats as well as fusion types.
6 Related Work
Our work is related to numerous LLM domains.
First, we use LangChain (LangChain Inc., 2025a) to facilitate the integration of the LLM agents with the rest of the KGoT system. Other such LLM integration frameworks, such as MiniChain (Rush, 2023) or AutoChain (Forethought, 2023), could be used instead.
Agent collaboration frameworks are systems such as Magentic-One and numerous others (Zhuge et al., 2024; Tang et al., 2024; Liu et al., 2024b; Li et al., 2024; Chu et al., 2024; Wu et al., 2024; Chen et al., 2024; Hong et al., 2024; Shinn et al., 2023; Zhu et al., 2024; Kagaya et al., 2024; Zhao et al., 2024a; Stengel-Eskin et al., 2024; Significant Gravitas, 2025; Zhu et al., 2025). The core KGoT idea that can be applied to enhance such frameworks is that a KG can also serve as a common shared task representation for multiple agents solving a task together. Such a graph would then be updated by more than a single agent. This idea proves effective, as confirmed by the fact that KGoT outperforms highly competitive baselines (HF Agents, Magentic-One, GPTSwarm) in both the GAIA and SimpleQA benchmarks.
Some agent frameworks explicitly use graphs for more effective collaboration. Examples are GPTSwarm (Zhuge et al., 2024), MacNet (Qian et al., 2025), and AgentPrune (Zhang et al., 2025). These systems differ from KGoT, as they use a graph to model and manage multiple agents in a structured way, forming a hierarchy of tools. Contrarily, KGoT uses KGs to represent the task itself, including its intermediate state. These two design choices are orthogonal and could be combined. Moreover, while KGoT relies only on in-context learning, both MacNet (Qian et al., 2025) and AgentPrune (Zhang et al., 2025) require additional training rounds, making their integration and deployment more challenging and expensive than KGoT.
Many works exist in the domain of general prompt engineering (Beurer-Kellner et al., 2024; Besta et al., 2025c; Yao et al., 2023a; Besta et al., 2024a; Wei et al., 2022; Yao et al., 2023b; Chen et al., 2023; Creswell et al., 2023; Wang et al., 2023a; Hu et al., 2024; Dua et al., 2022; Jung et al., 2022; Ye et al., 2023). One could use such schemes to further enhance respective parts of the KGoT workflow. While we already use prompts that are suited for encoding knowledge graphs, possibly harnessing other ideas from that domain could bring further benefits.
Task decomposition & planning increases the effectiveness of LLMs by dividing a task into subtasks. Examples include ADaPT (Prasad et al., 2024), ANPL (Huang et al., 2023), and others (Zhu et al., 2025; Shen et al., 2023). Overall, the whole KGoT workflow already harnesses recursive task decomposition: the input task is divided into numerous steps, and many of these steps are further decomposed into substeps by the LLM Graph Executor if necessary. For example, when solving a task based on the already constructed KG, the LLM Graph Executor may decide to decompose this step similarly to ADaPT. Other decomposition schemes could also be tried; we leave this as future work.
Retrieval-Augmented Generation (RAG) is an important part of the LLM ecosystem, with numerous designs being proposed (Edge et al., 2025; Gao et al., 2024; Besta et al., 2025b; Zhao et al., 2024b; Hu & Lu, 2025; Huang & Huang, 2024; Yu et al., 2024a; Mialon et al., 2023; Li et al., 2022; Abdallah & Jatowt, 2024; Delile et al., 2024; Manathunga & Illangasekara, 2023; Zeng et al., 2024; Wewer et al., 2021; Xu et al., 2024; Sarthi et al., 2024; Asai et al., 2024; Yu et al., 2024b; Gutiérrez et al., 2024). RAG has been used primarily to ensure data privacy and to reduce hallucinations. We illustrate that it has lower performance than KGoT when applied to AI assistant tasks.
Another increasingly important part of the LLM ecosystem is the usage of tools to augment the abilities of LLMs (Beurer-Kellner et al., 2023; Schick et al., 2023; Xie et al., 2024). For example, ToolNet (Liu et al., 2024a) uses a directed graph to model the application of multiple tools while solving a task, but focuses specifically on the iterative usage of tools at scale. KGoT harnesses a flexible and adaptable hierarchy of various tools, which could easily be extended with designs such as ToolNet to solve a wider range of complex tasks.
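To illustrate the ToolNet-style idea of constraining tool usage with a directed graph, the sketch below encodes legal tool transitions as adjacency lists and checks each step of a tool chain against them. The tool names, their toy implementations, and the transition graph are all hypothetical, not taken from ToolNet or KGoT.

```python
# Toy sketch of selecting tools over a directed tool graph, in the
# spirit of ToolNet. Tool names and edges are illustrative only.

TOOLS = {
    "search": lambda q: f"urls for '{q}'",
    "crawl":  lambda u: f"page text from {u}",
    "python": lambda c: f"result of {c}",
}

# Edges constrain which tool may follow the current one.
TOOL_GRAPH = {
    "start":  ["search", "python"],
    "search": ["crawl", "python"],
    "crawl":  ["python"],
    "python": [],
}

def run_chain(chain: list[str], query: str) -> str:
    """Execute a tool chain, verifying each step is a legal transition."""
    state, current = query, "start"
    for name in chain:
        assert name in TOOL_GRAPH[current], f"illegal step {current}->{name}"
        state = TOOLS[name](state)
        current = name
    return state

print(run_chain(["search", "crawl", "python"], "GAIA benchmark"))
```

In such a design, the graph prunes the space of tool sequences the controller must consider, whereas KGoT's hierarchy of tools leaves the orchestration to the LLM; the two mechanisms are complementary.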
While KGoT focuses on classical AI assistant tasks, it can be extended to other applications. Promising directions include supporting multi-stage, cost-efficient reasoning, for example to enhance the capabilities of recent reasoning models such as DeepSeek-R1. Extending KGoT to this and other domains may require new ways of KG construction via predictive graph models (Besta et al., 2023a; 2024c), integration with neural graph databases (Besta et al., 2022), or deployment over distributed-memory clusters for scalability. Further, refining its reasoning strategies through advanced task decomposition schemes could improve performance on very long-horizon tasks. These directions highlight both the generality of the framework and current boundaries in tool orchestration, reasoning depth, and scalability, which we aim to address in future work.
7 Conclusion
In this paper, we introduce Knowledge Graph of Thoughts (KGoT), an AI assistant architecture that enhances the reasoning capabilities of low-cost models while significantly reducing operational expenses. By dynamically constructing and evolving knowledge graphs (KGs) that encode the task and its resolution state, KGoT enables structured knowledge representation and retrieval, improving task success rates on benchmarks such as GAIA and SimpleQA. Our extensive evaluation demonstrates that KGoT outperforms existing LLM-based agent solutions, for example improving task success rates by 29% or more over the competitive Hugging Face Agents baseline, while ensuring over 36× lower costs. Thanks to its modular design, KGoT can be extended to new domains that require complex multi-step reasoning integrated with extensive interactions with the external compute environment, for example automated scientific discovery or software design.
Acknowledgments
We thank Chi Zhang and Muyang Du for their contributions to the framework. We thank Hussein Harake, Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault, Daint and Alps machines, and for their excellent technical support. We thank Timo Schneider for help with infrastructure at SPCL. This project received funding from the European Research Council (Project PSAP, No. 101002047), and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project was supported by the ETH Future Computing Laboratory (EFCL), financed by a donation from Huawei Technologies. This project received funding from the European Union's HE research and innovation programme under the grant agreement No. 101070141 (Project GLACIATION). We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017103.
References
- Abdallah & Jatowt (2024) Abdelrahman Abdallah and Adam Jatowt. Generator-Retriever-Generator Approach for Open-Domain Question Answering, March 2024. URL https://arxiv.org/abs/2307.11278. arXiv:2307.11278.
- Asai et al. (2024) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 9112–9141, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/25f7be9694d7b32d5cc670927b8091e1-Abstract-Conference.html.
- Bahdanau et al. (2024) Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, and Quaizar Vohra. TapeAgents: A Holistic Framework for Agent Development and Optimization, December 2024. URL https://arxiv.org/abs/2412.08445. arXiv:2412.08445.
- Ben Mahria et al. (2021) Bilal Ben Mahria, Ilham Chaker, and Azeddine Zahi. An Empirical Study on the Evaluation of the RDF Storage Systems. Journal of Big Data, 8(1):100:1–100:20, July 2021. ISSN 2196-1115. doi: 10.1186/s40537-021-00486-y. URL https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00486-y.
- Benedicic et al. (2019) Lucas Benedicic, Felipe A. Cruz, Alberto Madonna, and Kean Mariotti. Sarus: Highly Scalable Docker Containers for HPC Systems. In Michèle Weiland, Guido Juckeland, Sadaf Alam, and Heike Jagode (eds.), Proceedings of the International Conference on High Performance Computing (ISC '19), volume 11887 of Lecture Notes in Computer Science, pp. 46–60, Frankfurt, Germany, June 2019. Springer International Publishing. ISBN 978-3-030-34356-9. doi: 10.1007/978-3-030-34356-9_5. URL https://link.springer.com/chapter/10.1007/978-3-030-34356-9_5.
- Besta et al. (2018) Maciej Besta, Dimitri Stanojevic, Tijana Zivic, Jagpreet Singh, Maurice Hoerold, and Torsten Hoefler. Log(Graph): A Near-Optimal High-Performance Graph Representation. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT '18, pp. 7:1–7:13, Limassol, Cyprus, November 2018. Association for Computing Machinery. ISBN 9781450359863. doi: 10.1145/3243176.3243198. URL https://doi.org/10.1145/3243176.3243198.
- Besta et al. (2022) Maciej Besta, Patrick Iff, Florian Scheidl, Kazuki Osawa, Nikoli Dryden, Michal Podstawski, Tiancheng Chen, and Torsten Hoefler. Neural Graph Databases. In Bastian Rieck and Razvan Pascanu (eds.), Proceedings of the First Learning on Graphs Conference, volume 198 of Proceedings of Machine Learning Research, pp. 31:1–31:38, Virtual Event, December 2022. PMLR. URL https://proceedings.mlr.press/v198/besta22a.html.
- Besta et al. (2023a) Maciej Besta, Afonso Claudino Catarino, Lukas Gianinazzi, Nils Blach, Piotr Nyczyk, Hubert Niewiadomski, and Torsten Hoefler. HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers. In Soledad Villar and Benjamin Chamberlain (eds.), Proceedings of the Second Learning on Graphs Conference, volume 231 of Proceedings of Machine Learning Research, pp. 15:1–15:20, Virtual Event, November 2023a. PMLR. URL https://proceedings.mlr.press/v231/besta24a.html.
- Besta et al. (2023b) Maciej Besta, Robert Gerstenberger, Marc Fischer, Michal Podstawski, Nils Blach, Berke Egeli, Georgy Mitenkov, Wojciech Chlapek, Marek Michalewicz, Hubert Niewiadomski, Jürgen Müller, and Torsten Hoefler. The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23, pp. 22:1–22:18, Denver, CO, USA, November 2023b. Association for Computing Machinery. ISBN 9798400701092. doi: 10.1145/3581784.3607068. URL https://doi.org/10.1145/3581784.3607068.
- Besta et al. (2023c) Maciej Besta, Robert Gerstenberger, Emanuel Peter, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Comput. Surv., 56(2):31:1–31:40, September 2023c. ISSN 0360-0300. doi: 10.1145/3604932. URL https://doi.org/10.1145/3604932.
- Besta et al. (2024a) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024a. doi: 10.1609/aaai.v38i16.29720. URL https://ojs.aaai.org/index.php/AAAI/article/view/29720.
- Besta et al. (2024b) Maciej Besta, Robert Gerstenberger, Patrick Iff, Pournima Sonawane, Juan GĂłmez Luna, Raghavendra Kanakagiri, Rui Min, Onur Mutlu, Torsten Hoefler, Raja Appuswamy, and Aidan O Mahony. Hardware Acceleration for Knowledge Graph Processing: Challenges & Recent Developments, November 2024b. URL https://arxiv.org/abs/2408.12173. arXiv:2408.12173.
- Besta et al. (2024c) Maciej Besta, Florian Scheidl, Lukas Gianinazzi, Grzegorz Kwaśniewski, Shachar Klaiman, Jürgen Müller, and Torsten Hoefler. Demystifying Higher-Order Graph Neural Networks, December 2024c. URL https://arxiv.org/abs/2406.12841. arXiv:2406.12841.
- Besta et al. (2025a) Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Zixuan Chen, Hubert Niewiadomski, and Torsten Hoefler. Reasoning Language Models: A Blueprint, June 2025a. URL https://arxiv.org/abs/2501.11223. arXiv:2501.11223.
- Besta et al. (2025b) Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michał Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, and Torsten Hoefler. Multi-Head RAG: Solving Multi-Aspect Problems with LLMs, July 2025b. URL https://arxiv.org/abs/2406.05085. arXiv:2406.05085.
- Besta et al. (2025c) Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O'Mahony, Onur Mutlu, and Torsten Hoefler. Demystifying Chains, Trees, and Graphs of Thoughts. IEEE Transactions on Pattern Analysis and Machine Intelligence, August 2025c. doi: 10.1109/TPAMI.2025.3598182. URL https://ieeexplore.ieee.org/document/11123142.
- Besta et al. (2025d) Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, and Torsten Hoefler. CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks, July 2025d. URL https://arxiv.org/abs/2406.02524. arXiv:2406.02524.
- Beurer-Kellner et al. (2023) Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Large Language Models are Zero-Shot Multi-Tool Users. In Proceedings of the ICML Workshop on Knowledge and Logical Reasoning in the Era of Data-Driven Learning, KLR â23, Honolulu, HI, USA, July 2023. URL https://files.sri.inf.ethz.ch/website/papers/lmql_actions.pdf.
- Beurer-Kellner et al. (2024) Luca Beurer-Kellner, Mark Niklas Müller, Marc Fischer, and Martin Vechev. Prompt Sketching for Large Language Models. In Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 3674–3706, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/beurer-kellner24b.html.
- Bhattacharjya et al. (2024) Debarun Bhattacharjya, Junkyu Lee, Don Joven Agravante, Balaji Ganesan, and Radu Marinescu. Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning, February 2024. URL https://arxiv.org/abs/2402.01602. arXiv:2402.01602.
- Chen et al. (2024) Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. AutoAgents: A Framework for Automatic Agent Generation. In Kate Larson (ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, pp. 22–30, Jeju, South Korea, August 2024. International Joint Conferences on Artificial Intelligence Organization. doi: 10.24963/ijcai.2024/3. URL https://www.ijcai.org/proceedings/2024/3.
- Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research, November 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
- Chu et al. (2024) Zhixuan Chu, Yan Wang, Feng Zhu, Lu Yu, Longfei Li, and Jinjie Gu. Professional Agents – Evolving Large Language Models into Autonomous Experts with Human-Level Competencies, February 2024. URL https://arxiv.org/abs/2402.03628. arXiv:2402.03628.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR '23, Kigali, Rwanda, May 2023. OpenReview. URL https://openreview.net/forum?id=3Pf3Wg6o-A4.
- Delile et al. (2024) Julien Delile, Srayanta Mukherjee, Anton Van Pamel, and Leonid Zhukov. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge. In Proceedings of the Workshop ML for Life and Material Science: From Theory to Industry Applications, ML4LMS '24, Vienna, Austria, July 2024. OpenReview. URL https://openreview.net/forum?id=RUwfsPWrv3.
- Docker Inc. (2025) Docker Inc. Docker: Accelerated Container Applications. https://www.docker.com/, July 2025. Accessed: 2025-09-22.
- Dua et al. (2022) Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive Prompting for Decomposing Complex Questions. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP '22, pp. 1251–1265, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.81. URL https://aclanthology.org/2022.emnlp-main.81/.
- Eclipse Foundation (2025) Eclipse Foundation. RDF4J. https://rdf4j.org/, September 2025. Accessed: 2025-09-22.
- Edge et al. (2025) Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From Local to Global: A Graph RAG Approach to Query-Focused Summarization, February 2025. URL https://arxiv.org/abs/2404.16130. arXiv:2404.16130.
- Emonet et al. (2024) Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-Based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. In Reham Alharbi, Jacopo de Berardinis, Paul Groth, Albert Meroño Peñuela, Elena Simperl, and Valentina Tamma (eds.), Proceedings of the Special Session on Harmonising Generative AI and Semantic Web Technologies (HGAIS â24), volume 3953 of Workshop Proceedings, Baltimore, MD, USA, November 2024. CEUR. URL https://ceur-ws.org/Vol-3953/355.pdf.
- Forethought (2023) Forethought. AutoChain. https://autochain.forethought.ai/, 2023. Accessed: 2025-09-22.
- Fourney et al. (2024) Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks, November 2024. URL https://arxiv.org/abs/2411.04468. arXiv:2411.04468.
- Francis et al. (2018) Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the International Conference on Management of Data, SIGMOD '18, pp. 1433–1445, Houston, TX, USA, June 2018. Association for Computing Machinery. ISBN 9781450347037. doi: 10.1145/3183713.3190657. URL https://doi.org/10.1145/3183713.3190657.
- Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey, March 2024. URL https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
- Gu et al. (2025) Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A Survey on LLM-as-a-Judge, March 2025. URL https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, January 2025. URL https://arxiv.org/abs/2501.12948. arXiv:2501.12948.
- Guo et al. (2024) Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. In Kate Larson (ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, pp. 8048–8057, Jeju, South Korea, August 2024. International Joint Conferences on Artificial Intelligence Organization. doi: 10.24963/ijcai.2024/890. URL https://www.ijcai.org/proceedings/2024/890. Survey Track.
- Gutiérrez et al. (2024) Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS '24), volume 37 of Advances in Neural Information Processing Systems, pp. 59532–59569, Vancouver, Canada, December 2024. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/6ddc001d07ca4f319af96a3024f6dbd1-Abstract-Conference.html.
- Hong et al. (2024) Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 23247–23275, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/6507b115562bb0a305f1958ccc87355a-Abstract-Conference.html.
- Hu et al. (2024) Hanxu Hu, Hongyuan Lu, Huajian Zhang, Wai Lam, and Yue Zhang. Chain-of-Symbol Prompting Elicits Planning in Large Language Models, August 2024. URL https://arxiv.org/abs/2305.10276. arXiv:2305.10276.
- Hu & Lu (2025) Yucheng Hu and Yuxing Lu. RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, June 2025. URL https://arxiv.org/abs/2404.19543. arXiv:2404.19543.
- Huang et al. (2023) Di Huang, Ziyuan Nan, Xing Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, and Yunji Chen. ANPL: Towards Natural Programming with Interactive Decomposition. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 69404–69440, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/dba8fa689ede9e56cbcd4f719def38fb-Abstract-Conference.html.
- Huang & Huang (2024) Yizheng Huang and Jimmy Huang. A Survey on Retrieval-Augmented Text Generation for Large Language Models, August 2024. URL https://arxiv.org/abs/2404.10981. arXiv:2404.10981.
- Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP '22, pp. 1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.82. URL https://aclanthology.org/2022.emnlp-main.82/.
- Kaddour et al. (2023) Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and Applications of Large Language Models, July 2023. URL https://arxiv.org/abs/2307.10169. arXiv:2307.10169.
- Kagaya et al. (2024) Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, and Yang You. RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents. In Proceedings of the Workshop on Open-World Agents, OWA '24, Vancouver, Canada, December 2024. OpenReview. URL https://openreview.net/forum?id=Xf49Dpxuox.
- Kim et al. (2024) Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM Compiler for Parallel Function Calling. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 24370–24391, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/kim24y.html.
- LangChain Inc. (2025a) LangChain Inc. LangChain. https://www.langchain.com/, 2025a. Accessed: 2025-09-22.
- LangChain Inc. (2025b) LangChain Inc. Dealing with API Errors. https://js.langchain.com/v0.1/docs/modules/data_connection/text_embedding/api_errors/, 2025b. Accessed: 2025-09-22.
- LangChain Inc. (2025c) LangChain Inc. LangChain Core Tools: BaseTool. https://api.python.langchain.com/en/latest/tools/langchain_core.tools.BaseTool.html, 2025c. Accessed: 2025-09-22.
- LangChain Inc. (2025d) LangChain Inc. How to parse JSON output. https://python.langchain.com/docs/how_to/output_parser_json/, 2025d. Accessed: 2025-09-22.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Proceedings of the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS '20), volume 33 of Advances in Neural Information Processing Systems, pp. 9459–9474, Virtual Event, December 2020. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
- Li & Vasarhelyi (2024) Huaxia Li and Miklos A. Vasarhelyi. Applying Large Language Models in Accounting: A Comparative Analysis of Different Methodologies and Off-the-Shelf Examples. Journal of Emerging Technologies in Accounting, 21(2):133–152, October 2024. ISSN 1554-1908. doi: 10.2308/JETA-2023-065. URL https://publications.aaahq.org/jeta/article-abstract/21/2/133/12800/.
- Li et al. (2022) Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A Survey on Retrieval-Augmented Text Generation, February 2022. URL https://arxiv.org/abs/2202.01110. arXiv:2202.01110.
- Li et al. (2024) Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More Agents Is All You Need. Transactions on Machine Learning Research, October 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=bgzUSZ8aeg.
- Liu et al. (2024a) Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph, February 2024a. URL https://arxiv.org/abs/2403.00839. arXiv:2403.00839.
- Liu et al. (2024b) Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration. In Proceedings of the First Conference on Language Modeling, COLM '24, Philadelphia, PA, USA, October 2024b. OpenReview. URL https://openreview.net/forum?id=XII0Wp1XA9.
- Manathunga & Illangasekara (2023) S. S. Manathunga and Y. A. Illangasekara. Retrieval Augmented Generation and Representative Vector Summarization for Large Unstructured Textual Data in Medical Education, August 2023. URL https://arxiv.org/abs/2308.00479. arXiv:2308.00479.
- Mecharnia & d'Aquin (2025) Thamer Mecharnia and Mathieu d'Aquin. Performance and Limitations of Fine-Tuned LLMs in SPARQL Query Generation. In Genet Asefa Gesese, Harald Sack, Heiko Paulheim, Albert Merono-Penuela, and Lihu Chen (eds.), Proceedings of the Workshop on Generative AI and Knowledge Graphs, GenAIK '25, pp. 69–77, Abu Dhabi, United Arab Emirates, January 2025. International Committee on Computational Linguistics. URL https://aclanthology.org/2025.genaik-1.8/.
- Mialon et al. (2023) Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented Language Models: A Survey. Transactions on Machine Learning Research, July 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=jh7wH2AzKK. Survey Certification.
- Mialon et al. (2024) Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 9025–9049, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/25ae35b5b1738d80f1f03a8713e405ec-Abstract-Conference.html.
- Mialon et al. (2025) Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA Leaderboard. https://huggingface.co/spaces/gaia-benchmark/leaderboard, September 2025. Accessed: 2025-09-25.
- NetworkX Developers (2025) NetworkX Developers. NetworkX Documentation. https://networkx.org/, May 2025. Accessed: 2025-09-22.
- OpenAI (2025) OpenAI. simple-evals. https://github.com/openai/simple-evals, July 2025. Accessed: 2025-09-22.
- Pérez et al. (2009) Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and Complexity of SPARQL. ACM Trans. Database Syst., 34(3):16:1–16:45, September 2009. ISSN 0362-5915. doi: 10.1145/1567274.1567278. URL https://doi.org/10.1145/1567274.1567278.
- Prasad et al. (2024) Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-Needed Decomposition and Planning with Language Models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp. 4226–4252, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.264. URL https://aclanthology.org/2024.findings-naacl.264/.
- Python Software Foundation (2025a) Python Software Foundation. codecs – Codec registry and base classes. https://docs.python.org/3/library/codecs.html, September 2025a. Accessed: 2025-09-22.
- Python Software Foundation (2025b) Python Software Foundation. asyncio – Asynchronous I/O. https://docs.python.org/3/library/asyncio.html, September 2025b. Accessed: 2025-09-22.
- Qian et al. (2025) Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling Large Language Model-Based Multi-Agent Collaboration. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), Proceedings of the Thirteenth International Conference on Learning Representations, ICLR '25, pp. 41488–41505, Singapore, April 2025. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/66a026c0d17040889b50f0dfa650e5e0-Abstract-Conference.html.
- Robinson et al. (2015) Ian Robinson, Jim Webber, and Emil Eifrem. Graph Database Internals. In Graph Databases, chapter 7, pp. 149–170. O'Reilly, Sebastopol, CA, USA, 2nd edition, 2015. ISBN 9781491930892.
- Roucher & Petrov (2025) Aymeric Roucher and Sergei Petrov. Beating GAIA with Transformers Agents. https://github.com/aymeric-roucher/GAIA, February 2025. Accessed: 2025-09-22.
- Rush (2023) Alexander Rush. MiniChain: A Small Library for Coding with Large Language Models. In Yansong Feng and Els Lefever (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP '23, pp. 311–317, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.27. URL https://aclanthology.org/2023.emnlp-demo.27.
- Sarthi et al. (2024) Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher Manning. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 32628–32649, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/8a2acd174940dbca361a6398a4f9df91-Abstract-Conference.html.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 68539–68551, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html.
- SerpApi LLM (2025) SerpApi LLM. SerpApi: Google Search API. https://serpapi.com/, 2025. Accessed: 2025-09-22.
- Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 38154–38180, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/77c33e6a367922d003ff102ffb92b658-Abstract-Conference.html.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 8634–8652, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html.
- Significant Gravitas (2025) Significant Gravitas. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT, September 2025. Accessed: 2025-09-22.
- Singhal (2012) Amit Singhal. Introducing the Knowledge Graph: things, not strings. https://www.blog.google/products/search/introducing-knowledge-graph-things-not/, May 2012. Accessed: 2025-09-22.
- Stengel-Eskin et al. (2024) Elias Stengel-Eskin, Archiki Prasad, and Mohit Bansal. ReGAL: Refactoring Programs to Discover Generalizable Abstractions. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 46605–46624, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/stengel-eskin24a.html.
- Sumers et al. (2024) Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive Architectures for Language Agents. Transactions on Machine Learning Research, February 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=1i6ZCvflQJ. Survey Certification.
- Tang et al. (2024) Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawendé F. Bissyandé. CodeAgent: Autonomous Communicative Agents for Code Review. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP '24, pp. 11279–11313, Miami, FL, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.632. URL https://aclanthology.org/2024.emnlp-main.632/.
- Tenacity Developers (2025a) Tenacity Developers. Tenacity: Retrying Library. https://github.com/jd/tenacity, April 2025a. Accessed: 2025-09-22.
- Tenacity Developers (2025b) Tenacity Developers. Tenacity Documentation. https://tenacity.readthedocs.io/en/latest/, 2025b. Accessed: 2025-09-22.
- Wang et al. (2023a) Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation, October 2023a. URL https://arxiv.org/abs/2310.01320. arXiv:2310.01320.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR '23, Kigali, Rwanda, May 2023b. OpenReview. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Wang et al. (2023c) Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian (Shawn) Ma, and Yitao Liang. Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 34153–34189, New Orleans, LA, USA, December 2023c. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/6b8dfb8c0c12e6fafc6c256cb08a5ca7-Abstract-Conference.html.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Proceedings of the Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS '22), volume 35 of Advances in Neural Information Processing Systems, pp. 24824–24837, New Orleans, LA, USA, December 2022. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
- Wei et al. (2024) Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring Short-Form Factuality in Large Language Models, November 2024. URL https://arxiv.org/abs/2411.04368. arXiv:2411.04368.
- Wewer et al. (2021) Christopher Wewer, Florian Lemmerich, and Michael Cochez. Updating Embeddings for Dynamic Knowledge Graphs, September 2021. URL https://arxiv.org/abs/2109.10896. arXiv:2109.10896.
- Wu et al. (2024) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In Proceedings of the First Conference on Language Modeling, COLM '24, Philadelphia, PA, USA, October 2024. OpenReview. URL https://openreview.net/forum?id=BAakY1hNKS.
- Xie et al. (2024) Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Zeju Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. OpenAgents: An Open Platform for Language Agents in the Wild. In Proceedings of the First Conference on Language Modeling, COLM '24, Philadelphia, PA, USA, October 2024. OpenReview. URL https://openreview.net/forum?id=sKATR2O1Y0.
- Xu et al. (2024) Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Chaojun Xiao, Zhiyuan Liu, Ge Yu, and Chenyan Xiong. ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents, October 2024. URL https://arxiv.org/abs/2402.13547. arXiv:2402.13547.
- Yang et al. (2025) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, et al. Qwen2.5 Technical Report, January 2025. URL https://arxiv.org/abs/2412.15115. arXiv:2412.15115.
- Yao et al. (2023a) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 11809–11822, New Orleans, LA, USA, December 2023a. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html.
- Yao et al. (2023b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR '23, Kigali, Rwanda, May 2023b. OpenReview. URL https://openreview.net/forum?id=WE_vluYUL-X.
- Ye et al. (2023) Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large Language Models Are Versatile Decomposers: Decomposing Evidence and Questions for Table-Based Reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, pp. 174–184, Taipei, Taiwan, July 2023. Association for Computing Machinery. ISBN 9781450394086. doi: 10.1145/3539618.3591708. URL https://doi.org/10.1145/3539618.3591708.
- Yu et al. (2024a) Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of Retrieval-Augmented Generation: A Survey. In Wenwu Zhu, Hui Xiong, Xiuzhen Cheng, Lizhen Cui, Zhicheng Dou, Junyu Dong, Shanchen Pang, Li Wang, Lanju Kong, and Zhenxiang Chen (eds.), Proceedings of the 12th CCF Conference, BigData, volume 2301 of Communications in Computer and Information Science (CCIS), pp. 102–120, Qingdao, China, August 2024a. Springer Nature. ISBN 978-981-96-1024-2. doi: 10.1007/978-981-96-1024-2_8. URL https://link.springer.com/chapter/10.1007/978-981-96-1024-2_8.
- Yu et al. (2024b) Wenhao Yu, Hongming Zhang, Xiaoman Pan, Peixin Cao, Kaixin Ma, Jian Li, Hongwei Wang, and Dong Yu. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP '24, pp. 14672–14685, Miami, FL, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.813. URL https://aclanthology.org/2024.emnlp-main.813/.
- Zeng et al. (2024) Huimin Zeng, Zhenrui Yue, Qian Jiang, and Dong Wang. Federated Recommendation via Hybrid Retrieval Augmented Generation. In Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, and Xiaohua Hu (eds.), Proceedings of the IEEE International Conference on Big Data, BigData '24, pp. 8078–8087, Washington, DC, USA, December 2024. IEEE Press. doi: 10.1109/BigData62323.2024.10825302. URL https://ieeexplore.ieee.org/document/10825302.
- Zhang et al. (2025) Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. Cut the Crap: An Economical Communication Pipeline for LLM-Based Multi-Agent Systems. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), Proceedings of the Thirteenth International Conference on Learning Representations, ICLR '25, pp. 75389–75428, Singapore, April 2025. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/bbc461518c59a2a8d64e70e2c38c4a0e-Abstract-Conference.html.
- Zhao et al. (2024a) Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632–19642, March 2024a. doi: 10.1609/aaai.v38i17.29936. URL https://ojs.aaai.org/index.php/AAAI/article/view/29936.
- Zhao et al. (2024b) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-Augmented Generation for AI-Generated Content: A Survey, June 2024b. URL https://arxiv.org/abs/2402.19473. arXiv:2402.19473.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 46595–46623, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
- Zhu et al. (2025) Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen, and Ningyu Zhang. KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3709–3732, Albuquerque, NM, USA, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.205/.
- Zhu et al. (2024) Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large Language Models Can Learn Rules, December 2024. URL https://arxiv.org/abs/2310.07064. arXiv:2310.07064.
- Zhuge et al. (2024) Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language Agents as Optimizable Graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 62743–62767, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/zhuge24a.html.
Appendix A Additional Examples of Knowledge Graph Representation of Tasks
We include selected snapshots of the KG representation of tasks, covering a wide range of graph structures from simple chains to trees and cyclic graphs. Each snapshot captures the current KG state in a JSON file, exported using a predefined query that retrieves all labeled nodes and edges. Regardless of the underlying graph backend, the use of a consistent export format allows all snapshots to be visualized through Neo4j's built-in web interface. In the following, we showcase illustrations of such snapshots together with the corresponding task statements from the GAIA validation set. Please note that the GAIA benchmark discourages making its tasks accessible to crawling. To honor this request, we replaced the names of entities with placeholders in the following examples, while keeping the overall structure intact.
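The backend-agnostic snapshot export described above could look like the following minimal sketch. Plain dictionaries stand in for the graph backend, and the node and edge field names are hypothetical, not the exact KGoT schema:

```python
import json

# Hypothetical in-memory state of a small KG: labeled nodes and edges.
nodes = [
    {"id": 0, "label": "Writer", "properties": {"name": "[firstname lastname]"}},
    {"id": 1, "label": "WordOfTheDay", "properties": {"word": "[concept]", "date": "[date1]"}},
]
edges = [
    {"source": 0, "target": 1, "label": "QUOTED_FOR", "properties": {}},
]

def export_snapshot(nodes, edges):
    """Serialize the current KG state to a backend-agnostic JSON snapshot."""
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)

snapshot = export_snapshot(nodes, edges)
```

Because every snapshot shares this shape regardless of whether Neo4j or NetworkX produced it, a single visualization path suffices.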
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph for Question Answering
### Overview
The image illustrates the process of resolving a question using an enhanced knowledge graph. It shows a question being processed and transformed into a structured knowledge graph representation.
### Components/Axes
* **Left Side:**
* **Question:** "What writer is quoted by Merriam-Webster for the Word of the Day from [date]?"
* **Question Number:** 59
* **Required Tool(s):**
* 1 Web browser (represented by a globe icon)
* 2 Search engine (represented by a magnifying glass icon)
* 3 Audio capability (represented by a speaker icon)
* **Middle:**
* **KGOT Task Resolution:** An arrow pointing from left to right, indicating the flow of information.
* **Right Side:**
* **Enhanced Knowledge Graph:** A purple rounded rectangle containing a knowledge graph.
* **Nodes:**
* Date (black circle)
* Concept (black circle, labeled as "[concept]")
* Quote (black circle)
* Word (white circle with "Word" inside)
* [firstname lastname] (white circle with "..." inside)
* **Edges:**
* HAS_DATE: Connects Date to Concept
* HAS_QUOTE: Connects Concept to Quote
* QUOTED_BY: Connects Quote to [firstname lastname]
### Detailed Analysis
The diagram shows how a question is transformed into a knowledge graph. The question requires the use of a web browser, a search engine, and audio capability. The KGoT task resolution process converts the question into a knowledge graph with nodes representing Date, Concept, Quote, Word, and [firstname lastname]. The edges represent the relationships between these nodes: HAS_DATE, HAS_QUOTE, and QUOTED_BY.
### Key Observations
* The question is about finding the writer quoted by Merriam-Webster for the Word of the Day.
* The knowledge graph represents the relationships between the date, the concept, the quote, the word, and the writer's name.
* The use of "[date]" and "[firstname lastname]" indicates that these are placeholders for specific values that would be extracted during the task resolution process.
### Interpretation
The diagram illustrates a knowledge-based approach to question answering. The question is parsed and transformed into a structured representation (the knowledge graph), which can then be used to retrieve the answer. The knowledge graph captures the key entities and relationships involved in the question, allowing for more efficient and accurate information retrieval. The diagram highlights the importance of tools like web browsers, search engines, and audio capabilities in the question answering process.
</details>
Figure 6: Example of a chain structure. This task requires 7 intermediate steps and the usage of 3 tools. The expected solution is "[firstname lastname]". KGoT invokes the Surfer agent to search for relevant pages, locate the relevant quote, and find the person who said it. All intermediate information is successfully retrieved and used to enhance the dynamically constructed KG. The quote node contains two properties, "significance" and "text": "significance" stores the meaning of the quote, whereas "text" stores the actual quote.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph for Question Answering
### Overview
The image presents a diagram illustrating the process of answering a question using a knowledge graph. It shows a question, the required tools, and a simplified knowledge graph representing relationships between entities.
### Components/Axes
* **Left Panel:** Contains the question and required tools.
* **Question:** "Question: 51. The [museum name] has a portrait in its collection with an accession number of [number]. Of the consecrators and co-consecrators of this portrait's subject as a bishop, what is the name of the one who never became pope?"
* **Required Tool(s):**
* "1 Web browser" (with a spider icon)
* "2 Search engine" (with a magnifying glass icon)
* **Middle:** An arrow labeled "KGOT Task Resolution" pointing from left to right.
* **Right Panel:** Shows an "Enhanced Knowledge Graph" within a light purple box.
* **Nodes:** Represent entities (Bishop, Pope, people).
* A node labeled "Bishop" is white with a black border.
* Two nodes labeled "Pope" are white with black borders.
* Other nodes are solid black.
* **Edges:** Represent relationships between entities, labeled "CO_CONSECRATED".
* **Node Labels:**
* "[firstname1 lastname1]" (top)
* "[firstname2 lastname2]" (bottom-left)
* "[firstname3 lastname3]" (right)
* "[popename]" (bottom-center)
### Detailed Analysis or ### Content Details
* **Question:** The question is about identifying a person who consecrated a bishop but never became a pope. It requires information about a portrait in a museum's collection.
* **KGoT Task Resolution:** This indicates the use of the Knowledge Graph of Thoughts (KGoT) framework to resolve the task.
* **Enhanced Knowledge Graph:**
* The "Bishop" node is connected to a black node labeled "[firstname2 lastname2]" via a "CO_CONSECRATED" edge.
* The "[firstname2 lastname2]" node is also connected to the "[popename]" node via a "CO_CONSECRATED" edge.
* The "[popename]" node is connected to the "[firstname3 lastname3]" node labeled "Pope" via a "CO_CONSECRATED" edge.
* The "[popename]" node is connected to the "[firstname1 lastname1]" node via a "CO_CONSECRATED" edge.
* The "[firstname1 lastname1]" node is connected to the "Pope" node via a "CO_CONSECRATED" edge.
### Key Observations
* The diagram illustrates how a question can be answered by traversing a knowledge graph.
* The "CO_CONSECRATED" relationship is central to the question, linking bishops and popes.
* The graph suggests a path to identify the person who consecrated the bishop but was not a pope.
### Interpretation
The diagram demonstrates a simplified knowledge graph used for question answering. The question requires identifying a person who consecrated a bishop but never became pope. The knowledge graph represents entities (people, roles) and relationships (co-consecration). By traversing the graph, the system can potentially identify the correct individual. The diagram highlights the importance of structured knowledge representation in automated question answering systems. The use of "[...]" placeholders indicates that the actual names and accession numbers would be populated from a real knowledge base.
</details>
Figure 7: Example of a tree structure. This task requires 6 intermediate steps and the usage of 2 tools. The expected solution is "[firstname1 lastname1]". The Surfer agent is also invoked for this task. In this KG representation of the task, [popename] is identified as the consecrator, while [firstname1 lastname1], [firstname2 lastname2], and [firstname3 lastname3] are all co-consecrators. Subsequently, KGoT obtains the correct answer from the KG by identifying [firstname1 lastname1] as the one without the Pope label.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Task Resolution
### Overview
The image presents a diagram illustrating the resolution of a question using an enhanced knowledge graph. The question involves finding the number of studio albums published by a specific artist within a given time frame. The diagram shows the process of using web browser and search engine tools to query a knowledge graph and retrieve the necessary information.
### Components/Axes
* **Left Side:**
* **Question:** "Question: 6"
* **Question Text:** "How many studio albums were published by [firstname lastname] between [year] and [year] (included)? You can use the latest 2022 version of english wikipedia."
* **Required Tools:**
* "1 Web browser" (with a web browser icon)
* "2 Search engine" (with a magnifying glass icon)
* **Middle:**
* **KGOT Task Resolution:** Text label.
* **Arrow:** A thick black arrow pointing from left to right, indicating the flow of task resolution.
* **Right Side:**
* **Enhanced Knowledge Graph:** Title of the knowledge graph diagram, enclosed in a rounded rectangle with a light purple background.
* **Nodes:**
* A central node labeled "[firstname lastname]".
* Four nodes labeled "[album name 1]", "[album name 2]", "[album name 3]", and "[album name 4]".
* Each album node has a smaller node attached to it labeled "YEAR".
* **Edges:**
* Edges labeled "RELEASED" connect the central node to each of the album nodes.
### Detailed Analysis or Content Details
* **Question:** The question requires finding the number of studio albums published by a specific artist within a specified year range.
* **Tools:** The required tools are a web browser and a search engine, suggesting that the information needs to be retrieved from online sources.
* **Knowledge Graph:** The knowledge graph represents the relationships between the artist and their albums. The central node represents the artist, and the surrounding nodes represent the albums. The "RELEASED" edges indicate that the artist released these albums. The "YEAR" nodes associated with each album indicate the release year of the album.
### Key Observations
* The diagram illustrates a process of using external tools (web browser, search engine) to query a knowledge graph.
* The knowledge graph is structured to represent the relationship between an artist and their albums, including the release year of each album.
* The diagram suggests that the question can be answered by querying the knowledge graph for albums released by the specified artist within the specified year range.
### Interpretation
The diagram demonstrates how a question requiring external knowledge can be resolved using a knowledge graph. The question is first processed to identify the necessary information (artist, year range). Then, external tools are used to query a knowledge graph, which contains structured information about the artist and their albums. The knowledge graph is then used to retrieve the relevant information and answer the question. The diagram highlights the importance of knowledge graphs in question answering and information retrieval. The use of the 2022 version of English Wikipedia suggests that the knowledge graph is based on up-to-date information.
</details>
Figure 8: Example of a tree structure. This task requires 4 intermediate steps and the usage of 2 tools. The expected solution is "4". This is a trap question where only the studio albums should be taken into account. In addition to the release years, the type of each album is also stored as a property in the KG. Please note that the original GAIA task has a different solution, which we do not want to reveal.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: Enhanced Knowledge Graph for Task Resolution
### Overview
The image presents a task resolution process using an enhanced knowledge graph. It outlines the steps involved in executing a Python script, compiling C++ code, and performing calculations. The diagram visually represents the flow of information and dependencies between different components.
### Components/Axes
* **Title:** Enhanced Knowledge Graph
* **Subtitle:** KGOT Task Resolution
* **Nodes:** Script, URL, SourceCode, Array, SortedArray, Integer (x3)
* **Edges:** GENERATES, LEADS\_TO, PROCESSES, SORTS\_TO, HAS\_INTEGER (x2), RESULTS\_IN, SUMS\_WITH
* **Arrow:** A large black arrow pointing from left to right, indicating the flow of the task resolution process.
* **Required Tools:**
1. Web browser
2. Search engine
3. File handling
4. Computer vision OCR
5. Code execution
6. Calculator
### Detailed Analysis
The diagram illustrates the following process:
1. **Script** GENERATES a **URL**.
2. The **URL** LEADS\_TO **SourceCode**.
3. The **SourceCode** PROCESSES an **Array**.
4. The **Array** SORTS\_TO a **SortedArray**.
5. The **SortedArray** HAS\_INTEGER values of 42 and 23.
6. The **Integer** value 23 RESULTS\_IN another **Integer**.
7. The **SortedArray** and another **Integer** SUMS\_WITH the **Integer** value 42.
8. The final **Integer** has a value of 65.
The left side of the image contains the following text:
"Question: 106
The attached image contains a Python script. Run the Python code against an array of strings, listed below. The output of the Python script will be a URL containing C++ source code. Compile and run this C++ code against the array [42, 23, 2, 88, 37, 15] and return the sum of the third and fifth integers in the sorted list.
arr = \['URL', 'ele', 'me', 'nts', 'as', 'sho', 'rt', 'str', 'ings']"
### Key Observations
* The diagram visually represents the steps described in the text on the left.
* The process starts with a Python script and ends with a calculated integer value.
* The diagram highlights the dependencies between different components, such as the URL leading to source code and the array being sorted.
* The final result of the process is the integer 65, which is derived from the sorted array.
### Interpretation
The enhanced knowledge graph provides a visual representation of the task resolution process. It demonstrates how different components, such as the Python script, C++ code, and array manipulation, are interconnected to achieve the final result. The diagram helps to understand the flow of information and dependencies between these components. The task involves running a Python script that outputs a URL containing C++ source code. This C++ code is then compiled and run against the array [42, 23, 2, 88, 37, 15]. The goal is to sort this array, identify the third and fifth integers in the sorted list, and return their sum. The sorted array would be [2, 15, 23, 37, 42, 88]. The third integer is 23, and the fifth integer is 42. Their sum is 23 + 42 = 65. The diagram accurately reflects this process and the final result.
</details>
Figure 9: Example of a cyclic graph structure. This task requires 7 intermediate steps and the usage of 6 tools. The expected solution is "65". Here, the Array node has the property "values" set to $[42,23,2,88,37,15]$, and SortedArray contains the correctly sorted values $[2,15,23,37,42,88]$. The final solution "65" is correctly retrieved and parsed as the KGoT response. Please note that we used different array values than in the original GAIA task.
A.1 Graph Storage Representation of Knowledge Graph Examples
We now illustrate two examples of knowledge graphs and how they are represented in Neo4j and NetworkX, respectively, as well as the queries used to extract the final solution. Please note again that we either replaced the values with placeholders (first question) or used different values (second question) in order not to leak the GAIA benchmark questions.
We start with GAIA question 59, which is illustrated in Figure 6. The knowledge graph stored in Neo4j after the first iteration is shown in the code snippet below.
Neo4j KG representation while processing question 59.
Nodes:
Label: Writer {neo4j_id: 0, properties: {'name': '[firstname lastname]'}}
Label: WordOfTheDay {neo4j_id: 1, properties: {'pronunciation': '[concept]', 'definition': 'textual definition', 'counter': 1, 'origin': 'some war between year-year', 'word': '[concept]', 'date': '[date1]'}}
Label: Quote {neo4j_id: 2, properties: {'text': '[quote]', 'source': '[newspaper name]', 'date': '[date2]'}}
Relationships:
Label: QUOTED_FOR {source: {neo4j_id: 0, label: Writer}, target: {neo4j_id: 1, label: WordOfTheDay}, properties: {}}
Label: QUOTED_IN {source: {neo4j_id: 0, label: Writer}, target: {neo4j_id: 2, label: Quote}, properties: {}}
The Cypher query used to extract the solution was the following:
Cypher query to extract the solution for question 59.
MATCH (w:Writer)-[:QUOTED_FOR]->(wod:WordOfTheDay {date: '[date1]'})
RETURN w.name AS writer_name
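The same lookup can be mimicked in plain Python over an exported snapshot of the graph above (a minimal sketch; the dictionary layout is hypothetical, not the KGoT export schema):

```python
# Hypothetical exported snapshot of the KG built for question 59.
nodes = {
    0: {"label": "Writer", "name": "[firstname lastname]"},
    1: {"label": "WordOfTheDay", "word": "[concept]", "date": "[date1]"},
    2: {"label": "Quote", "text": "[quote]", "date": "[date2]"},
}
relationships = [
    {"source": 0, "target": 1, "label": "QUOTED_FOR"},
    {"source": 0, "target": 2, "label": "QUOTED_IN"},
]

def writer_quoted_for(date):
    """Python analogue of the Cypher query: follow QUOTED_FOR edges into
    WordOfTheDay nodes with the requested date and return the writer's name."""
    for rel in relationships:
        if rel["label"] != "QUOTED_FOR":
            continue
        target = nodes[rel["target"]]
        if target["label"] == "WordOfTheDay" and target["date"] == date:
            return nodes[rel["source"]]["name"]
    return None

answer = writer_quoted_for("[date1]")
```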
To illustrate the use of NetworkX, we use a knowledge graph for question 106 (shown in Figure 9) from the GAIA benchmark after the second iteration.
NetworkX KG representation while processing question 106.
Existing Nodes:
Label: Function [{id: A1, properties: {'name': 'image_inspector'}}, {id: call_X2CcPnp5acMUPAp1Qx3OTvKx, properties: {'name': 'image_inspector', 'args': {'question': 'What Python script is depicted in the attached image?', 'full_path_to_image': '[filepath].png'}}}]
Label: Script [{id: A2, properties: {'description': 'Python script to construct a URL by combining a base URL with specific indices from an array'}}]
Label: Array [{id: A3, properties: {'content': "['URL', 'ele', 'me', 'nts', 'as', 'sho', 'rt', 'str', 'ings']"}}]
Label: URL [{id: A4, properties: {'base': '[base URL]', 'indices': [some indices]}}]
Existing Relationships:
Label: uses [{source: {id: A1}, target: {id: A2}, properties: {}}]
Label: contains [{source: {id: A2}, target: {id: A3}, properties: {}}]
Label: constructs [{source: {id: A2}, target: {id: A4}, properties: {}}]
Label: None [{source: {id: call_X2CcPnp5acMUPAp1Qx3OTvKx}, target: {id: A2}, properties: {}}]
The following Python code was used to extract the final solution:
Python code to extract the solution for question 106.
# Retrieve the base URL and the indices needed to construct the final URL
base_url = self.G.nodes['A4']['base']
indices = self.G.nodes['A4']['indices']
# Retrieve the array content
arr = eval(self.G.nodes['A3']['content'])
# Construct the URL using the specified indices
constructed_url = base_url + ''.join(arr[i] for i in indices)
# The next step would be to compile and run the C++ code from the constructed URL,
# but since we cannot execute external code, we will simulate the sorting and
# summing process in Python.
# Simulating the C++ code execution with the given array
sorted_arr = sorted([2, 15, 23, 37, 42, 88])
# Sum of the third and fifth integers in the sorted list
result = sorted_arr[2] + sorted_arr[4]
After the code execution, the correct solution of 65 is obtained.
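For reference, the arithmetic of this final step can be reproduced standalone (a minimal sketch; the literal array from the modified task statement replaces the graph access via self.G):

```python
# Literal (modified) array from the task statement.
values = [42, 23, 2, 88, 37, 15]

# Simulate the compiled C++ program: sort the array, then sum the third
# and fifth integers of the sorted list (0-based indices 2 and 4).
sorted_values = sorted(values)
result = sorted_values[2] + sorted_values[4]
```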
Appendix B Additional Details on System Design & Implementation
B.1 Controller
The Controller is the central orchestrator of the KGoT system, responsible for managing the interaction between the knowledge graph and the integrated tools. When a user submits a query, the Controller initiates the reasoning process by interpreting the task and coordinating the steps required for its resolution.
To offer fine-grained control over the KGoT control logic, the following parameters can be configured:
- num_next_steps_decision: Number of times to prompt an LLM on how to proceed (Solve/Enhance). Defaults to 5.
- max_retrieve_query_retry: Maximum retries for a Solve query when the initial attempt fails. Defaults to 3.
- max_cypher_fixing_retry: Maximum retries for fixing a Cypher query that encounters errors. Defaults to 3.
- max_final_solution_parsing: Maximum retries for parsing the final solution from the output of the Solve query. Defaults to 3.
- max_tool_retries: Maximum number of retries when a tool invocation fails. Defaults to 6.
Controller classes derived from the abstract ControllerInterface class embed these parameters with class-specific default values; users can also experiment with custom settings. We discuss how the choice of these parameters impacts system robustness in Appendix B.2.
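A hypothetical sketch of how such a parameter bundle could be modeled is given below; the class name and structure are illustrative, not the actual KGoT implementation, while the defaults follow the list above:

```python
from dataclasses import dataclass

@dataclass
class ControllerConfig:
    """Robustness-related knobs of the KGoT control logic."""
    num_next_steps_decision: int = 5     # votes collected for the Solve/Enhance decision
    max_retrieve_query_retry: int = 3    # retries of a failed Solve query
    max_cypher_fixing_retry: int = 3     # retries when fixing a broken Cypher query
    max_final_solution_parsing: int = 3  # retries when parsing the final solution
    max_tool_retries: int = 6            # retries of a failed tool invocation

# Defaults can be overridden per experiment.
config = ControllerConfig(max_tool_retries=10)
```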
B.1.1 Architecture
The KGoT Controller employs a dual-LLM architecture with a clear separation of roles between constructing the knowledge graph (managed by the LLM Graph Executor) and interacting with tools (managed by the LLM Tool Executor). The following discussion provides additional specifics to the workflow description in Section 4.
The LLM Graph Executor is responsible for decision making and orchestrating the knowledge graph-based task resolution workflow, leading to different pathways (Solve or Enhance).
- define_next_step: Determine the next step. This function is invoked up to num_next_steps_decision times to collect replies from an LLM, which are subsequently used with a majority vote to decide whether to retrieve information from the knowledge graph for solving the task (Solve) or insert new information (Enhance).
- _insert_logic: Run Enhance. Once we have successfully executed tool calls and gathered new information, the system generates the Enhance query or queries to modify the knowledge graph accordingly. Each Enhance query is executed and its output is validated.
- _retrieve_logic: Run Solve. If the majority vote directs the system to the Solve pathway, a predefined solution technique (direct or query-based retrieve) is used for the solution generation.
- _get_math_response: Apply additional mathematical processing (optional).
- parse_solution_with_llm: Parse the final solution into a suitable format and prepare it as the KGoT response.
The LLM Tool Executor decides which tools to use and handles the interaction with these tools.
- define_tool_calls: Define tool calls. The system orchestrates the appropriate tool calls based on the knowledge graph state.
- _invoke_tools_after_llm_response, _invoke_tool_with_retry: Run tool calls with or without retry.
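A minimal sketch of the retry behavior behind `_invoke_tool_with_retry` could look as follows. This is illustrative only: it assumes tools are plain callables and that `max_tool_retries` (Appendix B.1) bounds the number of attempts.

```python
def invoke_tool_with_retry(tool, args, max_tool_retries=6):
    """Call a tool, retrying on failure up to max_tool_retries attempts."""
    last_error = None
    for _ in range(max_tool_retries):
        try:
            return tool(**args)
        except Exception as exc:  # a failed tool call triggers another attempt
            last_error = exc
    raise RuntimeError(
        f"tool failed after {max_tool_retries} attempts") from last_error
```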
B.2 Enhancing System Robustness
Given the non-deterministic nature of LLMs and their potential for generating hallucinations (Kaddour et al., 2023), the robustness of KGoT has been a fundamental focus throughout its design and implementation. Ensuring that the system consistently delivers accurate and reliable results across various scenarios is paramount. One of the key strategies employed to enhance robustness is the use of majority voting, also known as Self-Consistency (Wang et al., 2023b). In KGoT, majority voting is implemented by querying the LLM multiple times (by default 5 times) when deciding the next step, whether to insert more data into the knowledge graph or retrieve existing data. This approach reduces the impact of single-instance errors or inconsistencies, ensuring that the decisions made reflect the LLM's most consistent reasoning paths.
The choice of defaulting to five iterations for majority voting is a strategic balance between reliability and cost management, and was based on the work by Wang et al. (2023b), which showed diminishing returns beyond this point.
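The voting step itself can be sketched as follows, assuming each LLM reply has already been parsed down to a Solve/Enhance label (the actual prompts and parsing are described in Appendix C):

```python
from collections import Counter

def majority_vote(decisions):
    """Pick the most common decision among repeated LLM replies."""
    winner, _count = Counter(decisions).most_common(1)[0]
    return winner

# Five sampled replies; a single outlier is outvoted.
decision = majority_vote(["Enhance", "Enhance", "Solve", "Enhance", "Enhance"])
```

With five samples, up to two inconsistent replies can be outvoted before the decision flips, which is the reliability/cost trade-off discussed above.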
In addition, KGoT uses a separate default iteration count of seven for executing its full range of functions during problem-solving. These seven iterations correspond to the typical number of tool calls required to thoroughly explore the problem space, including multiple interactions with tools such as the Surfer agent and the external LLM. Unlike the five iterations used for majority voting, which ensure robustness, this strategy ensures the system leverages its resources effectively across multiple tool invocations before concluding with a "No Solution" response if the problem remains unresolved.
Layered Error-Checking: KGoT integrates multiple error-checking mechanisms to safeguard against potential issues. The system continuously monitors for syntax errors and failures in API calls. These mechanisms are complemented by custom parsers and retry protocols. The parsers, customized from LangChain (LangChain Inc., 2025d), are designed to extract the required information from the LLM's responses, eliminating the need for manual parsing. In cases where errors persist despite initial correction attempts, the system employs retry mechanisms. These involve the LLM rephrasing the Cypher queries and trying them again. The Controller's design includes a limit on the number of retries for generating Cypher queries and invoking tools, balancing the need for error resolution with the practical constraints of time and computational resources. More information can be found in the subsequent section.
B.3 Error Management Techniques
B.3.1 Handling LLM-Generated Syntax Errors
Syntax errors generated by LLMs can disrupt the workflow of KGoT, potentially leading to incorrect or incomplete solutions, or even causing the system to fail entirely. To manage these errors, KGoT includes LangChain's JSON parsers (LangChain Inc., 2025d) that detect syntax issues.
When a syntax error is detected, the system first attempts to correct it by adjusting the problematic syntax using different encoders, such as "unicode_escape" (Python Software Foundation, 2025a). If the issue persists, KGoT employs a retry mechanism that uses the LLM to rephrase the query/command and attempts to regenerate its output. This retry mechanism is designed to handle up to three attempts, after which the system logs the error for further analysis, bypasses the problematic query, and continues with other iterations in the hope that another tool or LLM call will still be able to resolve the problem.
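The encoder-based correction step can be approximated as follows. This is a simplified stand-in for the LangChain parsers: the `unicode_escape` fallback mirrors the approach described above, and returning `None` stands in for logging and skipping the problematic query.

```python
import json

def parse_with_fallback(raw, max_attempts=3):
    """Try to parse an LLM reply as JSON, re-decoding escapes between attempts."""
    text = raw
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            # Attempt to repair over-escaped output (e.g. \" instead of ")
            # before the next parse attempt.
            text = text.encode().decode("unicode_escape")
    return None  # in KGoT, the error is logged and the query bypassed
```

For example, an over-escaped reply such as `{\"q\": \"v\"}` fails the first parse but is repaired by the `unicode_escape` pass and succeeds on the second attempt.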
A significant issue encountered with LLM-generated responses is managing the escape characters, especially when returning a Cypher query inside the standard JSON structure expected by the LangChain parser. The combination of retries using different encoders and parsers has mitigated the problem, though not entirely resolved it. Manual parsing and the use of regular expressions have also been attempted but with limited success.
B.3.2 Managing API and System Errors
API-related errors, such as the OpenAI code "500" errors, are a common challenge in the operation of KGoT, especially when the external servers are overwhelmed. To manage these errors, the primary strategy employed is exponential backoff, which is a technique where the system waits for progressively longer intervals before retrying a failed API call, reducing the likelihood of repeated failures due to temporary server issues or rate limits (Tenacity Developers, 2025b). In KGoT, this approach is implemented using the tenacity library, with a retry policy that waits for random intervals ranging from 1 to 60 seconds and allows for up to six retry attempts (wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6)).
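A stdlib-only approximation of this policy is sketched below. The actual implementation uses tenacity's `wait_random_exponential(min=1, max=60)` with `stop_after_attempt(6)`; this sketch mimics that behavior without the dependency, so the exact wait distribution differs from tenacity's.

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, min_wait=1.0, max_wait=60.0):
    """Retry fn(), waiting a random exponentially growing interval between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Random wait with a cap: uniform in [0, min(max_wait, min_wait * 2**attempt)]
            time.sleep(random.uniform(0, min(max_wait, min_wait * 2 ** attempt)))
```

A transient "500" then makes only the one failed call wait; a persistently failing endpoint exhausts all six attempts and re-raises the last error to the caller.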
Additionally, KGoT includes comprehensive logging systems as part of its error management framework. These systems track the errors encountered during system operation, providing valuable data that can be easily parsed and analyzed (e.g., snapshots of the knowledge graphs or responses from third-party APIs). This data can then be used to refine the system's error-handling protocols and improve overall reliability.
It is also important to note that the system's error management strategies are built on top of existing error-handling mechanisms provided by external tools, such as the LangChain interface for OpenAI, which already implements a default exponential backoff strategy with up to six retries (LangChain Inc., 2025b). These built-in mechanisms complement KGoT's own error-handling strategies, creating a multi-layered defense against potential failures and ensuring high levels of system reliability.
B.4 Detailed Tool Description
Tools are a fundamental component of the KGoT framework, enabling seamless interaction with external resources such as the web and various file formats. KGoT currently supports the following tools:
- Python Code Tool: Executes code snippets provided by the LLM in a secure Python environment hosted within a Docker (or Sarus) container. This ensures that any potential security risks from executing untrusted code are mitigated. Besides running code, this tool is also utilized for mathematical computations.
- Large Language Model (LLM) Tool: Allows the LLM Tool Executor to request data generation from another instance of the same LLM. It is primarily employed for simple, objective tasks where no other tool is applicable.
- Surfer Agent: This web browser agent leverages SerpAPI to perform efficient Google searches and extract relevant webpage data. Built on Hugging Face Agents (Roucher & Petrov, 2025), this tool combines its capabilities with our WebCrawler and Wikipedia tools while adding support for JavaScript-rendered pages. It uses viewport segmentation to prevent the "lost in the middle" effect and incorporates additional navigation functionalities, such as search and page traversal.
- ExtractZip Tool: Extracts data from compressed files (e.g., ZIP archives). It was enhanced through integration with the TextInspector Tool, enabling seamless analysis of extracted files without requiring additional iterations to process the data.
- TextInspector Tool: A versatile tool for extracting data from multiple file types, including PDFs, spreadsheets, MP3s, and YouTube videos. It organizes extracted content in Markdown format, enhancing readability and integration into the Knowledge Graph. The tool was augmented with the best components from our original MultiModal Tool and the Hugging Face Agents TextInspector Tool. It can directly process questions about extracted content without returning the raw data to the LLM.
- Image Tool: Extracts information from images, such as text or objects, and returns it in a structured format. This tool is crucial for tasks requiring image processing and analysis. We selected the best prompts from our original tool set as well as Hugging Face Agents to optimize data extraction and analysis.
Tool integration within the KGoT framework is crucial for extending the systemâs problem-solving capabilities beyond what is achievable by LLMs alone. The strategy is designed to be modular, scalable, and efficient, enabling the system to leverage a diverse array of external tools for tasks such as data retrieval, complex computations, document processing, and more.
B.4.1 Modular Tool Architecture
All tools integrated into the KGoT system are built upon the BaseTool abstraction provided by the LangChain framework (LangChain Inc., 2025c). This standardized approach ensures consistency and interoperability among different tools, facilitating seamless integration and management of new tools. Each tool implementation adheres to the following structure:
- tool_name: A unique identifier for the tool, used by the system to reference and invoke the appropriate functionality.
- description: A detailed explanation of the tool's purpose, capabilities, and appropriate usage scenarios. This description assists the LLM Tool Executor in selecting the right tool for specific tasks. Including few-shot examples is recommended, though the description must adhere to the 1024-character limit imposed by BaseTool.
- args_schema: A schema defining the expected input arguments for the tool, including their types and descriptions. This schema ensures that the LLM Tool Executor provides correctly formatted and valid inputs when invoking the tool.
This structured definition enables the LLM Tool Executor to dynamically understand and interact with a wide array of tools, promoting flexibility and extensibility within the KGoT system.
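The three fields above can be illustrated with a simplified stand-in. Note that this dataclass is not the actual LangChain BaseTool class; it only mirrors the documented structure, including the 1024-character description limit.

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Simplified stand-in for the BaseTool fields described above."""
    tool_name: str
    description: str   # must respect BaseTool's 1024-character limit
    args_schema: dict  # argument name -> (type, human-readable description)

    def __post_init__(self):
        if len(self.description) > 1024:
            raise ValueError("description exceeds BaseTool's 1024-character limit")

spec = ToolSpec(
    tool_name="run_python_tool",
    description="Executes a Python snippet in a sandboxed container and returns stdout.",
    args_schema={"code": (str, "The Python source code to execute")},
)
```

In the real system, the LLM Tool Executor reads the description and args_schema to decide when to invoke the tool and how to format its arguments.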
B.4.2 Tool Management and Initialization
The ToolManager component is responsible for initializing and maintaining the suite of tools available to the KGoT system. It handles tasks such as loading tool configurations, setting up necessary environment variables (e.g., API keys), and conducting initial tests to verify tool readiness, such as checking whether the RunPythonCodeTool's Docker container is running. The ToolManager ensures that all tools are properly configured and available for use during the system's operation.
Simplified example of ToolManager initialization.
```python
class ToolManager:
    def __init__(self):
        self.set_env_keys()
        self.tools = [
            LLM_tool(...),
            image_question_tool(...),
            textInspectorTool(...),
            search_tool(...),
            run_python_tool(...),
            extract_zip_tool(...),
            # Additional tools can be added here
        ]
        self.test_tools()

    def get_tools(self):
        return self.tools
```
This modular setup allows for the easy addition or removal of tools, enabling the system to adapt to evolving requirements and incorporate new functionalities as needed.
B.4.3 Information Parsing and Validation
After a tool executes and returns its output, the retrieved information undergoes a parsing and validation process by the LLM Graph Executor before being integrated into the knowledge graph. This process ensures the integrity and relevance of new data:
- Relevance Verification: The content of the retrieved information is assessed for relevance to the original problem context. This step may involve cross-referencing with existing knowledge, checking for logical consistency, and filtering out extraneous or irrelevant details. The LLM Graph Executor handles this during Cypher query generation.
- Integration into Knowledge Graph: Validated and appropriately formatted information is then seamlessly integrated into the knowledge graph by executing each Cypher query (with the required error management as described in Section B.3.1), enriching the system's understanding and enabling more informed reasoning in future iterations.
B.4.4 Benefits
This structured and systematic approach to tool integration and selection offers several key benefits:
- Enhanced Capability: By leveraging specialized tools, KGoT can handle a wide range of complex tasks that go beyond the inherent capabilities of LLMs, providing more comprehensive and accurate solutions.
- Scalability: The modular architecture allows for easy expansion of the tool set, enabling the system to adapt to new domains and problem types with minimal reconfiguration.
- Flexibility: The system's ability to adaptively select and coordinate multiple tools in response to dynamic problem contexts ensures robust and versatile problem-solving capabilities.
B.5 High-Performance & Scalability
As previously discussed, we also experimented with various high-performance computing techniques to accelerate KGoT. This section outlines additional design details.
The acceleration strategies can be classified into two categories: those targeting the speedup of a single task, and those aimed at accelerating the execution of KGoT on a batch of tasks such as the GAIA benchmark.
Optimizations in the first category are:
- Asynchronous Execution: Profiling of the KGoT workflow reveals that a substantial portion of runtime is spent on LLM model calls and tool invocations. As this represents a typical I/O-intensive workload, Python multi-threading is sufficient to address the bottleneck. KGoT dynamically schedules independent I/O operations (based on the current graph state and execution logic) using asyncio to achieve full concurrency.
- Graph Operation Parallelism: KGoT maintains a graph storage backend for managing the knowledge graph. When new knowledge is obtained from the tools, KGoT generates a list of queries, which represent a sequence of graph operations to add or modify nodes, properties, and edges. However, executing these operations sequentially in the graph storage backend can be time-consuming. A key observation is that many of these operations exhibit potential independence. We leveraged this potential parallelism to accelerate these graph storage operations. Our solution involves having KGoT request an LLM to analyze dependencies within the operations and return multiple independent chains of graph storage operations. These chains are then executed concurrently using the asynchronous method proposed earlier, enabling parallel execution of queries on the graph storage. This approach effectively harnesses the inherent parallelism to significantly improve processing speed.
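The chain-level parallelism described above can be sketched with asyncio. This is illustrative only: `execute_query` stands in for the real asynchronous call to the graph storage backend, and the dependency analysis that produced the chains is assumed to have already happened.

```python
import asyncio

async def execute_query(query):
    # Stand-in for an asynchronous call to the graph storage backend.
    await asyncio.sleep(0)
    return f"OK: {query}"

async def run_chain(chain):
    # Queries within a chain depend on each other, so they run in order.
    return [await execute_query(q) for q in chain]

async def run_all(chains):
    # Independent chains run concurrently via asyncio.gather.
    return await asyncio.gather(*(run_chain(c) for c in chains))

# Two hypothetical independent chains returned by the dependency analysis.
chains = [
    ["CREATE (:Paper {title: 't'})", "MATCH (p:Paper) SET p.year = 2024"],
    ["CREATE (:Author {name: 'a'})"],
]
results = asyncio.run(run_all(chains))
```

Because each chain awaits its own queries in sequence while the chains themselves are gathered concurrently, ordering constraints within a chain are preserved while independent work overlaps.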
The applied optimizations result in an overall speedup of 2.30× compared to the sequential baseline for a single KGoT task.
The second category focuses on accelerating a batch of tasks, for which MPI-based distributed processing is employed. Additional optimizations have also been implemented to further enhance performance.
- Work Stealing: The work-stealing algorithm operates by allowing idle processors to "steal" tasks from the queues of busy processors, ensuring balanced workload distribution. Each processor maintains its task queue, prioritizing local execution, while stealing occurs only when its queue is empty. This approach reduces idle time and enhances parallel efficiency. Our implementation of the work-stealing algorithm for KGoT adopts a novel approach tailored for distributed atomic task execution in an MPI environment. Each question is treated as an atomic task, initially distributed evenly across all ranks to ensure balanced workload allocation. When a rank completes all its assigned tasks, it enters a work-stealing phase, prioritizing the rank with the largest queue of remaining tasks. Operating in a peer-to-peer mode without a designated master rank, each rank maintains a work-stealing monitor to handle task redistribution. This monitor tracks incoming requests and facilitates the transfer of the last available task to the requesting rank whenever feasible. The system ensures continuous work-stealing, dynamically redistributing tasks to idle ranks, thus minimizing idle time and maximizing computational efficiency across all ranks. This decentralized and adaptive strategy significantly enhances the parallel processing capabilities of KGoT.
- Container Pool: The container pool implementation for KGoT ensures modular and independent execution of each task on separate ranks by running essential modules, such as Neo4j and the Python tool, within isolated containers, with one container assigned per rank. We use a Kubernetes-like container orchestration tool specifically designed for KGoT running with MPI. The container pool supports Docker and Sarus to be compatible with local and cluster environments. Our design guarantees that each task operates independently without interfering with the others, while trying to minimize latency between the KGoT controller and the containers.
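The victim-selection rule of the work-stealing scheme, where an idle rank steals the last task from the rank with the largest remaining queue, can be sketched without the MPI machinery (an illustrative sketch; the real monitor exchanges messages between ranks):

```python
from collections import deque

def steal_task(queues, idle_rank):
    """Return a task taken from the busiest other rank, or None if none remain."""
    candidates = [r for r in queues if r != idle_rank and queues[r]]
    if not candidates:
        return None
    victim = max(candidates, key=lambda r: len(queues[r]))
    # Transfer the last available task, mirroring the monitor's behavior.
    return queues[victim].pop()

# Rank 0 is idle; rank 1 holds the largest queue and donates its last task.
queues = {0: deque(), 1: deque(["q1", "q2", "q3"]), 2: deque(["q4"])}
task = steal_task(queues, idle_rank=0)
```

In the actual peer-to-peer implementation, the idle rank sends a steal request and the victim's monitor pops and transfers the task, but the selection logic is the same.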
Ultimately, our experiments achieved a 12.74× speedup over the sequential baseline on the GAIA benchmark when executed with 8 ranks in MPI, as illustrated in Figure 10. This demonstrates the significant performance improvement of the KGoT system achieved on a consumer-grade platform.
Figure 10: Measured parallel speedup of KGoT task execution across varying numbers of MPI processes, under two scheduling strategies: with and without work stealing. Each task corresponds to a GAIA benchmark question, and each data point represents the average of 2 measurements on an Apple M3 Pro (12 cores @ 4.056 GHz) with 18 GB of memory. The dashed grey line indicates the expected theoretical speedup curve ($S = 2.2985 \times p$) based on the asynchronous optimizations applied to individual tasks. As previously discussed, acceleration strategies are categorized into (1) single-task optimizations, including asynchronous I/O scheduling and graph operation parallelism, and (2) batch-level parallelism using MPI-based distributed processing. The work-stealing variant consistently outperforms the non-stealing baseline by minimizing idle time and dynamically redistributing atomic question tasks across ranks. These combined strategies result in a 12.74× speedup over the sequential baseline when using 8 processes.
B.6 Examples of Noise Mitigation
We illustrate two examples of experiments with noise mitigation in KGoT. As before, we have replaced the specific values with placeholders to prevent the leakage of the GAIA benchmark tasks.
B.6.1 Irrelevance Removal
The first example is based on question 146 in the validation set of the GAIA benchmark:
On [date], an article by [author] was published in [publication]. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by [researcher] supported by?
The example KG has been populated with data directly related to the answer as well as information that is relevant to the question but not necessary for answering it. Removing this extraneous data makes it easier for KGoT to reason about the KG content and extract data relevant to the answer. The data to be removed is marked in red.
Question 146: Initial state of the knowledge graph.
```
Nodes:
  Label: Funding     {neo4j_id: 0,  properties: {'award_number': '[award_number]'}}
  Label: Researcher  {neo4j_id: 13, properties: {'name': '[researcher]'}}
  Label: Article     {neo4j_id: 11, properties: {'author': '[author]', 'title': '[title]', 'source': '[publication]', 'publication_date': '[date]'}}
  Label: Paper       {neo4j_id: 12, properties: {'title': '[paper]'}}
Relationships:
  Label: SUPPORTED_BY {source: {neo4j_id: 13, label: Researcher}, target: {neo4j_id: 0,  label: Funding},    properties: {}}
  Label: LINKED_TO    {source: {neo4j_id: 11, label: Article},    target: {neo4j_id: 12, label: Paper},      properties: {}}
  Label: INVOLVES     {source: {neo4j_id: 12, label: Paper},      target: {neo4j_id: 13, label: Researcher}, properties: {}}
```
Question 146: Denoised knowledge graph.
```
Nodes:
  Label: Funding     {neo4j_id: 0,  properties: {'award_number': '[award_number]'}}
  Label: Researcher  {neo4j_id: 13, properties: {'name': '[researcher]'}}
Relationships:
  Label: SUPPORTED_BY {source: {neo4j_id: 13, label: Researcher}, target: {neo4j_id: 0, label: Funding}, properties: {}}
```
B.6.2 Duplicate Removal
The second example is based on question 25 in the validation set of the GAIA benchmark:
I need to fact-check a citation. This is the citation from the bibliography: [citation1] And this is the in-line citation: Our relationship with the authors of the works we read can often be "[quote]" ([citation2]). Does the quoted text match what is actually in the article? If Yes, answer Yes, otherwise, give me the word in my citation that does not match with the correct one (without any article).
In the example, the knowledge graph has been populated by two nearly identical nodes. The nodes and relationships marked for removal are shown in red.
Question 25: Initial state of the knowledge graph.
```
Nodes:
  Label: Quote   {neo4j_id: 22, properties: {'text': '[quote]'}}
                 {neo4j_id: 0,  properties: {'text': '[near_identical_quote]'}}
  Label: Article {neo4j_id: 3,  properties: {'journal': '[journal]', 'page_start': [page_start], 'author': '[author]', 'page_end': [page_end], 'title': '[title]', 'issue': [issue], 'volume': [volume], 'year': [year], 'doi': '[doi]'}}
Relationships:
  Label: CONTAINS {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 22, label: Quote}, properties: {}}
                  {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 0,  label: Quote}, properties: {}}
```
Question 25: Denoised knowledge graph.
```
Nodes:
  Label: Quote   {neo4j_id: 22, properties: {'text': '[quote]'}}
  Label: Article {neo4j_id: 3,  properties: {'journal': '[journal]', 'page_start': [page_start], 'author': '[author]', 'page_end': [page_end], 'title': '[title]', 'issue': [issue], 'volume': [volume], 'year': [year], 'doi': '[doi]'}}
Relationships:
  Label: CONTAINS {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 22, label: Quote}, properties: {}}
```
Appendix C Additional Details on Prompt Engineering
The primary objectives in our prompt design include improving decision-making processes, effectively managing complex scenarios, and allowing the LLM to adapt to diverse problem domains while maintaining high accuracy and efficiency. To achieve this, we leverage prompt engineering techniques, particularly the use of generic few-shot examples embedded in prompt templates. These examples guide the LLM in following instructions step by step (chain-of-thought) and reducing errors in generating graph queries with complex syntax.
C.1 Prompt for Majority Voting
At the beginning of each iteration, the LLM Graph Executor uses the following prompt to decide whether the task can be solved with the current KG or if more information is needed. For system robustness, it is run multiple times with varying reasoning paths, and a majority vote (Self-Consistency) is applied to the responses. The prompt also explicitly instructs the model to decide on either the Solve or the Enhance pathway. By requiring the model to output an indicator (query_type = "RETRIEVE" or "INSERT"), we can programmatically branch the workflow, allowing controlled reasoning pathways.
Graph Executor: Determine the next step
```
<task>
You are a problem solver using a Neo4j database as a knowledge graph to solve a given problem. Note that the database may be incomplete.
</task>
<instructions>
Understand the initial problem, the initial problem nuances, *ALL the existing data* in the database and the tools already called. Can you solve the initial problem using the existing data in the database?
- If you can solve the initial problem with the existing data currently in the database return the final answer and set the query_type to RETRIEVE. Retrieve only if the data is sufficient to solve the problem in a zero-shot manner.
- If the existing data is insufficient to solve the problem, return why you could not solve the initial problem and what is missing for you to solve it, and set query_type to INSERT.
- Remember that if you don't have ALL the information requested, but only partial (e.g. there are still some calculations needed), you should continue to INSERT more data.
</instructions>
<examples>
  <examples_retrieve> <!-- In-context few-shot examples --> </examples_retrieve>
  <examples_insert> <!-- In-context few-shot examples --> </examples_insert>
</examples>
<initial_problem> {initial_query} </initial_problem>
<existing_data> {existing_entities_and_relationships} </existing_data>
<tool_calls_made> {tool_calls_made} </tool_calls_made>
```
C.2 Prompts for Enhance Pathway
If the majority vote deems the current knowledge base insufficient, the system enters the Enhance pathway. To identify the knowledge gap, the LLM Graph Executor synthesizes the list of reasons why the task is not solvable, and of what information is missing, into a single, consistent description.
Graph Executor: Identify missing information
```
<task>
You are a logic expert, your task is to determine why a given problem cannot be solved using the existing data in a Neo4j database.
</task>
<instructions>
You are provided with a list of reasons. Your job is to combine these reasons into a single, coherent paragraph, ensuring that there are no duplicates.
- Carefully review and understand each reason provided.
- Synthesize the reasons into one unified text.
</instructions>
<list_of_reasons> {list_of_reasons} </list_of_reasons>
```
By providing both the current graph state and the identified missing information, the LLM Tool Executor defines context-aware tool calls to bridge the knowledge gap identified by the LLM Graph Executor.
Tool Executor: Define tool calls
```
<task>
You are an information retriever tasked with populating a Neo4j database with the necessary information to solve the given initial problem.
</task>
<instructions>
<!-- In-context few-shot examples covering the following aspects:
1. **Understand Requirements**
2. **Gather Information**
3. **Detailed Usage**
4. **Utilize Existing Data**
5. **Avoid Redundant Calls**
6. **Ensure Uniqueness of Tool Calls**
7. **Default Tool**
8. **Do Not Hallucinate**
-->
</instructions>
<initial_problem> {initial_query} </initial_problem>
<existing_data> {existing_entities_and_relationships} </existing_data>
<missing_information> {missing_information} </missing_information>
<tool_calls_made> {tool_calls_made} </tool_calls_made>
```
Afterwards, specialized tools such as a web browser or code executor are invoked to retrieve data from external resources. The newly acquired information is then used to enhance the KG. The LLM Graph Executor is asked to analyze the retrieved information in the context of the initial user query and the current state of the KG. The following prompt is carefully designed to guide the LLM to generate semantically correct and context-aware Cypher queries with concrete examples.
Graph Executor: Create Cypher for data ingestion
<task> You are a problem solver tasked with updating an incomplete Neo4j database used as a knowledge graph. You have just acquired new information that needs to be integrated into the database. </task> <instructions> <!-- In-context few-shot examples covering the following aspects: 0. **Understand the Context** 1. **Use Provided New Information Only** 2. **No Calculations** 3. **Avoid Duplicates** 4. **Combine Operations with WITH Clauses** 5. **Group Related Queries** 6. **Omit RETURN Statements** 7. **Omit ID Usage** 8. **Merge Existing Nodes** 9. **Correct Syntax and Semantics** 10. **Use Correct Relationships** 11. **Escape Characters** --> </instructions> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <missing_information> {missing_information} </missing_information> <new_information> {new_information} </new_information>
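The "Avoid Duplicates" and "Merge Existing Nodes" rules above correspond to MERGE semantics in Cypher: a node or relationship is created only if it does not already exist. The following minimal sketch illustrates that behavior with a stdlib stand-in for the graph backend; `MockGraph`, `merge_node`, and `merge_edge` are illustrative names, not part of the KGoT implementation.

```python
# Illustrative MERGE semantics: repeated ingestion rounds (as produced by
# successive Enhance Pathway tool calls) must not duplicate graph content.

class MockGraph:
    """Stand-in for a Neo4j session; nodes are keyed by (label, name)."""
    def __init__(self):
        self.nodes = {}          # (label, name) -> properties
        self.edges = set()       # (src_key, rel_type, dst_key)

    def merge_node(self, label, name, **props):
        key = (label, name)
        self.nodes.setdefault(key, {}).update(props)  # create-or-update
        return key

    def merge_edge(self, src, rel, dst):
        self.edges.add((src, rel, dst))               # set() deduplicates

g = MockGraph()
# Two ingestion rounds containing overlapping information.
for _ in range(2):
    a = g.merge_node("Author", "J.K. Rowling", author_id="A1")
    b = g.merge_node("Book", "Harry Potter and the Philosopher's Stone",
                     book_id="B1")
    g.merge_edge(a, "WROTE", b)

print(len(g.nodes), len(g.edges))  # duplicates are absorbed: 2 1
```

With a real Neo4j backend, the same effect is achieved by emitting `MERGE` rather than `CREATE` clauses in the generated Cypher.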
C.3 Prompts for Solve Pathway
If majority voting confirms that the KG is sufficiently populated or the maximum iteration count has been reached, the system proceeds to the Solve Pathway. The iteratively refined KG serves as a reliable information source for LLMs to solve the initial query. To provide a robust response, we introduce two approaches for knowledge extraction: a query-based approach and Direct Retrieval.
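The pathway decision can be sketched as a simple majority vote over repeated LLM samples of the sufficiency check; this is an assumed control flow for illustration, not KGoT's exact code.

```python
# Sample the "is the KG sufficient?" decision several times and let the
# majority decide between the Enhance and the Solve Pathway.
from collections import Counter

def majority_vote(votes):
    """Return the most common vote, e.g. 'solve' or 'enhance'."""
    return Counter(votes).most_common(1)[0][0]

# Example: three LLM samples of the sufficiency decision.
votes = ["solve", "enhance", "solve"]
print(majority_vote(votes))  # solve
```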
C.3.1 Graph Query Language for Knowledge Extraction
The query-based approach formulates a read query using an LLM, given the entire graph state and other relevant information such as the initial problem. The LLM-generated query is then executed on the graph database to return the final solution. Note that KGoT iteratively executes the solve operations collected from the majority voting.
In-context few-shot examples for query-based knowledge extraction
<examples_retrieve> <example_retrieve_1> Initial problem: Retrieve all books written by "J.K. Rowling". Existing entities: Author: [{name: "J.K. Rowling", author_id: "A1"}, {name: "George R.R. Martin", author_id: "A2"}], Book: [{title: "Harry Potter and the Philosopher's Stone", book_id: "B1"}, {title: "Harry Potter and the Chamber of Secrets", book_id: "B2"}, {title: "A Game of Thrones", book_id: "B3"}] Existing relationships: (A1)-[:WROTE]->(B1), (A1)-[:WROTE]->(B2), (A2)-[:WROTE]->(B3) Solution: query: "MATCH (a:Author {name: 'J.K. Rowling'})-[:WROTE]->(b:Book) RETURN b.title AS book_title" query_type: RETRIEVE </example_retrieve_1> <example_retrieve_2> Initial problem: List all colleagues of "Bob". Existing entities: Employee: [{name: "Alice", employee_id: "E1"}, {name: "Bob", employee_id: "E2"}, {name: "Charlie", employee_id: "E3"}], Department: [{name: "HR", department_id: "D1"}, {name: "Engineering", department_id: "D2"}] Existing relationships: (E1)-[:WORKS_IN]->(D1), (E2)-[:WORKS_IN]->(D1), (E3)-[:WORKS_IN]->(D2) Solution: query: "MATCH (e:Employee {name: 'Bob'})-[:WORKS_IN]->(d:Department)<-[:WORKS_IN]-(colleague:Employee) WHERE colleague.name <> 'Bob' RETURN colleague.name AS colleague_name" query_type: RETRIEVE </example_retrieve_2> </examples_retrieve>
If the attempt to fix a previously generated query fails, or the query did not return any results, KGoT tries to regenerate the query from scratch by providing the initial problem statement, the existing data, and additionally the incorrect query.
Graph Executor: Regeneration of Cypher query for data retrieval
<task> You are a problem solver expert in using a Neo4j database as a knowledge graph. Your task is to solve a given problem by generating a correct Cypher query. You will be provided with the initial problem, existing data in the database, and a previous incorrect Cypher query that returned an empty result. Your goal is to create a new Cypher query that returns the correct results. </task> <instructions>
1. Understand the initial problem, the problem nuances and the existing data in the database.
2. Analyze the provided incorrect query to identify why it returned an empty result.
3. Write a new Cypher query to retrieve the necessary data from the database to solve the initial problem. You can use ALL Cypher/Neo4j functionalities.
4. Ensure the new query is accurate and follows correct Cypher syntax and semantics. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <wrong_query> {wrong_query} </wrong_query>
C.3.2 Direct Retrieval for Knowledge Extraction
Direct Retrieval refers to directly asking the LLM to formulate the final solution, given the entire graph state, without executing any LLM-generated read queries on the graph storage.
In-context few-shot examples for DR-based knowledge extraction
<examples_retrieve> <example_retrieve_1> Initial problem: Retrieve all books written by "J.K. Rowling". Existing entities: Author: [{name: "J.K. Rowling", author_id: "A1"}, {name: "George R.R. Martin", author_id: "A2"}], Book: [{title: "Harry Potter and the Philosopher's Stone", book_id: "B1"}, {title: "Harry Potter and the Chamber of Secrets", book_id: "B2"}, {title: "A Game of Thrones", book_id: "B3"}] Existing relationships: (A1)-[:WROTE]->(B1), (A1)-[:WROTE]->(B2), (A2)-[:WROTE]->(B3) Solution: query: "Harry Potter and the Philosopher's Stone, Harry Potter and the Chamber of Secrets" query_type: RETRIEVE </example_retrieve_1> <example_retrieve_2> Initial problem: List all colleagues of "Bob". Existing entities: Employee: [{name: "Alice", employee_id: "E1"}, {name: "Bob", employee_id: "E2"}, {name: "Charlie", employee_id: "E3"}], Department: [{name: "HR", department_id: "D1"}, {name: "Engineering", department_id: "D2"}] Existing relationships: (E1)-[:WORKS_IN]->(D1), (E2)-[:WORKS_IN]->(D1), (E3)-[:WORKS_IN]->(D2) Solution: query: "Alice" query_type: RETRIEVE </example_retrieve_2> </examples_retrieve>
C.3.3 Formatting Final Solution
After successful knowledge extraction from the KG, we obtain a partial answer to our initial query. Next, we examine whether further post-processing, such as intermediate calculation or formatting, needs to be performed. With the following prompt, we first detect whether any unresolved calculation is required.
Solution formatting: Examine need for mathematical processing
<task> You are an expert in identifying the need for mathematical or probabilistic calculations in problem-solving scenarios. Given an initial query and a partial solution, your task is to determine whether the partial solution requires further mathematical or probabilistic calculations to arrive at a complete solution. You will return a boolean value: True if additional calculations are needed and False if they are not. </task> <instructions>
• Analyze the initial query and the provided partial solution.
• Identify any elements in the query and partial solution that suggest the further need for numerical analysis, calculations, or probabilistic reasoning.
• Consider if the partial solution includes all necessary numerical results or if there are unresolved numerical aspects.
• Return true if the completion of the solution requires more calculations, otherwise return false.
• Focus on the necessity for calculations rather than the nature of the math or probability involved. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <partial_solution> {partial_solution} </partial_solution>
If any further mathematical processing is needed, the Python Code Tool is invoked to refine the current partial solution by executing an LLM-generated Python script. This ensures accuracy by leveraging the strength of LLMs in scripting. Moreover, it effectively avoids hallucinations by grounding outputs through verifiable and deterministic code computation.
Solution formatting: Apply additional mathematical processing
<task> You are a math and Python expert tasked with solving a mathematical problem. </task> <instructions> To complete this task, follow these steps: 1. **Understand the Problem**:
• Carefully read and understand the initial problem and the partial solution.
• Elaborate on any mathematical calculations from the partial solution that are required to solve the initial problem.
2. **Perform Calculations**:
• Use the run_python_code Tool to perform any necessary mathematical calculations.
• Craft Python code that accurately calculates the required values based on the partial solution and the initial problem.
• Remember to add print statements to display the reasoning behind the calculations.
• **ALWAYS** add a print statement for the final answer.
3. **Do Not Hallucinate**:
• **Do not invent information** that is not provided in the initial problem or the partial solution.
• **Do not perform calculations manually**; use the run_python_code Tool for all mathematical operations. </instructions> <initial_problem> {initial_query} </initial_problem> <partial_solution> {current_solution} </partial_solution>
To produce a single, consistent answer and format the final solution to the initial user query, we guide the LLM with a dedicated prompt.
Solution formatting: Parse the final solution
<task> You are a formatter and extractor. Your task is to combine the partial solution from a database and format it according to the initial problem statement. </task> <instructions>
1. Understand the initial problem, the problem nuances, the desired output, and the desired output format.
2. Review the provided partial solution.
3. Integrate and elaborate on the various pieces of information from the partial solution to produce a complete solution to the initial problem. Do not invent any new information.
4. Your final answer should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
5. ADDITIONALLY, your final answer MUST adhere to any formatting instructions specified in the original question (e.g., alphabetization, sequencing, units, rounding, decimal places, etc.).
6. If you are asked for a number, express it numerically (i.e., with digits rather than words), don't use commas, do not round the number unless directly specified, and DO NOT INCLUDE UNITS such as $ or USD or percent signs unless specified otherwise.
7. If you are asked for a string, don't use articles or abbreviations (e.g. for cities), unless specified otherwise. Don't output any final sentence punctuation such as ".", "!", or "?".
8. If you are asked for a comma separated list, apply the above rules depending on whether the elements are numbers or strings. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <given_partial_solution> {partial_solution} </given_partial_solution>
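The numeric and string rules above mirror GAIA's exact-match scoring. A minimal normalization sketch, assuming the answer arrives as a single string (our illustration, not KGoT's parser):

```python
# Normalize a final answer: numbers become plain digits without thousands
# separators or currency symbols; strings lose final sentence punctuation.
def normalize_answer(answer: str) -> str:
    text = answer.strip()
    numeric = text.replace(",", "").lstrip("$")
    try:
        value = float(numeric)
        # Plain digits, no commas, no units; integers without a trailing ".0".
        return str(int(value)) if value.is_integer() else str(value)
    except ValueError:
        return text.rstrip(".!?")  # string answer: drop final punctuation

print(normalize_answer("$1,234"))  # 1234
print(normalize_answer("Paris."))  # Paris
```

A full implementation would also handle comma-separated lists by applying these rules element-wise, per rule 8.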
C.4 Prompt for LLM-Generated Syntax Error
To handle LLM-generated syntax errors, a retry mechanism uses the LLM to reformulate the graph query or code snippet, guided by specialized prompts tailored to the execution context. For Python code, the prompt guides the model to fix the code and update dependencies if needed, ensuring successful execution.
Error handling: Fix invalid Python code
<task> You are an expert Python programmer. You will be provided with a block of Python code, a list of required packages, and an error message that occurred during code execution. Your task is to fix the code so that it runs successfully and provide an updated list of required packages if necessary. </task> <instructions>
1. Carefully analyze the provided Python code and the error message.
2. Identify the root cause of the error.
3. Modify the code to resolve the error.
4. Update the list of required packages if any additional packages are needed.
5. Ensure that the fixed code adheres to best practices where possible. </instructions> <rules>
• You must return both the fixed Python code and the updated list of required packages.
• Ensure the code and package list are in proper format. </rules> <examples> <!-- In-context few-shot examples --> </examples> <code> {code} </code> <required_modules> {required_modules} </required_modules> <error> {error} </error>
For Cypher queries, the prompt helps the model diagnose syntax or escaping issues based on the error log and returns a corrected version.
Error handling: Fix invalid Cypher query
<task> You are a Cypher expert, and you need to fix the syntax and semantics of a given incorrect Cypher query. </task> <instructions> Given the incorrect Cypher query and the error log:
1. Understand the source of the error (especially look out for wrongly escaped or unescaped characters).
2. Correct the Cypher query.
3. Return the corrected Cypher query. </instructions> <wrong_cypher> {cypher_to_fix} </wrong_cypher> <error_log> {error_log} </error_log>
Both prompts are reusable across pathways and enforce minimal, well-scoped corrections grounded in the provided error context.
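The shared retry mechanism can be sketched as follows: on failure, the offending code plus the error message are handed to an LLM "fixer" (stubbed here), and execution is retried a bounded number of times. `execute_with_fixes`, `executor`, and `llm_fix` are illustrative names, not KGoT's API.

```python
# Generic fix-and-retry loop around LLM-generated code or queries.

def execute_with_fixes(code, executor, llm_fix, max_retries=2):
    for _ in range(max_retries + 1):
        try:
            return executor(code)
        except Exception as err:
            # The error-handling prompts above fill {code}/{error} slots.
            code = llm_fix(code, str(err))
    raise RuntimeError("could not repair the generated code")

# Toy stand-ins: the first snippet has a typo; the "fixer" repairs it.
def executor(code):
    env = {}
    exec(code, env)
    return env["answer"]

def llm_fix(code, error):
    return code.replace("anwser", "answer")

print(execute_with_fixes("anwser = 6 * 7", executor, llm_fix))  # 42
```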
Appendix D Additional Results
We also plot the results from Figure 3 as a Pareto front in Figure 11.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Chart: Task Failure vs. Cost Comparison
### Overview
The image is a scatter plot comparing the number of failed tasks against the total cost for various systems. The goal is to have both values as low as possible, indicating better performance. The plot includes data points for KGOT (fusion), KGOT, Baselines, and Zero-Shot systems, with specific configurations like "Query" and "DR" (Direct Retrieval). The plot also includes shaded regions.
### Components/Axes
* **X-axis:** Total Cost ($) (the lower the better). Scale ranges from 0.00 to 10.00, with tick marks at intervals of 2.00.
* **Y-axis:** Number of Failed Tasks (the lower the better). Scale ranges from 90 to 150, with tick marks at intervals of 10.
* **Legend (bottom-left):**
* KGOT (fusion): Represented by a dark gray "X" marker.
* KGOT: Represented by a gray star marker.
* Baselines: Represented by a purple circle marker.
* Zero-Shot: Represented by a white diamond marker.
### Detailed Analysis
* **KGOT (fusion):**
* Neo4j (Query + DR): Located at approximately (5.5, 103).
* Neo4j + NetworkX (Query + DR): Located at approximately (9.5, 93).
* **KGOT:**
* Neo4j (Query): Located at approximately (3.5, 125).
* RDF4J (Query): Located at approximately (3.5, 129).
* Neo4j (DR): Located at approximately (5.5, 125).
* NetworkX (Query): Located at approximately (5.5, 120).
* NetworkX (DR): Located at approximately (5.5, 123).
* NetworkX (Query + DR): Located at approximately (7.5, 112).
* **Baselines:**
* GPTSwarm: Located at approximately (0.5, 139).
* Simple RAG: Located at approximately (5.5, 130).
* GraphRAG: Located at approximately (5.5, 143).
* HF Agents (GPT-4o mini): Located at approximately (9.5, 130).
* **Zero-Shot:**
* GPT-4o: Located at approximately (0.5, 136).
* GPT-4o mini: Located at approximately (0.5, 148).
### Key Observations
* The KGOT (fusion) data points generally have fewer failed tasks but higher costs compared to other KGOT configurations.
* The Zero-Shot data points have very low cost but high failed tasks.
* The Baseline data points are spread across the plot, with some having lower costs and others having lower failed tasks.
* There are two shaded regions, one in the top-left and one in the top-right.
### Interpretation
The plot visualizes the trade-off between the cost and the number of failed tasks for different systems. The ideal system would be located in the bottom-left corner of the plot, indicating low cost and low failed tasks.
* KGOT (fusion) appears to be more robust (fewer failed tasks) but at a higher cost.
* Zero-Shot methods are cheap but unreliable (high number of failed tasks).
* The Baseline methods show a range of performance, suggesting that their effectiveness depends on the specific configuration.
The shaded regions likely represent areas of unacceptable performance, either due to high cost or high failure rate. The systems that fall outside these regions are likely considered more viable options.
</details>
Figure 11: Pareto front plot of cost and error counts. We report results for answering 165 GAIA validation questions across different comparison targets, using the GPT-4o mini model with each baseline. For the Zero-Shot inference, we also include results for GPT-4o for comparison. Please note that we omit the results for Magentic-One and HF Agents (GPT-4o) as their high costs would heavily distort the plot. DR means Direct Retrieval.
We also plot the relative improvements of KGoT over Hugging Face Agents and GPTSwarm respectively in Figure 12, which is based on the results shown in Figure 5.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Bar Chart: Tasks Improved with KGOT Compared to HF Agents
### Overview
The bar chart compares the number of tasks improved by various language models when using KGoT (Knowledge Graph of Thoughts) compared to using HF (Hugging Face) Agents. The y-axis represents the number of tasks improved, and the x-axis lists the different language models. The chart also includes a horizontal line indicating the arithmetic mean of the improvements.
### Components/Axes
* **Y-axis:** "Tasks Improved with KGOT (compared to HF Agents)". Scale ranges from 0 to 8.
* **X-axis:** Categorical axis listing the language models:
* Qwen2.5-32B
* DeepSeek-R1-70B
* GPT-4o mini
* DeepSeek-R1-32B
* QWQ-32B
* DeepSeek-R1-7B
* DeepSeek-R1-1.5B
* Qwen2.5-72B
* Qwen2.5-7B
* Qwen2.5-1.5B
* **Bars:** Represent the number of tasks improved for each language model. The first five bars are light green, and the last five are light gray.
* **Arithmetic Mean Line:** A dashed horizontal line at y = 3.3, labeled "Arithmetic Mean: +3.3".
### Detailed Analysis
The chart displays the following data points:
* **Qwen2.5-32B:** +7 tasks improved (light green)
* **DeepSeek-R1-70B:** +6 tasks improved (light green)
* **GPT-4o mini:** +5 tasks improved (light green)
* **DeepSeek-R1-32B:** +4 tasks improved (light green)
* **QWQ-32B:** +4 tasks improved (light green)
* **DeepSeek-R1-7B:** +3 tasks improved (light gray)
* **DeepSeek-R1-1.5B:** +2 tasks improved (light gray)
* **Qwen2.5-72B:** +1 task improved (light gray)
* **Qwen2.5-7B:** +1 task improved (light gray)
* **Qwen2.5-1.5B:** 0 tasks improved (light gray)
The first five models (Qwen2.5-32B to QWQ-32B) show a higher improvement in tasks compared to the last five models (DeepSeek-R1-7B to Qwen2.5-1.5B).
### Key Observations
* Qwen2.5-32B shows the highest improvement with +7 tasks.
* Qwen2.5-1.5B shows no improvement (0 tasks).
* The arithmetic mean improvement is +3.3 tasks.
* There is a clear distinction between the performance of the first five models (light green bars) and the last five models (light gray bars).
### Interpretation
The data suggests that KGOT significantly improves the performance of certain language models compared to using HF Agents. The models Qwen2.5-32B, DeepSeek-R1-70B, GPT-4o mini, DeepSeek-R1-32B, and QWQ-32B benefit the most from KGOT. The models DeepSeek-R1-7B, DeepSeek-R1-1.5B, Qwen2.5-72B, and Qwen2.5-7B show a moderate improvement, while Qwen2.5-1.5B does not show any improvement. The difference in performance could be attributed to the architecture, size, or training data of the models. The arithmetic mean provides a general benchmark for the average improvement across all models. The chart highlights the effectiveness of KGOT for specific language models, indicating that KGOT is not universally beneficial and its impact varies depending on the model.
</details>
(a) Hugging Face Agents
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: Tasks Improved with KGOT (compared to GPTSwarm)
### Overview
The image is a bar chart comparing the performance of different language models on a set of tasks when using KGOT, relative to their performance using GPTSwarm. The y-axis represents the improvement in tasks, and the x-axis lists the different language models. The chart also includes a horizontal line indicating the arithmetic mean of the improvements.
### Components/Axes
* **Y-axis:** "Tasks Improved with KGOT (compared to GPTSwarm)". The scale ranges from -5 to 20, with gridlines at intervals of 5.
* **X-axis:** Lists the following language models:
* Qwen2.5-32B
* DeepSeek-R1-70B
* GPT-4o mini
* DeepSeek-R1-32B
* QwQ-32B
* DeepSeek-R1-7B
* DeepSeek-R1-1.5B
* Qwen2.5-72B
* Qwen2.5-7B
* Qwen2.5-1.5B
* **Bars:** Each bar represents the performance improvement of a specific language model. The bars are colored green for positive improvements, red for negative improvements, and gray for smaller positive improvements.
* **Arithmetic Mean Line:** A horizontal dashed line is present at y = 7.5, labeled "Arithmetic Mean: +7.5".
### Detailed Analysis
Here's a breakdown of the performance improvements for each language model:
* **Qwen2.5-32B:** -3 (Red bar, indicating a decrease in performance)
* **DeepSeek-R1-70B:** +12 (Green bar, indicating an improvement)
* **GPT-4o mini:** +14 (Green bar, indicating an improvement)
* **DeepSeek-R1-32B:** +15 (Green bar, indicating an improvement)
* **QwQ-32B:** +20 (Green bar, indicating the highest improvement)
* **DeepSeek-R1-7B:** +4 (Gray bar, indicating a smaller improvement)
* **DeepSeek-R1-1.5B:** +2 (Gray bar, indicating a smaller improvement)
* **Qwen2.5-72B:** +12 (Green bar, indicating an improvement)
* **Qwen2.5-7B:** 0 (No improvement)
* **Qwen2.5-1.5B:** -1 (Red bar, indicating a decrease in performance)
### Key Observations
* QwQ-32B shows the highest improvement with a value of +20.
* Qwen2.5-32B and Qwen2.5-1.5B show a decrease in performance with values of -3 and -1, respectively.
* The arithmetic mean of the improvements is +7.5.
* The majority of the models show a positive improvement when using KGOT compared to GPTSwarm.
### Interpretation
The bar chart illustrates the impact of using KGOT on the performance of various language models. The positive values indicate that KGOT generally improves performance compared to GPTSwarm. However, some models (Qwen2.5-32B and Qwen2.5-1.5B) experience a decrease in performance, suggesting that KGOT may not be universally beneficial and its effectiveness can depend on the specific model architecture or size. The arithmetic mean provides a general sense of the average improvement across all models tested. The significant improvement observed with QwQ-32B suggests that KGOT is particularly well-suited for this model. The gray bars indicate smaller improvements, suggesting that KGOT's impact is less pronounced on those models.
</details>
(b) GPTSwarm
Figure 12: Relative improvement of KGoT over Hugging Face Agents (left) and GPTSwarm (right) on the GAIA validation set using various LLMs.
Table 2: Comparison of KGoT with other current state-of-the-art open-source agents on the GAIA benchmark. We provide both the absolute (number of solved tasks) and relative (percentage) results. The baseline data on the test set is obtained through the leaderboard. We highlight the best performing scheme in a given category in bold. The validation set consists of 165 tasks in total (53 in level 1, 86 in level 2 and 26 in level 3), whereas the test set contains 301 tasks (93 in level 1, 159 in level 2 and 49 in level 3). DR stands for Direct Retrieval.
| | | Absolute | Relative | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Agents | Model | All | L1 | L2 | L3 | Avg. | L1 | L2 | L3 |
| Test Set | | | | | | | | | |
| GPTSwarm | GPT-4o mini | 33 | 15 | 15 | 3 | 10.96 | 16.13 | 9.43 | 6.12 |
| Magentic-One | GPT-4o mini | 43 | 22 | 18 | 3 | 14.29 | 23.66 | 11.32 | 6.12 |
| TapeAgent | GPT-4o mini | 66 | 28 | 35 | 3 | 21.93 | 30.11 | 22.01 | 6.12 |
| Hugging Face Agents | GPT-4o mini | 68 | 30 | 34 | 4 | 22.59 | 32.26 | 21.38 | 8.16 |
| KGoT (fusion) | GPT-4o mini | 73 | 33 | 36 | 4 | 24.25 | 35.48 | 22.64 | 8.16 |
| Validation Set | | | | | | | | | |
| Simple RAG | GPT-4o mini | 35 | 18 | 15 | 2 | 21.21 | 33.96 | 17.44 | 7.69 |
| GraphRAG | GPT-4o mini | 23 | 10 | 13 | 0 | 13.94 | 18.87 | 15.12 | 0.00 |
| Magentic-One | GPT-4o mini | 31 | 13 | 18 | 0 | 18.79 | 24.53 | 20.93 | 0.00 |
| No KG (Single Run #1) | GPT-4o mini | 30 | 14 | 14 | 2 | 18.18 | 26.42 | 16.28 | 7.69 |
| No KG (Single Run #2) | GPT-4o mini | 33 | 17 | 16 | 0 | 20.00 | 32.08 | 18.60 | 0.00 |
| No KG (Fusion) | GPT-4o mini | 40 | 18 | 20 | 2 | 24.24 | 33.96 | 23.26 | 7.69 |
| KGoT (Neo4j + DR) | GPT-4o mini | 40 | 21 | 16 | 3 | 24.24 | 39.62 | 18.60 | 11.54 |
| KGoT (NetworkX + Query) | GPT-4o mini | 44 | 21 | 21 | 2 | 26.67 | 39.62 | 24.42 | 7.69 |
| KGoT (NetworkX + DR) | GPT-4o mini | 40 | 20 | 18 | 2 | 24.24 | 37.74 | 20.93 | 7.69 |
| KGoT (RDF4J + Query) | GPT-4o mini | 36 | 20 | 15 | 1 | 21.82 | 37.74 | 17.44 | 3.85 |
| KGoT (fusion) (Neo4j; Query + DR) | GPT-4o mini | 57 | 29 | 24 | 4 | 34.55 | 54.72 | 27.91 | 15.38 |
| KGoT (fusion) (NetworkX; Query + DR) | GPT-4o mini | 57 | 27 | 28 | 2 | 34.55 | 50.94 | 32.56 | 7.69 |
| KGoT (fusion) (Neo4j + NetworkX; Query + DR) | GPT-4o mini | 71 | 34 | 33 | 4 | 43.03 | 64.15 | 38.37 | 15.38 |
| Zero-Shot | GPT-4o mini | 17 | 4 | 13 | 0 | 10.30 | 7.55 | 15.12 | 0.00 |
| Zero-Shot | GPT-4o | 29 | 10 | 17 | 2 | 17.58 | 18.87 | 19.77 | 7.69 |
| Zero-Shot | Qwen2.5-1.5B | 3 | 2 | 1 | 0 | 1.81 | 3.77 | 1.16 | 0.00 |
| Zero-Shot | Qwen2.5-7B | 9 | 4 | 5 | 0 | 5.45 | 7.55 | 5.81 | 0.00 |
| Zero-Shot | Qwen2.5-32B | 15 | 7 | 8 | 0 | 9.09 | 13.21 | 9.30 | 0.00 |
| Zero-Shot | Qwen2.5-72B | 19 | 6 | 13 | 0 | 11.52 | 11.32 | 15.12 | 0.00 |
| Zero-Shot | QwQ-32B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Zero-Shot | DeepSeek-R1-1.5B | 5 | 3 | 2 | 0 | 3.03 | 5.66 | 2.33 | 0.00 |
| Zero-Shot | DeepSeek-R1-7B | 13 | 8 | 5 | 0 | 7.88 | 15.09 | 5.81 | 0.00 |
| Zero-Shot | DeepSeek-R1-32B | 14 | 8 | 6 | 0 | 8.48 | 15.09 | 6.98 | 0.00 |
| Zero-Shot | DeepSeek-R1-70B | 20 | 9 | 10 | 1 | 12.12 | 16.98 | 11.63 | 3.85 |
| GPTSwarm | GPT-4o mini | 26 | 13 | 13 | 0 | 15.76 | 24.53 | 15.12 | 0.00 |
| GPTSwarm | Qwen2.5-1.5B | 5 | 4 | 1 | 0 | 3.03 | 7.55 | 1.16 | 0.00 |
| GPTSwarm | Qwen2.5-7B | 12 | 8 | 4 | 0 | 7.27 | 15.09 | 4.65 | 0.00 |
| GPTSwarm | Qwen2.5-32B | 29 | 15 | 14 | 0 | 17.58 | 28.30 | 16.28 | 0.00 |
| GPTSwarm | Qwen2.5-72B | 27 | 13 | 14 | 0 | 16.36 | 24.53 | 16.28 | 0.00 |
| GPTSwarm | QwQ-32B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPTSwarm | DeepSeek-R1-1.5B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPTSwarm | DeepSeek-R1-7B | 2 | 0 | 2 | 0 | 1.21 | 0.00 | 2.33 | 0.00 |
| GPTSwarm | DeepSeek-R1-32B | 6 | 3 | 3 | 0 | 3.64 | 5.66 | 3.49 | 0.00 |
| GPTSwarm | DeepSeek-R1-70B | 10 | 5 | 5 | 0 | 6.06 | 9.43 | 5.81 | 0.00 |
| Hugging Face Agents | GPT-4o mini | 35 | 14 | 20 | 1 | 21.21 | 26.42 | 23.26 | 3.85 |
| Hugging Face Agents | GPT-4o | 55 | 22 | 31 | 2 | 33.33 | 41.51 | 36.05 | 7.69 |
| Hugging Face Agents | Qwen2.5-1.5B | 4 | 2 | 2 | 0 | 2.42 | 3.77 | 2.33 | 0.00 |
| Hugging Face Agents | Qwen2.5-7B | 11 | 7 | 4 | 0 | 6.66 | 13.21 | 4.65 | 0.00 |
| Hugging Face Agents | Qwen2.5-32B | 19 | 10 | 9 | 0 | 11.52 | 18.87 | 11.63 | 0.00 |
| Hugging Face Agents | Qwen2.5-72B | 38 | 16 | 22 | 0 | 23.03 | 30.19 | 25.58 | 0.00 |
| Hugging Face Agents | QwQ-32B | 16 | 9 | 7 | 0 | 9.70 | 16.98 | 8.14 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-1.5B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-7B | 3 | 2 | 1 | 0 | 1.81 | 3.77 | 1.16 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-32B | 17 | 9 | 7 | 1 | 10.30 | 16.98 | 8.14 | 3.85 |
| Hugging Face Agents | DeepSeek-R1-70B | 16 | 9 | 6 | 1 | 9.70 | 16.98 | 6.98 | 3.85 |
| KGoT (Neo4j + Query) | GPT-4o mini | 40 | 21 | 18 | 1 | 24.24 | 39.62 | 20.93 | 3.85 |
| KGoT (Neo4j + Query) | Qwen2.5-1.5B | 4 | 3 | 1 | 0 | 2.42 | 5.66 | 1.16 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-7B | 12 | 7 | 5 | 0 | 7.27 | 13.21 | 5.81 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-32B | 26 | 12 | 14 | 0 | 15.76 | 22.64 | 16.28 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-72B | 39 | 18 | 21 | 0 | 23.64 | 33.96 | 24.42 | 0.00 |
| KGoT (Neo4j + Query) | QwQ-32B | 20 | 11 | 9 | 0 | 12.12 | 20.75 | 10.47 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-1.5B | 2 | 1 | 1 | 0 | 1.21 | 1.89 | 1.16 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-7B | 6 | 3 | 3 | 0 | 3.64 | 5.66 | 3.49 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-32B | 21 | 12 | 9 | 0 | 12.73 | 22.64 | 10.47 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-70B | 22 | 11 | 10 | 1 | 13.33 | 20.75 | 11.63 | 3.85 |
D.1 SimpleQA Results
Table 3: Comparison of KGoT, HF Agents and GPTSwarm on a subset of SimpleQA as well as the results for KGoT on the full benchmark. We highlight the best performing scheme in a given category in bold. Model: GPT-4o mini.
| Framework | Correct (%) | Not attempted (%) | Incorrect (%) | Correct given attempted (%) | F-score | Total cost ($) | Cost per solved task ($) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPTSwarm | 53.8106 | 6.2356 | 39.9538 | 57.3892 | 55.5 | 0.2159 | 0.00092660 |
| HF Agents | 66.0508 | 18.0139 | 15.9353 | 80.5634 | 72.6 | 16.7117 | 0.05843265 |
| KGoT | 73.2102 | 1.6166 | 25.1732 | 74.4131 | 73.8 | 5.6432 | 0.01780182 |
| KGoT (Full) | 70.3421 | 2.0342 | 27.8548 | 71.8027 | 71.1 | 59.1538 | 0.01943931 |
Table 4: F1-score comparison of KGoT, OpenAI and Claude models on SimpleQA. OpenAI and Claude results were taken from the official repository (OpenAI, 2025). Model for KGoT: GPT-4o mini.
| Reasoning Models | F1-score | Assistant Models | F1-score |
| --- | --- | --- | --- |
| o1 | 42.6 | gpt-4.1-2025-04-14 | 41.6 |
| o1-preview | 42.4 | gpt-4.1-mini-2025-04-14 | 16.8 |
| o3-high | 48.6 | gpt-4.1-nano-2025-04-14 | 7.6 |
| o3 | 49.4 | gpt-4o-2024-11-20 | 38.8 |
| o3-low | 49.4 | gpt-4o-2024-08-06 | 40.1 |
| o1-mini | 7.6 | gpt-4o-2024-05-13 | 39.0 |
| o3-mini-high | 13.8 | gpt-4o-mini-2024-07-18 | 9.5 |
| o3-mini | 13.4 | gpt-4.5-preview-2025-02-27 | 62.5 |
| o3-mini-low | 13.0 | gpt-4-turbo-2024-04-09 | 24.2 |
| o4-mini-high | 19.3 | Claude 3.5 Sonnet | 28.9 |
| o4-mini | 20.2 | Claude 3 Opus | 23.5 |
| o4-mini-low | 20.2 | | |
| KGoT | 71.1 | | |
D.2 Impact from Various Design Decisions
Table 5: Analysis of different design decisions and tool sets in KGoT. "ST" stands for the type of the solve operation and pathway ("GQ": graph query, "DR": Direct Retrieval), "PF" for the prompt format ("MD": Markdown), and "merged" stands for a combination of the original KGoT tools and the Hugging Face Agents tools.
| Configuration | Metrics | | | | |
| --- | --- | --- | --- | --- | --- |
| Tools | ST | PF | Solved | Time (h) | Cost |
| HF | DR | XML | 37 | 11.87 | $7.84 |
| HF | GQ | MD | 33 | 9.70 | $4.28 |
| merged | GQ | XML | 31 | 10.62 | $5.43 |
| HF | GQ | XML | 30 | 13.02 | $4.90 |
| original KGoT | GQ | XML | 27 | 27.57 | $6.85 |
We explored different tool sets, with selected results presented in Table 5. Initially, we examined the limitations of our original tools and subsequently integrated the complete Hugging Face Agents tool set into the KGoT framework, which led to improvements in accuracy, runtime, and cost efficiency. A detailed analysis allowed us to merge the most effective components from both tool sets into an optimized hybrid tool set, further enhancing accuracy and runtime while only moderately increasing costs. Key improvements include a tighter integration between the ExtractZip tool and the Text Inspector tool, which now supports Markdown, as well as enhancements to the Surfer Agent, incorporating a Wikipedia tool and augmenting viewport segmentation with full-page summarization. This optimized tool set was used for all subsequent experiments.
We further evaluated different prompt formats in the initial iterations of KGoT. While our primary format was XML-based, we conducted additional tests using Markdown. Initial experiments with the Hugging Face Agents tool set (see Table 5) combined with Markdown and GPT-4o mini yielded improved accuracy, reduced runtime, and lower costs. However, these results were not consistently reproducible with GPT-4o. Moreover, Markdown-based prompts interfered with optimizations such as Direct Retrieval, ultimately leading us to retain the XML-based format.
<details>
<summary>x23.png Details</summary>

### Visual Description
## Stacked Bar Chart: Number of Solved Tasks by Level
### Overview
The image is a stacked bar chart comparing the number of solved tasks across different configurations of Neo4j and NetworkX, categorized by task difficulty levels (Level 1, Level 2, and Level 3). The y-axis represents the number of solved tasks, ranging from 0 to 80. The x-axis represents different configurations of Neo4j and NetworkX, such as using both query and DR (Direct Retrieval), query only, or DR only.
### Components/Axes
* **Y-axis:** "Number of Solved Tasks", ranging from 0 to 80 in increments of 20.
* **X-axis:** Categorical axis representing different configurations of Neo4j and NetworkX:
* Neo4j (Query + DR)
* NetworkX (Query + DR)
* NetworkX + Neo4j (with Query only)
* NetworkX + Neo4j (with DR only)
* Neo4j + NetworkX (Query + DR)
* **Legend:** Located at the top of the chart.
* Level 1: Light teal color
* Level 2: Blue color
* Level 3: Light purple color
### Detailed Analysis
The chart presents the number of solved tasks for each configuration, broken down by difficulty level.
* **Neo4j (Query + DR):**
* Level 1: 29
* Level 2: 24
* Level 3: 4
* **NetworkX (Query + DR):**
* Level 1: 27
* Level 2: 28
* Level 3: 2
* **NetworkX + Neo4j (with Query only):**
* Level 1: 28
* Level 2: 25
* Level 3: 3
* **NetworkX + Neo4j (with DR only):**
* Level 1: 26
* Level 2: 24
* Level 3: 3
* **Neo4j + NetworkX (Query + DR):**
* Level 1: 34
* Level 2: 33
* Level 3: 4
### Key Observations
* The "Neo4j + NetworkX (Query + DR)" configuration has the highest number of solved tasks overall.
* Level 1 tasks are generally the most solved across all configurations.
* Level 3 tasks are the least solved across all configurations.
* The "NetworkX (Query + DR)" configuration has the lowest number of solved Level 3 tasks (2).
### Interpretation
The chart compares the performance of different configurations of Neo4j and NetworkX in solving tasks of varying difficulty levels. The "Neo4j + NetworkX (Query + DR)" configuration appears to be the most effective, solving the highest number of tasks overall. The distribution of solved tasks across difficulty levels suggests that Level 1 tasks are the easiest, while Level 3 tasks are the most challenging for all configurations. The data suggests that combining Neo4j and NetworkX with both Query and DR leads to better performance in solving tasks.
</details>
Figure 13: Comparison of different fusion types with respect to the task solve operation as well as the graph backend type. We report results for answering 165 GAIA validation questions across different comparison targets. DR stands for Direct Retrieval. Model: GPT-4o mini.
Graph Backend vs. Task Solve Operation We provide more detailed results in Figure 13, studying the performance of the following configurations: NetworkX + Neo4j (with query only) and NetworkX + Neo4j (with DR only), as well as Neo4j (query + DR) and NetworkX (query + DR). Overall, the fusion of backends with DR only offers smaller advantages than the other types of fusion. This indicates that the different graph query languages have different strengths, and their fusion comes with the largest combined advantage.
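The two solve pathways compared above can be sketched as follows. This is a simplified illustration using a plain dict in place of a real NetworkX or Neo4j backend; the function names and the toy graph are assumptions, not the KGoT implementation.

```python
# Toy knowledge graph standing in for a NetworkX/Neo4j backend.
knowledge_graph = {
    "paper_A": {"cites": ["paper_B", "paper_C"]},
    "paper_B": {"cites": ["paper_C"]},
    "paper_C": {"cites": []},
}

def solve_by_graph_query(graph, node):
    """Graph-query pathway: execute a targeted (LLM-generated) query
    and return only the matching facts."""
    return graph[node]["cites"]

def solve_by_direct_retrieval(graph):
    """Direct-retrieval pathway: serialize the entire graph state so
    the LLM can read it directly in the prompt."""
    lines = []
    for src, edges in graph.items():
        for dst in edges["cites"]:
            lines.append(f"({src})-[:cites]->({dst})")
    return "\n".join(lines)

print(solve_by_graph_query(knowledge_graph, "paper_A"))
print(solve_by_direct_retrieval(knowledge_graph))
```

The query pathway returns a small, focused result set, while direct retrieval trades prompt length for completeness; fusing backends benefits most when their query languages, not their serializations, differ.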
D.3 Runtime
We provide a runtime overview of running KGoT on the validation set of the GAIA benchmark with GPT-4o mini, Neo4j, and query-based retrieval in Figure 14. The right part follows the categorization in Appendix C. We provide a more detailed analysis of the runtime in Figure 17.
<details>
<summary>x24.png Details</summary>

### Visual Description
## Donut Chart: KGoT Runtime Distribution
### Overview
The image is a donut chart illustrating the runtime distribution of KGoT, broken down into four categories: tools, Neo4j, control logic, and postprocessing. The chart also displays the total runtime.
### Components/Axes
* **Title:** KGoT Runtime Distribution
* **Categories:**
* tools
* Neo4j
* control logic
* postprocessing
* **Values:** Represented as percentages of the total runtime.
* **Total Runtime:** 35817.29 s
### Detailed Analysis
* **tools:** 71.5% (teal color) - Occupies the largest portion of the donut chart.
* **Neo4j:** 11.2% (blue color) - Located on the right side of the chart.
* **control logic:** 11.1% (light green color) - Positioned at the top-right of the chart.
* **postprocessing:** 6.07% (light green color) - Located at the top-left of the chart.
* **Total Runtime:** 35817.29 s - Displayed in the center of the donut chart.
### Key Observations
* The "tools" category accounts for the vast majority (71.5%) of the total runtime.
* "Neo4j" and "control logic" have similar runtime percentages (11.2% and 11.1% respectively).
* "postprocessing" has the smallest runtime percentage (6.07%).
### Interpretation
The donut chart provides a clear visualization of the runtime distribution for KGoT. The dominance of the "tools" category suggests that the majority of the processing time is spent on tasks related to tools. The relatively small percentage for "postprocessing" indicates that this stage is less time-consuming compared to the other categories. The total runtime of 35817.29 seconds provides a benchmark for evaluating the overall performance of KGoT.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Donut Chart: KGoT Runtime Distribution
### Overview
The image is a donut chart illustrating the runtime distribution of KGoT, broken down into five categories: tool invocations, system robustness, graph executor, solution formatting, and tool executor. The chart displays the percentage of total runtime each category consumes. The total runtime is also provided.
### Components/Axes
* **Title:** KGoT Runtime Distribution
* **Categories:**
* tool invocations
* system robustness
* graph executor
* solution formatting
* tool executor
* **Total Runtime:** 35817.29 s
### Detailed Analysis
* **tool invocations:** 71.5% (Light Blue) - This category accounts for the largest portion of the runtime.
* **system robustness:** 13.6% (Dark Blue) - The second largest portion of the runtime.
* **graph executor:** 7.06% (Teal) - A smaller, but still significant, portion of the runtime.
* **solution formatting:** 6.07% (Light Green) - A smaller portion of the runtime.
* **tool executor:** 1.76% (Pale Green) - This category accounts for the smallest portion of the runtime.
### Key Observations
* tool invocations dominate the runtime, accounting for nearly three-quarters of the total.
* tool executor has the smallest runtime percentage.
* The total runtime is 35817.29 seconds.
### Interpretation
The donut chart clearly shows that tool invocations are the most time-consuming aspect of the KGoT runtime. System robustness also contributes a significant portion. The other three categories (graph executor, solution formatting, and tool executor) have relatively smaller contributions to the overall runtime. This suggests that optimizing tool invocations would likely have the greatest impact on reducing the total runtime of KGoT.
</details>
Figure 14: Different runtime categorizations of the same data. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
D.4 Compute Resources
Because of the long runtimes, we executed most experiments using the OpenAI API as an external resource on server compute nodes containing an AMD EPYC 7742 CPU with 128 cores running at 2.25GHz and a total memory of 256GB. However, when the LLM is called as an external resource, KGoT is able to run on commodity hardware with minimal effects on runtime.
Our experiments with locally run LLMs were executed on compute nodes containing four NVIDIA GH200 superchips, each with 96GB of GPU memory, and a total memory of 896GB. In these cases, the minimum hardware requirements are dictated by the resources needed to run each LLM locally.
High-performance & scalability experiments were performed on an Apple M3 Pro with 12 cores at 4.056GHz and a total memory of 18GB.
D.5 GAIA Result Visualizations
We also implemented automatic scripts that plot various aspects of a finished GAIA run. In the following, we provide example plots for Neo4j with query-based retrieval.
We provide a breakdown for each level of the GAIA benchmark into the categories that KGoT's answers fall into in Figure 15. We measure the runtime and costs of the various components of KGoT and illustrate them in Figure 17. We also provide insights into tool usage, starting with the number of tasks for which a specific tool is used and whether those tasks were successful (see Figure 16). A more detailed analysis of the tool selection is provided in the plots of Figures 18 and 19, and the number of times the tools are used is shown in Figure 20.
We now provide a brief explanation of the more opaque function names listed in Figure 17.
- Any function marked as not logged refers to function or tool calls that do not incur an LLM-related cost or where usage costs are logged within the tool itself.
- WebSurfer.forward submits a query to SerpApi.
- Define Cypher query given new information constructs a Cypher insert query based on newly gathered information.
- Fix JSON corrects malformed or invalid JSON for services like Neo4j.
- Define forced retrieve queries generates a Cypher retrieval query when the maximum number of iterations is reached.
- Generate forced solution generates a solution based on the state of the knowledge graph if no viable solution has been parsed after a Cypher retrieve or if the forced retrieval fails after exhausting all iterations.
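Two of the helpers above can be sketched in a few lines. Note the hedge: the real KGoT implementations prompt an LLM to construct the Cypher query and to repair the JSON, whereas this illustration uses a fixed template and simple string heuristics, so all names and repair rules here are assumptions for demonstration.

```python
import json

def define_cypher_insert(subject: str, relation: str, obj: str) -> str:
    """Sketch of 'Define Cypher query given new information': turn a newly
    gathered fact into a Cypher MERGE statement (template is illustrative;
    KGoT generates the query with an LLM)."""
    return (
        f"MERGE (a:Entity {{name: '{subject}'}}) "
        f"MERGE (b:Entity {{name: '{obj}'}}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )

def fix_json(text: str) -> dict:
    """Sketch of 'Fix JSON': try to parse, then apply simple repairs
    (single quotes, trailing commas) before giving up. KGoT instead asks
    the LLM to repair the malformed output."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = text.replace("'", '"').replace(",}", "}").replace(",]", "]")
        return json.loads(repaired)  # raises again if still malformed

print(define_cypher_insert("entity_a", "CITES", "entity_b"))
print(fix_json("{'albums': 2,}"))
```

Using MERGE rather than CREATE keeps repeated insertions of the same fact idempotent, which matters when the KG is built iteratively across many tool calls.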
<details>
<summary>figures/all_plot_all_stats.png Details</summary>

### Visual Description
## Bar Chart: Performance Rates by Level
### Overview
The image is a bar chart comparing the rates of different outcomes (Correct, Correct forced, Close call, Wrong forced, Other error, and Wrong) across three levels (1, 2, and 3). The y-axis represents the rate in percentage, ranging from 0% to 100%. Each level has a set of bars representing the rates of each outcome. The chart includes a legend in the top-left corner to identify the color associated with each outcome.
### Components/Axes
* **X-axis:** "Level" with categories 1, 2, and 3.
* **Y-axis:** "Rate (%)" ranging from 0 to 100, with gridlines at intervals of 20.
* **Legend:** Located in the top-left corner, mapping colors to outcomes:
* Green: Correct
* Cyan: Correct forced
* Blue: Close call
* Yellow: Wrong forced
* Orange: Other error
* Red: Wrong
### Detailed Analysis
**Level 1:**
* **Correct (Green):** 37% (20/53)
* **Correct forced (Cyan):** 1% (1/53)
* **Close call (Blue):** 0% (0/53)
* **Wrong forced (Yellow):** 1% (1/53)
* **Other error (Orange):** 3% (2/53)
* **Wrong (Red):** 54% (29/53)
**Level 2:**
* **Correct (Green):** 20% (18/86)
* **Correct forced (Cyan):** 0% (0/86)
* **Close call (Blue):** 0% (0/86)
* **Wrong forced (Yellow):** 5% (5/86)
* **Other error (Orange):** 0% (0/86)
* **Wrong (Red):** 73% (63/86)
**Level 3:**
* **Correct (Green):** 3% (1/26)
* **Correct forced (Cyan):** 0% (0/26)
* **Close call (Blue):** 0% (0/26)
* **Wrong forced (Yellow):** 3% (1/26)
* **Other error (Orange):** 0% (0/26)
* **Wrong (Red):** 92% (24/26)
### Key Observations
* The "Wrong" outcome (red) increases significantly from Level 1 (54%) to Level 2 (73%) to Level 3 (92%).
* The "Correct" outcome (green) decreases from Level 1 (37%) to Level 2 (20%) to Level 3 (3%).
* "Correct forced," "Close call," and "Other error" outcomes are consistently low across all levels.
* "Wrong forced" is low, but slightly higher in Level 2 (5%) compared to Levels 1 and 3 (1% and 3% respectively).
### Interpretation
The data suggests that as the level increases, the rate of "Wrong" outcomes increases dramatically, while the rate of "Correct" outcomes decreases. This indicates a potential increase in difficulty or complexity as the level progresses, leading to more errors. The consistently low rates of "Correct forced," "Close call," and "Other error" suggest these outcomes are relatively rare across all levels. The increase in "Wrong forced" at level 2 could indicate a specific challenge or change in the task at that level. Overall, the chart highlights a clear trend of decreasing performance with increasing level.
</details>
Figure 15: Number of tasks per level that succeed or fall into a given error category. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_category_success.png Details</summary>

### Visual Description
## Bar Chart: Question Success by GAIA Categories
### Overview
The image is a horizontal bar chart displaying the success rate of questions categorized by GAIA tools. The chart compares the number of successful and failed questions for each tool category. The total number of questions is 165.
### Components/Axes
* **Title:** Question Success by GAIA Categories
* **Subtitle:** Total Questions: 165
* **X-axis:** Number of Questions, ranging from 0 to 120.
* **Y-axis:** GAIA Categories (list below)
* **Legend:** Located in the top-right corner.
* Successful (Green)
* Failed (Red)
* **Categories (Y-axis):**
* search\_information\_tools
* calculator
* image\_recognition\_processing\_tools
* pdf\_tools
* spreadsheet\_tools
* text\_processing\_analysis\_tools
* video\_tools
* programming\_code\_tools
* audio\_tools
* document\_access\_tools
* specialized\_tools
* search\_location\_tools
* general\_utilities
### Detailed Analysis
The chart presents the number of successful and failed questions for each GAIA category. The values are displayed directly on the bars.
* **search\_information\_tools:** 98 Failed, 23 Successful
* **calculator:** 36 Failed, 7 Successful
* **image\_recognition\_processing\_tools:** 28 Failed, 2 Successful
* **pdf\_tools:** 10 Failed, 6 Successful
* **spreadsheet\_tools:** 9 Failed, 5 Successful
* **text\_processing\_analysis\_tools:** 8 Failed, 2 Successful
* **video\_tools:** 7 Failed, 2 Successful
* **programming\_code\_tools:** 6 Failed, 1 Successful
* **audio\_tools:** 3 Failed, 3 Successful
* **document\_access\_tools:** 4 Failed, 1 Successful
* **specialized\_tools:** 3 Failed, 1 Successful
* **search\_location\_tools:** 2 Failed, 2 Successful (the successful bar is barely visible due to the small value)
* **general\_utilities:** 2 Failed, 2 Successful (the successful bar is barely visible due to the small value)
### Key Observations
* The "search\_information\_tools" category has the highest number of questions, with a significant number of failed questions.
* The ratio of failed to successful questions varies across categories. Some categories, like "audio\_tools", have a relatively balanced ratio.
* "search\_location\_tools" and "general\_utilities" have very few questions overall.
### Interpretation
The chart provides insights into the performance of different GAIA tool categories based on question success rates. The data suggests that some tool categories, such as "search\_information\_tools," may require further attention due to the high number of failed questions. The balanced ratio in "audio\_tools" indicates a potentially well-performing category. The low question counts in "search\_location\_tools" and "general\_utilities" might suggest these tools are less frequently used or tested.
</details>
Figure 16: Overview of how many tasks use a given tool and whether they are successful. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_cost_summary_cost.png Details</summary>

### Visual Description
## Bar Chart: Tool Performance
### Overview
The image is a bar chart displaying the performance of various tools. The y-axis represents a numerical value, and the x-axis lists the names of the tools. The chart also includes horizontal lines indicating the arithmetic mean, minimum, and maximum values.
### Components/Axes
* **X-axis:** Tool names (listed below)
* **Y-axis:** Numerical value, ranging from 0.0 to 2.5, with increments of 0.5.
* **Bars:** Blue bars representing the performance value for each tool.
* **Horizontal Lines:**
* Dotted line at approximately 2.4, labeled "Max: $2.41e+00"
* Dashed line at approximately 0.18, labeled "Arithmetic Mean: $1.86e-01"
* Dashed line at approximately 0.0006, labeled "Min: $6.63e-04"
### Detailed Analysis
The following is a list of the tools and their approximate values, extracted from the bar chart.
* **SurferTool:** ~2.4
* **define\_next\_step:** ~0.38
* **parse\_solution\_with\_llm:** ~0.3
* **Wikipedia.get\_page\_content:** ~0.14
* **define\_cypher\_query\_given\_new\_information:** ~0.13
* **fix\_cypher:** ~0.12
* **define\_need\_for\_math\_before\_parsing:** ~0.1
* **define\_math\_tool\_call:** ~0.09
* **WebSurfer.forward:** ~0.09
* **merge\_reasons\_to\_insert:** ~0.08
* **define\_tool\_calls:** ~0.08
* **define\_final\_solution:** ~0.07
* **define\_retrieve\_query:** ~0.02
* **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** ~0.01
* **TextInspector:** ~0.01
* **define\_forced\_retrieve\_queries:** ~0.01
* **ImageQuestion.\_run:** ~0.01
* **generate\_forced\_solution:** ~0.005
* **LLMTool.\_run:** ~0.005
* **RunPythonCodeTool.\_fix\_code:** ~0.005
* **fix\_json:** ~0.005
### Key Observations
* **SurferTool** has a significantly higher value than all other tools.
* Most tools have relatively low values, clustered near the bottom of the chart.
* The arithmetic mean is significantly higher than the minimum value, indicating a skewed distribution.
### Interpretation
The bar chart illustrates the relative performance of different tools. The SurferTool appears to be the most effective or frequently used, as indicated by its high value. The other tools have considerably lower values, suggesting they are either less effective, less frequently used, or have a different scale of measurement. The arithmetic mean provides a sense of the average performance, while the minimum and maximum values highlight the range of performance across all tools. The skewed distribution suggests that a few tools perform significantly better than the majority.
</details>
(a) Cost in dollar.
<details>
<summary>figures/all_cost_summary_number_of_calls.png Details</summary>

### Visual Description
## Bar Chart: Frequency of Actions
### Overview
The image is a bar chart displaying the frequency of different actions. The x-axis lists the actions, and the y-axis represents the count or frequency of each action. The bars are colored in purple. The chart also includes horizontal lines indicating the arithmetic mean, maximum, and minimum values.
### Components/Axes
* **X-axis:** Lists various actions, including "define\_next\_step", "SurferTool", "parse\_solution\_with\_llm", "define\_need\_for\_math\_before\_parsing", "fix\_cypher", "define\_cypher\_query\_given\_new\_information", "merge\_reasons\_to\_insert", "define\_math\_tool\_call", "define\_tool\_calls", "ask\_search\_agent\_NOT\_LOGGED", "run\_python\_code\_NOT\_LOGGED", "define\_final\_solution", "define\_retrieve\_query", "Wikipedia.get\_page\_content", "define\_forced\_retrieve\_queries", "inspect\_file\_as\_text\_NOT\_LOGGED", "Wikipedia.ask\_LLM\_which\_article", "generate\_forced\_solution", "WebSurfer.forward", "TextInspector", "llm\_which\_article\_to\_explore", "LLMTool.\_run", "llm\_query\_NOT\_LOGGED", "image\_inspector\_NOT\_LOGGED", "ImageQuestion.\_run", "extract\_zip\_NOT\_LOGGED", "RunPythonCodeTool.\_fix\_code", "fix\_json", "AudioTranscriptionLoader.transcribe\_audio". The labels are rotated to fit.
* **Y-axis:** Numerical scale ranging from 0 to 2000, with increments of 500.
* **Bars:** Purple bars representing the frequency of each action.
* **Horizontal Lines:**
* Dashed line at approximately y=339, labeled "Arithmetic Mean: 339".
* Dashed line at approximately y=2160, labeled "Max: 2160".
* **Minimum Value:** "Min: 3" is indicated near the bottom-right of the chart.
### Detailed Analysis
The bar chart shows a clear distribution of action frequencies.
* **Top Actions:** The most frequent actions are "define\_next\_step", "SurferTool", and "parse\_solution\_with\_llm", with frequencies around 2100, 2080, and 2030 respectively.
* **Mid-Range Actions:** Actions like "fix\_cypher", "define\_cypher\_query\_given\_new\_information", "merge\_reasons\_to\_insert", "define\_math\_tool\_call", and "define\_tool\_calls" have frequencies between approximately 300 and 700.
* **Low-Frequency Actions:** Many actions have very low frequencies, close to the minimum value of 3. These include "Wikipedia.ask\_LLM\_which\_article", "generate\_forced\_solution", "WebSurfer.forward", "TextInspector", "llm\_which\_article\_to\_explore", "LLMTool.\_run", "llm\_query\_NOT\_LOGGED", "image\_inspector\_NOT\_LOGGED", "ImageQuestion.\_run", "extract\_zip\_NOT\_LOGGED", "RunPythonCodeTool.\_fix\_code", "fix\_json", and "AudioTranscriptionLoader.transcribe\_audio".
Specific data points (approximate due to visual estimation):
* define\_next\_step: ~2100
* SurferTool: ~2080
* parse\_solution\_with\_llm: ~2030
* define\_need\_for\_math\_before\_parsing: ~700
* fix\_cypher: ~650
* define\_cypher\_query\_given\_new\_information: ~400
* merge\_reasons\_to\_insert: ~300
* define\_math\_tool\_call: ~280
* define\_tool\_calls: ~270
* ask\_search\_agent\_NOT\_LOGGED: ~250
* run\_python\_code\_NOT\_LOGGED: ~240
* define\_final\_solution: ~200
* define\_retrieve\_query: ~150
* Wikipedia.get\_page\_content: ~120
* define\_forced\_retrieve\_queries: ~90
* inspect\_file\_as\_text\_NOT\_LOGGED: ~70
* Wikipedia.ask\_LLM\_which\_article: ~40
* generate\_forced\_solution: ~30
* WebSurfer.forward: ~25
* TextInspector: ~20
* llm\_which\_article\_to\_explore: ~15
* LLMTool.\_run: ~12
* llm\_query\_NOT\_LOGGED: ~10
* image\_inspector\_NOT\_LOGGED: ~8
* ImageQuestion.\_run: ~7
* extract\_zip\_NOT\_LOGGED: ~6
* RunPythonCodeTool.\_fix\_code: ~5
* fix\_json: ~4
* AudioTranscriptionLoader.transcribe\_audio: ~3
### Key Observations
* The top three actions are significantly more frequent than all other actions.
* There is a long tail of actions with very low frequencies.
* The arithmetic mean is 339, but most actions have frequencies below this value, indicating a skewed distribution.
* Several actions are marked with "NOT LOGGED", suggesting that data logging might be incomplete or inconsistent for these actions.
### Interpretation
The data suggests that certain initial steps ("define\_next\_step", "SurferTool", "parse\_solution\_with\_llm") are heavily used, possibly indicating core functionalities or common starting points in a workflow. The long tail of less frequent actions could represent more specialized or less common tasks. The "NOT LOGGED" labels raise questions about the completeness and reliability of the data for those specific actions. The skewed distribution highlights that a few actions dominate the overall usage, while many others are used infrequently. This information could be valuable for optimizing workflows, prioritizing development efforts, and improving data logging practices.
</details>
(b) Number of calls.
<details>
<summary>figures/all_cost_summary_duration.png Details</summary>

### Visual Description
## Bar Chart: Task Execution Times
### Overview
The image is a bar chart displaying the execution times of various tasks. The x-axis represents the task names, and the y-axis represents the execution time in seconds. The bars are colored in a shade of red. The chart also includes horizontal lines indicating the maximum execution time and the arithmetic mean.
### Components/Axes
* **X-axis:** Task names (listed below in "Detailed Analysis")
* Labels are rotated ~45 degrees for readability.
* **Y-axis:** Execution time in seconds (s)
* Scale: 0 to 12000, with increments of 2000.
* **Bars:** Represent the execution time for each task. All bars are the same shade of red.
* **Horizontal Lines:**
* Dotted line at y = 12237.19, labeled "Max: 12237.19 s"
* Dashed line at y = 1279.19, labeled "Arithmetic Mean: 1279.19 s"
* **Minimum Value:** "Min: 0.01 s" is noted at the bottom right.
### Detailed Analysis
The following tasks are listed on the x-axis, along with their approximate execution times (estimated from the bar heights):
1. **ask\_search\_agent\_NOT\_LOGGED:** ~10 s
2. **SurferTool:** ~12200 s
3. **define\_next\_step:** ~9200 s
4. **define\_math\_tool\_call:** ~3000 s
5. **fix\_cypher:** ~2300 s
6. **define\_tool\_calls:** ~2000 s
7. **parse\_solution\_with\_llm:** ~1800 s
8. **define\_cypher\_query\_given\_new\_information:** ~1500 s
9. **merge\_reasons\_to\_insert:** ~500 s
10. **define\_need\_for\_math\_before\_parsing:** ~400 s
11. **inspect\_file\_as\_text\_NOT\_LOGGED:** ~400 s
12. **WebSurfer.forward:** ~300 s
13. **Wikipedia.get\_page\_content:** ~300 s
14. **TextInspector:** ~300 s
15. **define\_retrieve\_query:** ~200 s
16. **image\_inspector\_NOT\_LOGGED:** ~200 s
17. **define\_final\_solution:** ~200 s
18. **ImageQuestion.\_run:** ~150 s
19. **run\_python\_code\_NOT\_LOGGED:** ~150 s
20. **llm\_query\_NOT\_LOGGED:** ~100 s
21. **define\_forced\_retrieve\_queries:** ~100 s
22. **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** ~100 s
23. **LLMTool.\_run:** ~100 s
24. **RunPythonCodeTool.\_fix\_code:** ~50 s
25. **generate\_forced\_solution:** ~50 s
26. **fix\_json:** ~50 s
27. **AudioTranscriptionLoader.transcribe\_audio:** ~50 s
28. **extract\_zip\_NOT\_LOGGED:** ~50 s
### Key Observations
* The "SurferTool" task has a significantly higher execution time compared to all other tasks.
* "define\_next\_step" also has a high execution time, though not as high as "SurferTool".
* Most tasks have relatively low execution times, clustered near the bottom of the chart.
* The arithmetic mean execution time (1279.19 s) is heavily influenced by the two tasks with very high execution times.
### Interpretation
The data suggests that the "SurferTool" and "define\_next\_step" tasks are the most time-consuming operations. This could be due to the complexity of these tasks, inefficient code, or external factors such as network latency. The large difference in execution times between these tasks and the others indicates a potential area for optimization. The "NOT\_LOGGED" suffix on some tasks may indicate that execution time logging was not enabled for those tasks, or that those tasks did not generate log data. The arithmetic mean is not a good representation of the typical execution time due to the outliers.
</details>
(c) Duration in seconds.
<details>
<summary>figures/all_cost_summary_cost_token.png Details</summary>

### Visual Description
## Bar Chart: Performance of Different Tools
### Overview
The image is a bar chart comparing the performance of different tools. The y-axis represents a value scaled by 10^-7, and the x-axis lists the names of the tools. The chart shows a decreasing trend in performance from left to right.
### Components/Axes
* **Y-axis:** Labeled as "x10^-7". The scale ranges from 0 to 4 with tick marks at every integer.
* **X-axis:** Lists the names of different tools. The labels are rotated for readability.
* **Bars:** Represent the performance of each tool. All bars are the same shade of blue.
* **Horizontal Gridlines:** Dotted lines at each integer value on the y-axis.
* **Max Value Annotation:** Located at the top-right, indicating "Max: $4.75e-07".
* **Min Value Annotation:** Located near the bottom-right, indicating "Min: $1.02e-07".
### Detailed Analysis
The following is a list of the tools and their approximate values, read from the bar chart:
1. **LLMTool.\_run:** 4.7 x 10^-7
2. **define\_math\_tool\_call:** 3.1 x 10^-7
3. **ImageQuestion.\_run:** 2.7 x 10^-7
4. **RunPythonCodeTool.\_fix\_code:** 2.6 x 10^-7
5. **fix\_json:** 2.6 x 10^-7
6. **fix\_cypher:** 2.4 x 10^-7
7. **define\_cypher\_query\_given\_new\_information:** 2.4 x 10^-7
8. **TextInspector:** 2.2 x 10^-7
9. **merge\_reasons\_to\_insert:** 2.2 x 10^-7
10. **generate\_forced\_solution:** 1.9 x 10^-7
11. **define\_final\_solution:** 1.6 x 10^-7
12. **WebSurfer.forward:** 1.6 x 10^-7
13. **define\_need\_for\_math\_before\_parsing:** 1.5 x 10^-7
14. **parse\_solution\_with\_llm:** 1.5 x 10^-7
15. **Wikipedia.get\_page\_content:** 1.5 x 10^-7
16. **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** 1.4 x 10^-7
17. **define\_forced\_retrieve\_queries:** 1.4 x 10^-7
18. **define\_retrieve\_query:** 1.2 x 10^-7
19. **SurferTool:** 1.1 x 10^-7
20. **define\_next\_step:** 1.05 x 10^-7
21. **define\_tool\_calls:** 1.02 x 10^-7
### Key Observations
* The performance varies significantly across the different tools.
* "LLMTool.\_run" has the highest performance, almost double that of some other tools.
* The performance generally decreases from left to right, with some minor fluctuations.
* The minimum value is close to 1 x 10^-7, while the maximum is close to 4.75 x 10^-7.
### Interpretation
The bar chart visualizes the relative performance of various tools, likely in the context of a specific task or benchmark. The wide range of performance suggests that some tools are significantly more effective than others. The "LLMTool.\_run" tool stands out as a top performer. The chart provides a clear comparison, allowing for easy identification of the most and least effective tools. The values are scaled by 10^-7, which could represent a metric like time taken, cost, or error rate. The specific meaning of the y-axis would provide more context.
</details>
(d) Cost per token in dollar.
<details>
<summary>figures/all_cost_summary_cost_second.png Details</summary>

### Visual Description
## Bar Chart: Task Performance
### Overview
The image is a bar chart displaying the performance of various tasks. The y-axis represents a value scaled by 10^-4, and the x-axis lists the tasks. The chart shows the relative performance of each task, with the highest and lowest values explicitly marked.
### Components/Axes
* **Y-axis:** The y-axis is labeled with "x10^-4" and ranges from 0.0 to 3.5, with increments of 0.5.
* **X-axis:** The x-axis lists the following tasks:
* Wikipedia.get\_page\_content
* Wikipedia.ask\_LLM\_which\_article\_to\_explore
* SurferTool
* WebSurfer.forward
* generate\_forced\_solution
* define\_need\_for\_math\_before\_parsing
* parse\_solution\_with\_llm
* define\_next\_step
* define\_final\_solution
* define\_forced\_retrieve\_queries
* define\_retrieve\_queries
* define\_tool\_calls
* define\_cypher\_query\_given\_new\_information
* TextInspector
* merge\_reasons\_to\_insert
* fix\_json
* RunPythonCodeTool.\_fix\_code
* fix\_cypher
* ImageQuestion.\_run
* define\_math\_tool\_call
* LLMTool.\_run
* **Bars:** The bars are all colored in a consistent blue.
* **Maximum Value:** "Max: 3.79e-04" is displayed near the top-right of the chart, indicating the maximum value among all tasks. A dotted horizontal line extends from the y-axis value of approximately 3.79 to the right, visually marking the maximum value.
* **Minimum Value:** "Min: 3.26e-05" is displayed near the bottom-right of the chart, indicating the minimum value among all tasks. A dotted horizontal line extends from the y-axis value of approximately 0.3 to the right, visually marking the minimum value.
### Detailed Analysis
The bar chart presents the performance of different tasks, with the height of each bar representing the task's value. The values are scaled by 10^-4.
* **Wikipedia.get\_page\_content:** The bar height is approximately 3.7 x 10^-4, making it the highest performing task.
* **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** The bar height is approximately 3.7 x 10^-4, making it the second highest performing task.
* **SurferTool:** The bar height is approximately 2.6 x 10^-4.
* **WebSurfer.forward:** The bar height is approximately 2.3 x 10^-4.
* **generate\_forced\_solution:** The bar height is approximately 2.2 x 10^-4.
* **define\_need\_for\_math\_before\_parsing:** The bar height is approximately 2.2 x 10^-4.
* **parse\_solution\_with\_llm:** The bar height is approximately 2.0 x 10^-4.
* **define\_next\_step:** The bar height is approximately 1.3 x 10^-4.
* **define\_final\_solution:** The bar height is approximately 1.2 x 10^-4.
* **define\_forced\_retrieve\_queries:** The bar height is approximately 1.2 x 10^-4.
* **define\_retrieve\_queries:** The bar height is approximately 1.0 x 10^-4.
* **define\_tool\_calls:** The bar height is approximately 0.8 x 10^-4.
* **define\_cypher\_query\_given\_new\_information:** The bar height is approximately 0.8 x 10^-4.
* **TextInspector:** The bar height is approximately 0.8 x 10^-4.
* **merge\_reasons\_to\_insert:** The bar height is approximately 0.7 x 10^-4.
* **fix\_json:** The bar height is approximately 0.7 x 10^-4.
* **RunPythonCodeTool.\_fix\_code:** The bar height is approximately 0.7 x 10^-4.
* **fix\_cypher:** The bar height is approximately 0.6 x 10^-4.
* **ImageQuestion.\_run:** The bar height is approximately 0.5 x 10^-4.
* **define\_math\_tool\_call:** The bar height is approximately 0.3 x 10^-4.
* **LLMTool.\_run:** The bar height is approximately 0.3 x 10^-4, the lowest cost rate among all tasks.
### Key Observations
* The tasks "Wikipedia.get\_page\_content" and "Wikipedia.ask\_LLM\_which\_article\_to\_explore" have the highest cost rates, significantly higher than the other tasks.
* The tasks "define\_math\_tool\_call" and "LLMTool.\_run" have the lowest cost rates.
* The bars decrease monotonically from left to right, indicating that the tasks are sorted in descending order of cost per second.
### Interpretation
The bar chart compares the cost per second incurred by the different tasks. The "Wikipedia.get\_page\_content" and "Wikipedia.ask\_LLM\_which\_article\_to\_explore" tasks incur cost at the highest rate, while "define\_math\_tool\_call" and "LLMTool.\_run" incur it at the lowest. Note that a high cost per second does not by itself imply a high total cost: a short but token-dense step can dominate this metric while contributing little to the overall dollar cost. Combined with the execution-time data, this breakdown helps identify which steps dominate overall spending and should be prioritized for optimization.
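A quick way to put the two annotated values in perspective is the ratio between them; a minimal sketch using only the Max/Min annotations printed on the chart:

```python
# Sketch: ratio between the most and least expensive steps per second,
# using the Max/Min annotations printed on the chart (dollars per second).
max_cost = 3.79e-04  # "Max: 3.79e-04"
min_cost = 3.26e-05  # "Min: 3.26e-05"
spread = max_cost / min_cost  # ratio of the two annotated cost rates
print(f"cost-rate spread: {spread:.1f}x")
```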
</details>
(e) Cost per time in dollar/s.
<details>
<summary>figures/all_cost_summary_tokens_per_second.png Details</summary>

### Visual Description
## Bar Chart: Task Performance
### Overview
The image is a bar chart displaying the throughput of various tasks, measured in tokens per second. The chart shows a clear ranking of tasks from highest to lowest throughput, with "Wikipedia.ask_LLM_which_article_to_explore" and "Wikipedia.get_page_content" showing the highest throughput and "LLMTool._run" showing the lowest.
### Components/Axes
* **X-axis:** Categorical axis listing the names of the tasks. The labels are rotated for readability.
* Categories:
* Wikipedia.ask\_LLM\_which\_article\_to\_explore
* Wikipedia.get\_page\_content
* SurferTool
* WebSurfer.forward
* define\_need\_for\_math\_before\_parsing
* generate\_forced\_solution
* parse\_solution\_with\_llm
* define\_forced\_retrieve\_queries
* define\_next\_step
* define\_tool\_calls
* define\_retrieve\_queries
* define\_final\_solution
* merge\_reasons\_to\_insert
* define\_cypher\_query\_given\_new\_information
* TextInspector
* RunPythonCodeTool.\_fix\_code
* fix\_json
* fix\_cypher
* ImageQuestion.\_run
* define\_math\_tool\_call
* LLMTool.\_run
* **Y-axis:** Numerical axis representing the throughput in tokens per second (/s). The scale ranges from 0 to 2500, with gridlines at intervals of 500.
* Scale: 0, 500, 1000, 1500, 2000, 2500
* **Bars:** Green bars representing the performance value for each task.
* **Annotations:**
* "Max: 2731.51 /s" is located at the top-right of the chart.
* "Min: 68.70 /s" is located near the bottom-right of the chart.
### Detailed Analysis
The bar chart presents a clear throughput ranking of the listed tasks. The throughput values are as follows (approximate, based on bar height):
* **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** \~2650 /s
* **Wikipedia.get\_page\_content:** \~2650 /s
* **SurferTool:** \~2350 /s
* **WebSurfer.forward:** \~1450 /s
* **define\_need\_for\_math\_before\_parsing:** \~1400 /s
* **generate\_forced\_solution:** \~1350 /s
* **parse\_solution\_with\_llm:** \~1300 /s
* **define\_forced\_retrieve\_queries:** \~1200 /s
* **define\_next\_step:** \~1150 /s
* **define\_tool\_calls:** \~900 /s
* **define\_retrieve\_queries:** \~800 /s
* **define\_final\_solution:** \~400 /s
* **merge\_reasons\_to\_insert:** \~350 /s
* **define\_cypher\_query\_given\_new\_information:** \~350 /s
* **TextInspector:** \~300 /s
* **RunPythonCodeTool.\_fix\_code:** \~300 /s
* **fix\_json:** \~250 /s
* **fix\_cypher:** \~200 /s
* **ImageQuestion.\_run:** \~100 /s
* **define\_math\_tool\_call:** \~75 /s
* **LLMTool.\_run:** \~70 /s
### Key Observations
* Two tasks, "Wikipedia.ask\_LLM\_which\_article\_to\_explore" and "Wikipedia.get\_page\_content", achieve significantly higher throughput than all other tasks.
* The throughput drops off sharply after the first three tasks.
* The last few tasks ("fix\_cypher", "ImageQuestion.\_run", "define\_math\_tool\_call", and "LLMTool.\_run") have very low throughput compared to the others.
* The maximum throughput is 2731.51 /s, and the minimum is 68.70 /s.
### Interpretation
The chart indicates a wide range of token throughput across the tasks. The Wikipedia-related tasks are processed at the highest rates, while tasks related to fixing code and running specific tools are significantly slower. This could be due to the complexity of the tasks, the size of the prompts and responses involved, or the resources required for each task. The roughly 40x gap between the fastest and slowest steps suggests that speeding up the slower tasks could yield significant overall improvements, as the slowest steps may act as bottlenecks in the larger workflow.
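The gap between the fastest and slowest steps can be quantified directly from the chart's Max/Min annotations; a minimal sketch:

```python
# Sketch: spread between fastest and slowest steps, from the Max/Min
# annotations printed on the chart (tokens per second).
max_rate = 2731.51  # "Max: 2731.51 /s"
min_rate = 68.70    # "Min: 68.70 /s"
spread = max_rate / min_rate
print(f"throughput spread: {spread:.1f}x")
```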
</details>
(f) Tokens per second.
Figure 17: Overview of the execution time as well as the cost in dollars. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_match.png Details</summary>

### Visual Description
## Stacked Bar Chart: Tool Choice Correctness Analysis
### Overview
The image is a stacked bar chart that analyzes the correctness of tool choices. The chart displays the distribution of correct, partially correct (medium and low match), and wrong tool choices. The y-axis represents the number of questions, and the x-axis label reports the total number of questions analyzed. Percentage values are given for each category within the stacked bar.
### Components/Axes
* **Title:** Tool Choice Correctness Analysis
* **Y-axis Title:** Number of Questions
* Y-axis scale ranges from 0 to 160, with tick marks at intervals of 20 (0, 20, 40, 60, 80, 100, 120, 140, 160).
* **X-axis Title:** Total Questions Analyzed: 165
* **Legend:** Located on the top-right of the chart.
* Red: Wrong Tool Choice
* Orange: Partially Correct (Low Match)
* Yellow: Partially Correct (Medium Match)
* Green: Correct Tool Choice
### Detailed Analysis
The stacked bar is composed of four colored segments, each representing a different level of tool choice correctness.
* **Green (Correct Tool Choice):** The bottom segment of the bar is green, representing the number of questions with the correct tool choice. It accounts for 36.4% of the total. The green bar extends to approximately 60 on the y-axis.
* **Yellow (Partially Correct - Medium Match):** The next segment is yellow, representing partially correct tool choices with a medium match. It accounts for 35.8% of the total. The yellow bar extends to approximately 120 on the y-axis.
* **Orange (Partially Correct - Low Match):** Above the yellow segment is an orange segment, representing partially correct tool choices with a low match. It accounts for 10.9% of the total. The orange bar extends to approximately 140 on the y-axis.
* **Red (Wrong Tool Choice):** The top segment is red, representing the number of questions with the wrong tool choice. It accounts for 17.0% of the total. The red bar extends to approximately 165 on the y-axis.
### Key Observations
* The "Correct Tool Choice" and "Partially Correct (Medium Match)" categories have the highest percentages, at 36.4% and 35.8% respectively.
* The "Wrong Tool Choice" and "Partially Correct (Low Match)" categories have the lowest percentages, at 17.0% and 10.9% respectively.
* The total number of questions analyzed is 165.
### Interpretation
The chart indicates that in the analyzed dataset, the tool choices were correct or partially correct (medium match) in the majority of cases, while outright wrong tool choices are relatively rare. This suggests that the tool-selection process is generally effective, but there is still room for improvement, particularly in reducing the number of "Wrong Tool Choice" and "Partially Correct (Low Match)" selections. The distribution highlights where the tool-selection logic could be refined.
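As a sanity check, the reported percentages can be converted back into absolute question counts and cumulative stack heights; a minimal Python sketch, assuming the percentages printed on the chart are exact:

```python
# Sketch: recover absolute counts and cumulative stack heights from the
# percentages printed on the chart (165 questions total).
total_questions = 165
shares = {  # bottom (green) to top (red), as stacked in the chart
    "Correct Tool Choice": 0.364,
    "Partially Correct (Medium Match)": 0.358,
    "Partially Correct (Low Match)": 0.109,
    "Wrong Tool Choice": 0.170,
}
counts = {name: round(total_questions * p) for name, p in shares.items()}

# Cumulative heights match the segment tops read off the y-axis
# (~60, ~120, ~140, 165).
cumulative, running = [], 0
for name, n in counts.items():
    running += n
    cumulative.append((name, running))
```

The rounded counts (60, 59, 18, 28) sum exactly to the 165 analyzed questions, consistent with the chart's annotations.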
</details>
Figure 18: Analysis of the tool selection. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_choice_analysis.png Details</summary>

### Visual Description
## Sankey Diagram: Tool Correctness to Question Success Analysis
### Overview
The image is a Sankey diagram illustrating the relationship between the correctness of a tool's match and the success of a question. The diagram shows how different levels of tool match (Partial Low, Correct, Partial Medium, Wrong) correspond to the outcomes of GAIA questions (Failed, Successful). The width of the connecting flows represents the number of instances for each combination.
### Components/Axes
* **Title:** Tool Correctness to Question Success Analysis
* **Left Axis (Tool Choice):**
* ToolMatch.PARTIAL\_LOW (Orange): N = 18
* ToolMatch.CORRECT (Green): N = 60
* ToolMatch.PARTIAL\_MEDIUM (Yellow): N = 59
* ToolMatch.WRONG (Red): N = 28
* **Right Axis (GAIA Question):**
* Failed (Dark Gray): N = 125
* Successful (Dark Gray): N = 40
### Detailed Analysis
* **ToolMatch.PARTIAL\_LOW (Orange):**
* 18 instances total.
* Connects primarily to "Failed" with a smaller portion to "Successful".
* **ToolMatch.CORRECT (Green):**
* 60 instances total.
* Connects primarily to "Successful" with a smaller portion to "Failed".
* **ToolMatch.PARTIAL\_MEDIUM (Yellow):**
* 59 instances total.
* Connects to both "Failed" and "Successful" in roughly equal proportions.
* **ToolMatch.WRONG (Red):**
* 28 instances total.
* Connects primarily to "Failed" with a smaller portion to "Successful".
* **GAIA Question - Failed (Dark Gray):**
* 125 instances total.
* Receives input from all tool match categories, with the largest contribution from "Partial Medium" and "Wrong".
* **GAIA Question - Successful (Dark Gray):**
* 40 instances total.
* Receives input primarily from "Correct" and "Partial Medium".
### Key Observations
* A "Correct" tool match is strongly associated with a "Successful" question outcome.
* A "Wrong" tool match is strongly associated with a "Failed" question outcome.
* "Partial Medium" tool matches have a relatively even distribution between "Failed" and "Successful" question outcomes.
* "Partial Low" tool matches are more likely to result in a "Failed" question outcome.
### Interpretation
The Sankey diagram suggests a clear correlation between the correctness of the tool match and the success of the question. A correct tool match significantly increases the likelihood of a successful question, while a wrong tool match increases the likelihood of failure. Partial matches show a mixed outcome, with "Partial Medium" having a more balanced distribution and "Partial Low" leaning towards failure. This data could be used to evaluate the effectiveness of the tool and identify areas for improvement. The diagram highlights the importance of accurate tool matching for achieving successful question outcomes.
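A quick consistency check on the diagram's marginals: the N values printed on both axes should each sum to the 165 analyzed GAIA questions. A minimal sketch:

```python
# Sketch: consistency check on the Sankey marginals using the N values
# printed on the diagram.
tool_match = {"PARTIAL_LOW": 18, "CORRECT": 60, "PARTIAL_MEDIUM": 59, "WRONG": 28}
outcome = {"Failed": 125, "Successful": 40}

# Both sides of the diagram cover the same 165 questions.
assert sum(tool_match.values()) == sum(outcome.values()) == 165

# Overall success rate across all tool-match categories.
success_rate = outcome["Successful"] / sum(outcome.values())
print(f"overall success rate: {success_rate:.1%}")
```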
</details>
Figure 19: Analysis of the tool selection. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_usage_count.png Details</summary>

### Visual Description
## Donut Chart: KGOT Tool Usage Distribution
### Overview
The image is a donut chart illustrating the distribution of usage for six different tools within the KGOT framework. The chart shows the percentage of usage for each tool, with "ask_search_agent" having the highest usage and "extract_zip" having the lowest. The chart also indicates that these tools were used for 165 GAIA questions, with a total tool usage count of 173.
### Components/Axes
* **Title:** KGOT Tool Usage Distribution
* **Subtitle:** 6 unique tools for 165 GAIA questions
* **Center Text:** Total Tool Usage Count: 173
* **Segments (Tools and Percentages):**
* ask\_search\_agent: 61.3% (Blue)
* inspect\_file\_as\_text: 15.6% (Teal)
* llm\_query: 11% (Light Teal)
* image\_inspector: 5.78% (Light Green)
* run\_python\_code: 5.2% (Light Green)
* extract\_zip: 1.16% (Light Yellow)
### Detailed Analysis
The donut chart is divided into six segments, each representing a different tool. The size of each segment corresponds to the percentage of times that tool was used.
* **ask\_search\_agent:** This tool accounts for the largest portion of the usage, at 61.3%. The segment is colored blue and occupies a significant portion of the donut.
* **inspect\_file\_as\_text:** This tool represents 15.6% of the usage. The segment is teal.
* **llm\_query:** This tool accounts for 11% of the usage. The segment is light teal.
* **image\_inspector:** This tool represents 5.78% of the usage. The segment is light green.
* **run\_python\_code:** This tool accounts for 5.2% of the usage. The segment is light green.
* **extract\_zip:** This tool has the lowest usage, at 1.16%. The segment is light yellow.
The total tool usage count is 173, which is displayed in the center of the donut chart.
### Key Observations
* The "ask\_search\_agent" tool is used far more frequently than any other tool, accounting for over 60% of the total usage.
* The "extract\_zip" tool is used very infrequently, representing only a small fraction of the total usage.
* The remaining tools ("inspect\_file\_as\_text", "llm\_query", "image\_inspector", and "run\_python\_code") have moderate usage, ranging from approximately 5% to 16%.
### Interpretation
The data suggests that the "ask\_search\_agent" tool is the most frequently needed tool for answering GAIA questions within the KGOT framework. The large gap between "ask\_search\_agent" and the other tools may indicate that most GAIA questions require web search and retrieval. The low usage of "extract\_zip" may simply reflect that few questions involve archive attachments, rather than any shortcoming of the tool itself. The distribution of tool usage provides insights into the relative importance and utility of each tool within the KGOT framework for addressing GAIA questions.
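Assuming the printed percentages are exact, approximate absolute usage counts can be recovered from the total tool usage count of 173; a minimal sketch:

```python
# Sketch: convert the printed percentages back into approximate absolute
# usage counts, given the total tool usage count of 173.
total_usage = 173
shares = {
    "ask_search_agent": 0.613,
    "inspect_file_as_text": 0.156,
    "llm_query": 0.110,
    "image_inspector": 0.0578,
    "run_python_code": 0.052,
    "extract_zip": 0.0116,
}
counts = {tool: round(total_usage * p) for tool, p in shares.items()}
```

The rounded counts (106, 27, 19, 10, 9, 2) sum exactly to 173, consistent with the total displayed in the center of the donut.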
</details>
Figure 20: Analysis of the tool usage. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.