# Affordable AI Assistants with Knowledge Graph of Thoughts
**Authors**: Maciej Besta (ETH Zurich), Lorenzo Paleari (ETH Zurich), Jia Hao Andrea Jiang (ETH Zurich), Robert Gerstenberger (ETH Zurich), You Wu (ETH Zurich), Jón Gunnar Hannesson (ETH Zurich), Patrick Iff (ETH Zurich), Ales Kubicek (ETH Zurich), Piotr Nyczyk, Diana Khimey (ETH Zurich), Nils Blach (ETH Zurich), Haiqiang Zhang (ETH Zurich), Tao Zhang (ETH Zurich), Peiran Ma (ETH Zurich), Grzegorz Kwaśniewski (ETH Zurich), Marcin Copik (ETH Zurich), Hubert Niewiadomski, Torsten Hoefler (ETH Zurich)
Abstract
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36× compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Website & code: https://github.com/spcl/knowledge-graph-of-thoughts
1 Introduction
Large Language Models (LLMs) are transforming the world. However, training LLMs is expensive, time-consuming, and resource-intensive. In order to democratize the access to generative AI, the landscape of agent systems has massively evolved during the last two years (LangChain Inc., 2025a; Rush, 2023; Kim et al., 2024; Sumers et al., 2024; Hong et al., 2024; Guo et al., 2024; Edge et al., 2025; Besta et al., 2025c; Zhuge et al., 2024; Beurer-Kellner et al., 2024; Shinn et al., 2023; Kagaya et al., 2024; Zhao et al., 2024a; Stengel-Eskin et al., 2024; Wu et al., 2024). These schemes have been applied to numerous tasks in reasoning (Creswell et al., 2023; Bhattacharjya et al., 2024; Besta et al., 2025c), planning (Wang et al., 2023c; Prasad et al., 2024; Shen et al., 2023; Huang et al., 2023), software development (Tang et al., 2024), and many others (Xie et al., 2024; Li & Vasarhelyi, 2024; Schick et al., 2023; Beurer-Kellner et al., 2023).
Among the most impactful applications of LLM agents is the development of AI assistants capable of helping with a wide variety of tasks. These assistants promise to serve as versatile tools, enhancing productivity and decision-making across domains. From aiding researchers with complex problem-solving to managing day-to-day tasks for individuals, AI assistants are becoming an indispensable part of modern life. Developing such systems is highly relevant, but remains challenging, particularly in designing solutions that are both effective and economically viable.
The GAIA benchmark (Mialon et al., 2024) has become a key standard for evaluating LLM-based agent systems across diverse tasks, including web navigation, code execution, image reasoning, scientific QA, and multimodal challenges. Despite its introduction nearly two years ago, top-performing solutions still struggle with many tasks. Moreover, operational costs remain high: running all validation tasks with Hugging Face Agents (Roucher & Petrov, 2025) and GPT-4o costs approximately $200, underscoring the need for more affordable alternatives. Smaller models like GPT-4o mini significantly reduce expenses but suffer from steep drops in task success, making them insufficient. Open large models also pose challenges due to demanding infrastructure needs, while smaller open models, though cheaper to run, lack sufficient capabilities.
To address these challenges, we propose Knowledge Graph of Thoughts (KGoT), a novel AI assistant architecture that significantly reduces task execution costs while maintaining a high success rate (contribution #1). The central innovation of KGoT lies in its use of a knowledge graph (KG) (Singhal, 2012; Besta et al., 2024b) to represent knowledge relevant to a given task. A KG organizes information into triples, providing a structured representation of knowledge that small, cost-effective models can efficiently process. Hence, KGoT "turns the unstructured into the structured", i.e., KGoT turns the often unstructured data such as website contents or PDF files into structured KG triples. This approach enhances the comprehension of task requirements, enabling even smaller models to achieve performance levels comparable to much larger counterparts, but at a fraction of the cost.
The KGoT architecture (contribution #2) implements this concept by iteratively constructing a KG from the task statement, incorporating tools as needed to gather relevant information. The constructed KG is kept in a graph store, serving as a repository of structured knowledge. Once sufficient information is gathered, the LLM attempts to solve the task by either directly embedding the KG in its context or querying the graph store for specific insights. This approach ensures that the LLM operates with a rich and structured knowledge base, improving its task-solving ability without incurring the high costs typically associated with large models. The architecture is modular and extensible towards different types of graph query languages and tools.
Our evaluation against top GAIA leaderboard baselines demonstrates its effectiveness and efficiency (contribution #3). KGoT with GPT-4o mini solves over 2× more tasks from the validation set than Hugging Face Agents with GPT-4o or GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs: from $187 with GPT-4o to roughly $5 with GPT-4o mini. KGoT's benefits generalize to other models, baselines, and benchmarks such as SimpleQA (Wei et al., 2024).
On top of that, KGoT reduces noise and simultaneously minimizes bias and improves fairness by externalizing reasoning into an explicit knowledge graph rather than relying solely on the LLM's internal generation (contribution #4). This ensures that key steps when resolving tasks are grounded in transparent, explainable, and auditable information.
2 Knowledge Graph of Thoughts
We first illustrate the key idea, namely, using a knowledge graph to structurally encode the task contents. Figure 1 shows an example task and its corresponding evolving KG.
2.1 What is a Knowledge Graph?
A knowledge graph (KG) is a structured representation of information that organizes knowledge into a graph-based format, allowing for efficient querying, reasoning, and retrieval. Formally, a KG consists of a set of triples, where each triple $(s,p,o)$ represents a relationship between two entities $s$ (subject) and $o$ (object) through a predicate $p$. For example, the triple $(\text{``Earth''},\text{``orbits''},\text{``Sun''})$ captures the fact that Earth orbits the Sun. Mathematically, a knowledge graph can be defined as a directed labeled graph $G=(V,E,L)$, where $V$ is the set of vertices (entities), $E \subseteq V \times V$ is the set of edges (relationships), and $L$ is the set of labels (predicates) assigned to the edges. Each entity or predicate may further include properties or attributes, enabling richer representation. Knowledge graphs are widely used in various domains, including search engines, recommendation systems, and AI reasoning, as they facilitate both efficient storage and complex queries.
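The triple-based definition above can be sketched in a few lines of Python. This is a minimal illustrative model, not KGoT's actual graph store; the class and method names are hypothetical.

```python
# Minimal sketch of a KG as a set of (subject, predicate, object) triples,
# with a pattern-matching helper. Illustrative only; KGoT uses dedicated
# backends (Neo4j, RDF4J, NetworkX) instead of this toy class.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # each element is an (s, p, o) tuple

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

kg = KnowledgeGraph()
kg.add("Earth", "orbits", "Sun")
kg.add("Moon", "orbits", "Earth")

# What does Earth orbit?
print(kg.match(s="Earth", p="orbits"))  # [('Earth', 'orbits', 'Sun')]
```

Labels and attributes (the $L$ and per-entity properties in the formal definition) would be carried as additional fields on vertices and edges in a real backend.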
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Construction Process
### Overview
The image depicts a diagram illustrating the process of constructing a knowledge graph (KG) to answer a complex question posed from the GAIA Benchmark. The process involves starting with an initial KG, querying the web for additional data, invoking a text inspector (specifically a YouTube transcriber), and finally extracting information from the enhanced graph to generate a response. The diagram shows the evolution of the knowledge graph through these stages, with examples of nodes and relationships.
### Components/Axes
The diagram is structured horizontally, showing a sequence of steps. The main components are:
* **Input Task Statement:** A text box containing a complex question from the GAIA Benchmark.
* **Knowledge Graph:** Three iterations of a knowledge graph are shown, labeled "Knowledge Graph", "Knowledge Graph (enhanced)", and "Knowledge Graph (enhanced)".
* **Query Web:** An icon representing a globe with the text "query web for additional data".
* **Invoke Inspector:** An icon representing a computer screen with the text "invoke inspector (YouTube transcriber)".
* **Extract Info & Generate Response:** An icon representing a cross mark with the text "extract info from graph and generate response".
* **Response:** A text box containing the answer generated from the knowledge graph.
* **Arrows:** Arrows indicate the flow of the process from left to right.
### Detailed Analysis or Content Details
**1. Input Task Statement:**
The text reads: "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?"
**2. Knowledge Graph (Initial):**
* **Nodes:**
* Gollum (LotR)
* Andy Serkis
* **Relationship:**
* "interpreted by" connecting Gollum (LotR) to Andy Serkis.
**3. Knowledge Graph (Enhanced - Stage 1):**
* **Nodes:**
* Gollum (LotR)
* Andy Serkis
* The Silmarillion
* **Relationships:**
* "interpreted by" connecting Gollum (LotR) to Andy Serkis.
* "narrated" connecting Andy Serkis to The Silmarillion.
* The Silmarillion has the following attributes:
* Type: Audio
* Date: Jul, 2017
* ID: 20160426-07
**4. Knowledge Graph (Enhanced - Stage 2):**
* **Nodes:**
* Gollum (LotR)
* Andy Serkis
* The Silmarillion
* We Are Stars
* **Relationships:**
* "interpreted by" connecting Gollum (LotR) to Andy Serkis.
* "narrated" connecting Andy Serkis to both The Silmarillion and We Are Stars.
* We Are Stars has the following attributes:
* Type: VR 360
* Date: Mar, 2018
* ID: 20160426-10
* A text snippet is connected to "We Are Stars": "...Dinosaurs dominated the earth for over a hundred million years..."
**5. Response:**
The text reads: "In the YouTube 360 VR video 'We Are Stars', narrated by Andy Serkis, the number mentioned after the dinosaurs first appearance is 100,000,000"
**6. Process Flow:**
* The process starts with the "Input Task Statement".
* An initial "Knowledge Graph" is built.
* The web is queried for "additional data".
* A "YouTube transcriber" is invoked.
* The "Knowledge Graph" is enhanced with the new data.
* Information is extracted from the enhanced graph to generate the "Response".
### Key Observations
* The diagram demonstrates how a knowledge graph can be iteratively built and enhanced to answer complex questions.
* The inclusion of a YouTube transcriber highlights the importance of processing multimedia content to extract relevant information.
* The example shows how the graph connects entities (Gollum, Andy Serkis, videos) and their relationships (interpreted by, narrated).
* The final response is directly derived from the information contained within the enhanced knowledge graph.
### Interpretation
The diagram illustrates a sophisticated approach to question answering, leveraging knowledge graphs and multimedia processing. The process begins with a natural language query and transforms it into a structured representation (the knowledge graph). By querying the web and transcribing video content, the graph is enriched with relevant information. The final step involves extracting the answer from the graph, demonstrating the power of this approach for complex reasoning and information retrieval. The diagram highlights the importance of combining structured knowledge with unstructured data (video transcripts) to achieve accurate and comprehensive answers. The specific example focuses on temporal relationships ("directly after") and numerical extraction, showcasing the system's ability to handle nuanced queries. The inclusion of metadata (Type, Date, ID) for each video suggests a focus on provenance and data quality.
</details>
Figure 1: The key idea behind Knowledge Graph of Thoughts (KGoT): transforming the representation of a task for an AI assistant from a textual form into a knowledge graph (KG). As an example, we use a Level-3 (i.e., highest difficulty) task from the GAIA benchmark. In order to solve the task, KGoT evolves this KG by adding relevant information that brings the task closer to completion. This is achieved by iteratively running various tools. Finally, the task is solved by extracting the relevant information from the KG, using, for example, a graph query, or an LLM's inference process with the KG provided as a part of the input prompt. More examples of KGs are in Appendix A.
2.2 Harnessing Knowledge Graphs for Effective AI Assistant Task Resolution
At the heart of KGoT is the process of transforming a task solution state into an evolving KG. The KG representation of the task is built from "thoughts" generated by the LLM. These "thoughts" are intermediate insights identified by the LLM as it works through the problem. Each thought contributes to expanding or refining the KG by adding vertices or edges that represent new information.
For example, consider the following Level 3 (i.e., highest difficulty) task from the GAIA benchmark: "In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings' Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?" (see Figure 1 for an overview; more examples of constructed KGs are in Appendix A). Here, the KG representation of the task solution state has a vertex "Gollum (LotR)". Then, the thought "Gollum from Lord of the Rings is interpreted by Andy Serkis" results in adding a vertex for "Andy Serkis", and linking "Gollum (LotR)" to "Andy Serkis" with the predicate "interpreted by". Such integration of thought generation and KG construction creates a feedback loop where the KG continuously evolves as the task progresses, aligning the representation with problem requirements.
In order to evolve the KG task representation, KGoT iteratively interacts with tools and retrieves more information. For instance, the system might query the internet to identify videos narrated by Andy Serkis (e.g., "The Silmarillion" and "We Are Stars"). It can also use a YouTube transcriber tool to find their publication date. This iterative refinement allows the KG to model the current "state" of a task at each step, creating a more complete and structured representation of this task and bringing it closer to completion. Once the KG has been sufficiently populated with task-specific knowledge, it serves as a robust resource for solving the problem.
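The evolution of the Figure 1 task-state KG can be sketched with the NetworkX backend mentioned later in Section 2.4. The attribute names (`predicate`, `type`, `date`) are illustrative choices, not KGoT's exact schema.

```python
import networkx as nx

# Sketch: the evolving task-state KG from the Figure 1 example.
kg = nx.MultiDiGraph()

# Iteration 1: extracted directly from the task statement / LLM thought.
kg.add_edge("Gollum (LotR)", "Andy Serkis", predicate="interpreted by")

# Iteration 2: a web-query tool adds candidate videos with their attributes.
kg.add_node("The Silmarillion", type="Audio", date="Jul, 2017")
kg.add_edge("Andy Serkis", "The Silmarillion", predicate="narrated")
kg.add_node("We Are Stars", type="VR 360", date="Mar, 2018")
kg.add_edge("Andy Serkis", "We Are Stars", predicate="narrated")

# Later iterations would attach transcript snippets from the YouTube
# transcriber, bringing the task closer to completion.
videos = [o for _, o, d in kg.out_edges("Andy Serkis", data=True)
          if d["predicate"] == "narrated"]
match = [v for v in videos if kg.nodes[v].get("date") == "Mar, 2018"]
print(match)  # ['We Are Stars']
```

Each tool invocation thus maps to a small batch of vertex and edge insertions on the same graph object.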
In addition to adding new graph elements, KGoT also supports other graph operations. This includes removing nodes and edges, used as a part of noise elimination strategies.
2.3 Extracting Information from the KG
To accommodate different tasks, KGoT supports different ways to extract the information from the KG. Currently, we offer graph query languages or general-purpose languages; each of them can be combined with the so-called Direct Retrieval. First, one can use a graph query, prepared by the LLM in a language such as Cypher (Francis et al., 2018) or SPARQL (Pérez et al., 2009), to extract the answer to the task from the graph. This works particularly well for tasks that require retrieving specific patterns within the KG. Second, we also support general scripts prepared by the LLM in a general-purpose programming language such as Python. This approach, while not as effective as query languages for pattern matching, offers greater flexibility and may outperform the latter when a task requires, for example, traversing a long path in the graph. Third, in certain cases, once enough information is gathered into the KG, it may be more effective to directly paste the KG into the LLM context and ask the LLM to solve the task, instead of preparing a dedicated query or script. We refer to this approach as Direct Retrieval.
The above schemes offer a tradeoff between accuracy, cost, and runtime. For example, when low latency is the priority, general-purpose languages should be used, as they provide an efficient lightweight representation of the KG and offer rapid access and modification of graph data. When token cost is most important, one should avoid Direct Retrieval (which consumes many tokens as it directly embeds the KG into the LLM context) and focus on either query or general-purpose languages, with a certain preference for the former, because its generated queries tend to be shorter than scripts. Finally, when aiming to solve as many tasks as possible, one should experiment with all three schemes. As shown in the Evaluation section, these methods have complementary strengths: Direct Retrieval is effective for broad contextual understanding, while graph queries and scripts are better suited for structured reasoning.
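The three extraction schemes can be contrasted on the running example. The Cypher pattern and the serialization format below are illustrative assumptions, not KGoT's generated output.

```python
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("Gollum (LotR)", "Andy Serkis", predicate="interpreted by")
kg.add_edge("Andy Serkis", "We Are Stars", predicate="narrated")

# (1) Graph query: an LLM-generated Cypher pattern (illustrative; it would
# run against the Neo4j backend, not NetworkX).
cypher = """
MATCH (:Entity {name: 'Gollum (LotR)'})-[:INTERPRETED_BY]->(actor)
      -[:NARRATED]->(video)
RETURN video.name
"""

# (2) General-purpose script: a short LLM-generated traversal over the KG.
actor = next(o for _, o, d in kg.out_edges("Gollum (LotR)", data=True)
             if d["predicate"] == "interpreted by")
video = next(o for _, o, d in kg.out_edges(actor, data=True)
             if d["predicate"] == "narrated")
print(video)  # We Are Stars

# (3) Direct Retrieval: serialize the whole KG into the prompt and let the
# LLM answer; token cost grows with graph size.
prompt_kg = "\n".join(f"({s}, {d['predicate']}, {o})"
                      for s, o, d in kg.edges(data=True))
```

The script variant (2) illustrates why general-purpose languages help with long-path traversals: each hop is an explicit, cheap lookup rather than a pattern the query planner must match.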
2.4 Representing the KG
KGoT can construct three interoperable KG representations: Property graphs (used with graph query languages such as Cypher and systems such as Neo4j (Robinson et al., 2015)), RDF graphs (used with graph query languages such as SPARQL and systems such as RDF4J (Ben Mahria et al., 2021)), and the adjacency list graphs (Besta et al., 2018) (used with general-purpose languages such as Python and systems such as NetworkX (NetworkX Developers, 2025)).
Each representation supports a different class of analysis. The Property graph view facilitates analytics such as pattern matching, filtering, or motif queries directly on the evolving task-state graph. The RDF graph view facilitates reasoning over ontology constraints, schema validation, and SPARQL-based inference for missing links. The adjacency list representation with NetworkX facilitates Python-based graph analytics, for example centrality measures, connected components, clustering coefficients, etc., all on the same KG snapshots.
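As a concrete instance of the adjacency-list view, the analytics named above run directly on a KG snapshot via NetworkX; the graph contents here are the illustrative Figure 1 entities.

```python
import networkx as nx

# Sketch: Python-based analytics on a KG snapshot, as enabled by the
# adjacency-list (NetworkX) representation.
g = nx.Graph()
g.add_edges_from([("Gollum (LotR)", "Andy Serkis"),
                  ("Andy Serkis", "We Are Stars"),
                  ("Andy Serkis", "The Silmarillion")])

# Centrality: which entity is most connected in the current task state?
degree = nx.degree_centrality(g)
hub = max(degree, key=degree.get)
print(hub)  # Andy Serkis

# Connected components and clustering on the same snapshot.
components = list(nx.connected_components(g))
clustering = nx.clustering(g, "Andy Serkis")
```

The same snapshot could be loaded into Neo4j or RDF4J for query-based or ontology-based analysis without changing its contents.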
Appendix A contains examples of task-specific KGs, illustrating how their topology varies with the task domain (e.g., tree-like procedural chains vs. dense relational subgraphs in multi-entity reasoning).
2.5 Bias, Fairness, and Noise Mitigation through KG-Based Representation
KGoT externalizes and structures the reasoning process, which reduces noise, mitigates model bias, and improves fairness, because in each iteration both the outputs from tools and LLM thoughts are converted into triples and stored explicitly. Unlike opaque monolithic LLM generations, this fosters transparency and facilitates identifying biased inference steps. It also facilitates noise mitigation: new triples can be explicitly checked for the quality of their information content before being integrated into the KG, and existing triples can also be removed if they are deemed redundant (examples of such triples that have been found and removed are in Appendix B.6).
3 System Architecture
The modular and flexible KGoT architecture, pictured in Figure 2, consists of three main components: the Graph Store Module, the Controller, and the Integrated Tools, each playing a critical role in the task-solving process. Below, we provide a detailed description of each component and its role in the system. Additional details are in Appendix B (architecture) and in Appendix C (prompts).
3.1 Maintaining the Knowledge Graph with the Graph Store Module
A key component of the KGoT system is the Graph Store Module, which manages the storage and retrieval of the dynamically evolving knowledge graph which represents the task state. In order to harness graph queries, we use a graph database backend; in the current KGoT implementation, we test Cypher together with Neo4j (Robinson et al., 2015), an established graph database (Besta et al., 2023b; c), as well as SPARQL together with the RDF4J backend (Ben Mahria et al., 2021). Then, in order to support graph accesses using a general-purpose language, KGoT harnesses the NetworkX library (NetworkX Developers, 2025) and Python. Note that the extensible design of KGoT enables seamless integration of any other backends and languages.
3.2 Managing the Workflow with the Controller Module
The Controller orchestrates the interactions between the KG and the tools. Upon receiving a user query, it iteratively interprets the task, determines the appropriate tools to invoke based on the KG state and task needs, and integrates tool outputs back into the KG. The Controller uses a dual-LLM architecture with a clear separation of roles: the LLM Graph Executor constructs and evolves the KG, while the LLM Tool Executor manages tool selection and execution.
After each iteration that constructs and evolves the KG, the LLM Graph Executor determines the next steps. It identifies any missing information necessary to solve the task, formulates appropriate queries for the graph store interaction (retrieve/insert operations), and parses intermediate or final results for integration into the KG. It also prepares the final response to the user based on the KG.
The LLM Tool Executor operates as the executor of the plan devised by the LLM Graph Executor. It identifies the most suitable tools for retrieving missing information, considering factors such as tool availability, relevance, and the outcome of previous tool invocation attempts. For example, if a web crawler fails to retrieve certain data, the LLM Tool Executor might prioritize a different retrieval mechanism or adjust its queries. The LLM Tool Executor manages the tool execution process, including interacting with APIs, performing calculations, or extracting information, and returns the results to the LLM Graph Executor for further reasoning and integration into the KG.
3.3 Ensuring Versatile and Extensible Set of Integrated Tools
KGoT offers a hierarchical suite of tools tailored to diverse task needs. The Python Code Tool enables dynamic script generation and execution for complex computations. The LLM Tool supplements the controller's reasoning by integrating an auxiliary language model, enhancing knowledge access while minimizing hallucination risk. For multimodal inputs, the Image Tool supports image processing and extraction. Web-based tasks are handled by the Surfer Agent (based on the design by Hugging Face Agents (Roucher & Petrov, 2025)), which leverages tools like the Wikipedia Tool, granular navigation tools (PageUp, PageDown, Find), and SerpApi (SerpApi LLM, 2025) for search. Additional tools include the ExtractZip Tool for compressed files and the Text Inspector Tool for converting content from sources like MP3s and YouTube transcripts into Markdown. Finally, the user can seamlessly add a new tool by initializing the tool, passing in the logger object for tool use statistics, and appending the tool to the tool list via a Tool Manager object. We require all implemented tools to adhere to LangChain's BaseTool interface class. This way, the list of tools managed by the Tool Manager can be directly bound to the LLM Tool Executor via LangChain bind_tools, further facilitating new tools.
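The registration path described above can be sketched as follows. To keep the example dependency-free, a stand-in base class mimics the LangChain `BaseTool` contract (a `name`, a `description`, and a `_run` method); in KGoT the tool would subclass `langchain_core.tools.BaseTool`, and the tool list would be bound via `bind_tools`. The `WordCountTool` is a hypothetical example tool.

```python
# Sketch of adding a new tool, assuming the BaseTool contract.
class BaseTool:  # stand-in for langchain_core.tools.BaseTool
    name: str = ""
    description: str = ""

    def run(self, *args, **kwargs):
        return self._run(*args, **kwargs)

class WordCountTool(BaseTool):
    """Hypothetical tool: counts words in a piece of text."""
    name = "word_count"
    description = "Counts the words in a piece of text."

    def _run(self, text: str) -> int:
        return len(text.split())

tools = []                     # the Tool Manager's tool list
tools.append(WordCountTool())  # registration step described above
print(tools[0].run("knowledge graph of thoughts"))  # 4
```

The shared interface is what lets the LLM Tool Executor treat every tool uniformly when selecting and invoking it.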
3.4 Ensuring High-Performance & Scalability
The scalability optimizations used include (1) asynchronous execution using asyncio (Python Software Foundation, 2025b) to parallelize LLM tool invocations, mitigating I/O bottlenecks and reducing idle time, (2) graph operation parallelism by reformulating LLM-generated Cypher queries to enable concurrent execution of independent operations in a graph database, and (3) MPI-based distributed processing, which decomposes workloads into atomic tasks distributed across ranks using a work-stealing algorithm to ensure balanced computational load and scalability.
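Optimization (1) can be sketched with `asyncio.gather`: independent, I/O-bound tool calls overlap instead of running back to back. The tool names and delays are illustrative stand-ins for real web or LLM API calls.

```python
import asyncio
import time

# Sketch: three independent tool invocations launched concurrently.
async def call_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for network / API latency
    return f"{name}: done"

async def main():
    # All three calls overlap, so total time is roughly max(delay),
    # not sum(delay).
    return await asyncio.gather(
        call_tool("web_search", 0.1),
        call_tool("transcriber", 0.1),
        call_tool("math_solver", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

With sequential awaits the same three calls would take about 0.3s; gathered, they finish in about 0.1s, which is the idle-time reduction described above.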
3.5 Ensuring System Robustness
Robustness is ensured with two established mechanisms, Self-Consistency (Wang et al., 2023b) (via majority voting) and LLM-as-a-Judge (Gu et al., 2025) (other strategies such as embedding-based stability are also applicable (Besta et al., 2025d)). With Self-Consistency, we query the LLM multiple times when deciding whether to insert more data into the KG or retrieve existing data, when deciding which tool to use, and when parsing the final solution. This approach reduces the impact of single-instance errors or inconsistencies in various parts of the KGoT architecture. LLM-as-a-Judge further reinforces the robustness, by directly employing the LLM agent to make these decisions based on generated reasoning chains.
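The Self-Consistency step reduces to sampling the same decision several times and keeping the majority answer. In the sketch below, `sample_llm_decision` is a stand-in that replays pre-recorded answers rather than calling a real, stochastic LLM.

```python
from collections import Counter

# Sketch of Self-Consistency via majority voting over repeated LLM samples,
# e.g. when deciding which tool to invoke next.

def sample_llm_decision(samples):
    # Stand-in for repeated stochastic LLM calls.
    yield from samples

def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

answers = list(sample_llm_decision(
    ["web_search", "web_search", "transcriber", "web_search", "math_solver"]))
decision = majority_vote(answers)
print(decision)  # web_search
```

A single aberrant sample ("math_solver" here) is outvoted, which is exactly how single-instance errors are absorbed.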
Overall, both Self-Consistency and LLM-as-a-Judge have been shown to significantly enhance the robustness of prompting. For example, MT-Bench and Chatbot Arena show that strong judges (e.g., GPT-4 class) match human preferences with 80% agreement or more, on par with human-human agreement (Zheng et al., 2023). Prometheus and Prometheus-2 further demonstrate open evaluator LMs with the highest correlations to both humans and proprietary judges across direct-assessment and pairwise settings, and AlpacaEval has been validated against approximately 20K human annotations, addressing earlier concerns about reproducibility at scale. Similarly reliable gains have been shown for Self-Consistency (Wang et al., 2023b).
3.6 Ensuring Layered Error Containment & Management
To manage LLM-generated syntax errors, KGoT includes LangChain's JSON parsers that detect syntax issues. When a syntax error is detected, the system first attempts to correct it by adjusting the problematic syntax using different encoders, such as "unicode_escape" (Python Software Foundation, 2025a). If the issue persists, KGoT employs a retry mechanism (three attempts by default) that uses the LLM to rephrase the query/command and attempts to regenerate its output. If the error still persists, the system logs it for further analysis, bypasses the problematic query, and continues with other iterations.
To handle API & system related errors, such as the OpenAI code 500, we employ exponential backoff, implemented using the tenacity library (Tenacity Developers, 2025a). Additionally, KGoT includes comprehensive logging systems as part of its error management framework. These systems track the errors encountered during system operation, providing valuable data that can be easily parsed and analyzed (e.g., snapshots of the knowledge graphs or responses from third-party APIs).
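The exponential-backoff policy can be sketched as a plain retry loop; KGoT implements it with the tenacity library, so the helper below only mirrors the logic, and the flaky API and its error type are illustrative.

```python
import time

# Sketch of exponential backoff for transient API errors (e.g. HTTP 500).
def with_backoff(fn, retries=3, base_delay=0.01):
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of attempts: propagate so it gets logged
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, ...

calls = {"n": 0}
def flaky_api():
    """Stand-in API that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP 500")  # transient server-side failure
    return "ok"

result = with_backoff(flaky_api)
print(result, calls["n"])  # ok 3
```

With tenacity, the same behavior is expressed declaratively as a retry decorator with a wait-exponential strategy instead of the explicit loop.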
The Python Executor tool, a key component of the system, is containerized to ensure secure execution of LLM-generated code. This tool is designed to run code with strict timeouts and safeguards, preventing potential misuse or resource overconsumption.
3.7 Implementation Details
KGoT employs Docker (Docker Inc., 2025) and Sarus (Benedicic et al., 2019) for containerization, enabling a consistent and isolated runtime environment for all components. We containerize critical modules such as the KGoT controller, the Neo4j knowledge graph, and integrated tools (e.g., the Python Executor tool for safely running LLM-generated code with timeouts). Here, Docker provides a widely adopted containerization platform for local and cloud deployments that guarantees consistency between development and production environments. Sarus, a specialized container platform designed for high-performance computing (HPC) environments, extends KGoTâs portability to HPC settings where Docker is typically unavailable due to security constraints. This integration allows KGoT to operate efficiently in HPC environments, leveraging their computational power.
KGoT also harnesses LangChain (LangChain Inc., 2025a), an open-source framework specifically designed for creating and orchestrating LLM-driven applications. LangChain offers a comprehensive suite of tools and APIs that simplify the complexities of managing LLMs, including prompt engineering, tool integration, and the coordination of LLM outputs.
4 System Workflow
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph of Thoughts - High-Level & Detailed View
### Overview
This diagram illustrates the architecture of a "Knowledge Graph of Thoughts" system, presenting both a high-level overview and a detailed view of its components and workflow. The system takes a user question as input and generates a Knowledge Graph of Thoughts (KGOT) response. It leverages a combination of knowledge graphs, Large Language Models (LLMs), and integrated tools.
### Components/Axes
The diagram is divided into two main sections: a high-level overview at the top and a detailed view at the bottom, separated by a "More details" banner. Key components include:
* **User Question:** Input to the system.
* **Controller:** Manages the overall process.
* **LLM Graph Executor:** Executes graph-related operations using LLMs.
* **LLM Tool Executor:** Executes tool calls using LLMs.
* **Integrated Tools:** A collection of tools used by the LLM Tool Executor (Python code & math tool, Image tool, Text inspector, MDConverter, mp3, YouTube transcriber, Browser).
* **Graph Store:** Stores the knowledge graph.
* **Backend:** The underlying data storage and knowledge extraction mechanism.
* **KGOT Response:** The output of the system.
The top part gives a high-level overview: a user question enters the Controller, which coordinates the LLM Graph Executor, the LLM Tool Executor, and the Integrated Tools; the knowledge store holds the knowledge graph together with a knowledge extraction method and is backed by a storage backend (e.g., a graph database); the system ultimately produces the KGoT response.
The bottom part details the workflow as numbered steps (1-9), with "LLM" annotations marking steps that extensively use an LLM: (1) a new graph state is created; (2) a decision node checks whether the maximum number of iterations has been reached; (3) an LLM determines the next step via majority vote; (4) an LLM defines tool calls; (5) the tool calls are run; (6) an LLM runs the ENHANCE operation; (7) an LLM runs the SOLVE operation, generating the solution; (8) additional mathematical processing is applied; (9) the solution is parsed. The integrated tools comprise a Python code & math tool, an image tool with an ExtractZIP tool, a text inspector, MDConverter, an mp3 tool, a YouTube transcriber, and a browser with a Wikipedia tool ("Find", "Find next", "Visit tool", "Active search"); most of them use an LLM. The backend is either a graph database (e.g., Neo4j) with knowledge extraction via a graph query language, or a lightweight backend using knowledge extraction and a general-purpose language; each backend can be used separately or at the same time in order to benefit from the strengths of both, and other backends could also be harnessed.
</details>
Figure 2: Architecture overview of KGoT (top part) and the design details combined with the workflow (bottom part).
We show the workflow in the bottom part of Figure 2. The workflow begins when the user submits a problem to the system. The first step is to verify whether the maximum number of iterations allowed for solving the problem has been reached. If the iteration limit is exceeded, the system no longer tries to gather additional information and insert it into the KG, but instead returns a solution based on the data already in the KG. Otherwise, a majority vote (over several replies from the LLM) decides whether the system should proceed with the Enhance pathway (using tools to generate new knowledge) or directly with the Solve pathway (gathering the existing knowledge in the KG and using it to deliver the task solution).
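The control loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual KGoT implementation: `ask_llm`, `enhance`, and `solve` are hypothetical stand-ins for the LLM-driven operations, and the KG state is simplified to a plain dictionary.

```python
from collections import Counter

def majority_vote(votes):
    """Pick the most common decision among several LLM replies."""
    return Counter(votes).most_common(1)[0][0]

def run_kgot(task, ask_llm, enhance, solve, max_iterations=5, num_votes=3):
    """Skeleton of the KGoT control loop: Enhance until the majority vote
    (or the iteration limit) directs the system to Solve."""
    kg = {}  # the evolving knowledge graph state (simplified as a dict)
    for _ in range(max_iterations):
        # Several LLM replies vote on the next pathway: "enhance" or "solve".
        decision = majority_vote([ask_llm(task, kg) for _ in range(num_votes)])
        if decision == "solve":
            break
        kg = enhance(task, kg)  # tools gather new knowledge into the KG
    # Iteration limit reached or the vote chose Solve: answer from the existing KG.
    return solve(task, kg)
```

Once the iteration budget is exhausted, `solve` runs regardless of the vote, matching the fallback behavior described above.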
The Enhance Pathway. If the majority vote indicates the Enhance pathway, the next step involves determining the tools necessary for completing the Enhance operation. The system then orchestrates the appropriate tool calls based on the KG state. Once the required data from the tools is collected, the system generates the Enhance query or queries to modify the KG appropriately. Each Enhance query is executed and its output is validated. If an error or invalid value is returned, the system attempts to fix the query, retrying a specified number of times. If the retries fail, the query is discarded and the operation moves on. After processing the Enhance operation, the system increments the iteration count and continues until the KG is sufficiently expanded or the iteration limit is reached. This pathway ensures that the knowledge graph is enriched with relevant and accurate information, enabling the system to progress effectively toward a solution.
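The validate-and-retry logic for Enhance queries can be sketched as below. This is illustrative only; `fix_query` stands in for the LLM-based repair step, and the bound on retries is a configurable parameter rather than the value used in KGoT.

```python
def execute_with_retries(query, execute, fix_query, max_retries=3):
    """Run an Enhance query; on an error or invalid value, ask the LLM to
    repair the query a bounded number of times, and discard the query if
    all retries fail."""
    for _ in range(max_retries + 1):
        try:
            result = execute(query)
        except Exception as error:
            query = fix_query(query, error)  # LLM proposes a corrected query
            continue
        return result
    return None  # retries exhausted: discard the query and move on
```

Returning `None` models the "discard and move on" behavior: the faulty query is dropped rather than blocking the rest of the Enhance operation.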
The Solve Pathway. If the majority vote directs the system to the Solve pathway, the system executes multiple Solve operations iteratively. If an execution produces an invalid value or an error three times in a row, the system asks the LLM to correct the issue by recreating the used query, and the query is then re-executed. If errors persist after three such retries, the query is regenerated entirely, disregarding the faulty result, and the process restarts. After the Solve operation returns the result, final parsing is applied, which includes potential mathematical processing to resolve remaining calculations and refining the output (e.g., formatting the results appropriately).
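The final parsing step, including the mathematical post-processing, could look roughly like this. It is a simplified sketch: the actual KGoT parser is more involved, and here a small safe arithmetic evaluator stands in for the mathematical processing.

```python
import ast
import operator

# Operators permitted during the mathematical post-processing.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a small arithmetic AST (no eval of arbitrary code)."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def parse_solution(raw):
    """Resolve a leftover calculation if present, then normalize formatting."""
    text = raw.strip()
    try:
        value = _eval(ast.parse(text, mode="eval").body)
        # Format integer-valued results without a trailing ".0".
        return str(int(value)) if float(value).is_integer() else str(value)
    except (ValueError, SyntaxError):
        return text  # not a calculation: return the cleaned answer as-is
```

Answers that are not arithmetic expressions pass through unchanged apart from whitespace normalization, which mirrors the "refine the output" role of the final parsing step.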
5 Evaluation
We now show the advantages of KGoT over the state of the art. Additional results and full details on the evaluation setup are in Appendix D.
Comparison Baselines. We focus on the Hugging Face (HF) Agents (Roucher & Petrov, 2025), the most competitive scheme in the GAIA benchmark for the hardest level 3 tasks with the GPT-4 class of models. We also compare to two agentic frameworks, namely GPTSwarm (Zhuge et al., 2024) (a representative graph-enhanced multi-agent scheme) and Magentic-One (Fourney et al., 2024), an AI agent equipped with a central orchestrator and multiple integrated tool agents. Next, to evaluate whether database search outperforms graph-based knowledge extraction, we also consider two retrieval-augmented generation (RAG) (Lewis et al., 2020) schemes, a simple RAG scheme and GraphRAG (Edge et al., 2025). Both RAG baselines use the same tool-generated knowledge, chunking data at tool-call granularity (i.e., a chunk corresponds to the output of an individual tool call). Simple RAG constructs a vector database from these tool outputs, while GraphRAG instead models the tool outputs as a static KG of entities and relations, enabling retrieval via graph traversal. Finally, we use Zero-Shot schemes, where a model answers without any additional agent framework.
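The tool-call-granularity chunking used by the RAG baselines can be sketched as follows. This is an illustration only: the `embed` function is a hypothetical stand-in for a real embedding model, and the dot-product ranking is a toy similarity.

```python
def build_chunks(tool_outputs):
    """One chunk per tool call output, tagged with its originating tool."""
    return [{"tool": tool, "text": output} for tool, output in tool_outputs]

def retrieve(query, chunks, embed, top_k=2):
    """Rank chunks by a toy similarity between query and chunk embeddings."""
    def score(chunk):
        q, c = embed(query), embed(chunk["text"])
        return sum(a * b for a, b in zip(q, c))  # dot-product similarity
    return sorted(chunks, key=score, reverse=True)[:top_k]
```

Because each chunk is exactly one tool call's output, retrieval operates on the same knowledge units that KGoT would insert into the KG, which makes the comparison between the RAG baselines and KGoT fair.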
KGoT variants. First, we experiment with graph query languages vs. general-purpose languages, cf. Section 2.3. For each option, we vary how the Solve operation is executed, by either having the LLM send a request to the backend (a Python script for NetworkX and a Cypher/SPARQL query for Neo4j/RDF4J) or by directly asking the LLM to infer the answer based on the KG (Direct Retrieval (DR)). We experiment with different query languages (Cypher vs. SPARQL). We also consider "fusion" runs, which simulate the effect of KGoT runs with both graph backends available simultaneously (or both Solve operation variants harnessed for each task). Fusion runs incur only negligible additional storage overhead because the generated KGs are small (up to several hundred nodes). Finally, we experiment with different tool sets. To focus on the differences coming from harnessing the KG, we reuse several utilities from AutoGen (Wu et al., 2024), such as Browser and MDConverter, and tools from HF Agents, such as Surfer Agent, web browsing tools, and Text Inspector.
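To illustrate the difference between the two extraction styles, consider a toy task of counting a node's neighbors. With the general-purpose-language backend the Solve query is ordinary Python over the graph, while with the graph-query backend the LLM emits a Cypher string instead. The example is hypothetical (a small triple list stands in for the actual graph backends, and the labels in the Cypher query are made up):

```python
# Tiny task KG as (subject, relation, object) triples.
kg = [("Example Film", "stars", actor)
      for actor in ("Actor A", "Actor B", "Actor C")]

# General-purpose-language backend (NetworkX-style): Solve is plain Python.
num_cast = sum(1 for s, r, o in kg if s == "Example Film" and r == "stars")

# Graph-query backend (Neo4j): the LLM would emit a Cypher query such as:
cypher_query = (
    "MATCH (f:Film {title: 'Example Film'})-[:STARS]->(a:Actor) "
    "RETURN count(a) AS num_cast"
)
```

Both queries express the same extraction; the trade-off, explored in Section 5.2, is that the Python form is easier for the LLM to repair, while the Cypher form is more concise for pattern counting.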
Considered Metrics. We focus primarily on the number of solved tasks as well as on token costs ($). Unless stated otherwise, we report single-run results for budget reasons.
Considered Datasets. We use the GAIA benchmark (Mialon et al., 2024), focusing on the validation set (165 tasks) for budgetary reasons and because it comes with ground-truth answers. The considered tasks are highly diverse in nature; many require parsing websites or analyzing PDF, image, and audio files. We focus on GAIA as it is currently the most comprehensive benchmark for general-purpose AI assistants, covering diverse domains such as web navigation, code execution, image reasoning, scientific QA, and multimodal tasks. We further evaluate on SimpleQA (Wei et al., 2024), a factuality benchmark of 4,326 questions, of which we sample 10% for budgetary reasons. The dataset spans diverse topics and emphasizes single, verifiable answers, making it effective for assessing factual accuracy.
<details>
<summary>x12.png Details</summary>

### Visual Description
Grouped bar chart comparing Zero-Shot runs, KGoT variants, KGoT fusion variants, and baselines on the GAIA validation set, with bars broken down by task level (Level 1, Level 2, Level 3). The left panel shows the number of solved tasks (0-70, the higher the better; annotated maximum: 71). The right panel shows the average cost ($) per task on a logarithmic scale (the lower the better; annotated maximum: $3.403). The x-axis lists the compared configurations, including Zero-Shot GPT-4o and GPT-4o mini, the KGoT variants (Neo4j, Neo4j + Query, NetworkX + DR, NetworkX + Query, NetworkX + Query + DR, RDF4J + DR, Neo4j + NetworkX (Query + DR)), and the baselines (Simple RAG, GraphRAG, GPTSwarm, Magentic-One, HF Agents). The chart illustrates a clear performance-cost trade-off, with the KGoT fusion variants solving the most tasks at a substantially lower cost than the GPT-4o-based baselines.
</details>
Figure 3: Advantages of different variants of KGoT over other baselines (Hugging Face Agents using both GPT-4o mini and GPT-4o, Magentic-One, GPTSwarm, two RAG baselines, Zero-Shot GPT-4o mini, and Zero-Shot GPT-4o) on the validation dataset of the GAIA benchmark. DR stands for Direct Retrieval. The model used is GPT-4o mini unless noted otherwise.
5.1 Advantages of KGoT
Figure 3 shows the number of solved tasks (left side) as well as the average cost per solved task (right side) for different KGoT variants and all comparison baselines. While we focus on GPT-4o mini, we also show results for HF Agents and Zero-Shot with GPT-4o. Additionally, Figure 11 shows the Pareto front for the multidimensional optimization problem of improving accuracy (i.e., reducing the number of failed tasks) and lowering cost. All variants of KGoT solve a greater number of tasks (up to 9 more) than HF Agents while also being more cost-efficient (between 42% and 62% lower costs). The key reason for KGoT's advantages stems from harnessing the knowledge graph-based representation of the evolving task state.
The ideal fusion runs of Neo4j and NetworkX solve an even greater number of tasks (57 for both) than the single runs, have a lower average cost (up to 62% lower than HF Agents), and even outperform HF Agents with GPT-4o. The fusion of all combinations of backend and solver types solves by far the highest number of tasks (71), more than twice as many as HF Agents, while also exhibiting 44% lower cost than HF Agents. The direct Zero-Shot use of GPT-4o mini and GPT-4o has the lowest average cost per solved task (just $0.0013 and $0.0164, respectively), making it the most cost-effective; however, this approach only solves 17 and 29 tasks, respectively. GPTSwarm is cheaper than KGoT, but also solves fewer tasks (only 26). While Magentic-One is a capable agent with a sophisticated architecture, its performance with GPT-4o mini is limited, solving 31 tasks correctly while also exhibiting significantly higher costs. Simple RAG yields somewhat higher costs than KGoT and solves fewer tasks (35). GraphRAG performs even worse, solving only 23 tasks and incurring an even higher cost. While neither RAG baseline can invoke new tools to gather missing information (reducing accuracy and adaptability), GraphRAG's worse performance stems from the fact that it primarily targets query summarization, not tasks as diverse as those tested by GAIA. Overall, KGoT achieves the best cost-accuracy tradeoff, being both highly affordable and very effective.
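The fusion numbers above can be understood as a set union over single runs: a task counts as solved by the fusion if at least one of the combined variants solves it. A sketch of this accounting, with made-up per-run task IDs:

```python
def fusion_solved(runs):
    """A task counts as solved by the fusion if any single run solves it."""
    solved = set()
    for run in runs:
        solved |= run  # set union across the combined variants
    return solved

# Hypothetical per-variant results (IDs of solved GAIA tasks).
neo4j_run = {1, 2, 5, 8}
networkx_run = {2, 3, 5, 9}
fused = fusion_solved([neo4j_run, networkx_run])
```

Because the variants fail on different tasks, the union can be much larger than either run alone, which is why the full fusion reaches 71 solved tasks.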
5.2 Analysis of Methods for Knowledge Extraction
We explore different methods of extracting knowledge. Overall, in many situations, different methods have complementary strengths and weaknesses.
Graph queries with Neo4j excel at tasks such as counting patterns. Yet, Cypher queries can be difficult to generate correctly, especially for graphs with many nodes and edges. Despite this, KGoT's Cypher queries are able to solve many new GAIA tasks that could not be solved without harnessing Cypher. SPARQL (Pérez et al., 2009) + RDF4J (Eclipse Foundation, 2025) performs slightly worse (36 tasks solved) than Cypher + Neo4j (existing literature also indicates that LLMs have difficulties formulating effective SPARQL queries (Emonet et al., 2024; Mecharnia & d'Aquin, 2025)).
Python with NetworkX offers certain advantages over Neo4j: it eliminates the need for a separate database server, making it a lightweight choice for the KG, and NetworkX computations are fast and efficient for small to medium-sized graphs without the overhead of database transactions. Moreover, in cases where Neo4j-based implementations struggle, we observe that NetworkX-generated graphs tend to be more detailed, providing richer vertex properties and relationships. This is likely due to the greater flexibility of Python code over Cypher queries for graph insertion, which enables more fine-grained control over vertex attributes and relationships. Another reason may be that Python is likely better represented in the training data of the respective models than Cypher.
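The kind of attribute-rich insertion code the LLM might generate for the NetworkX backend is sketched below. The entities and properties are invented for illustration; the point is that a single Python statement can attach arbitrary fine-grained attributes to vertices and edges.

```python
import networkx as nx

# Hypothetical insertion code for the NetworkX backend: rich vertex and
# edge properties (e.g., provenance and confidence) in single statements.
kg = nx.DiGraph()
kg.add_node("film", title="Example Film", year=1999, source="tool:browser")
kg.add_node("actor", name="Jane Doe", born=1970)
kg.add_edge("film", "actor", relation="stars", role="lead", confidence=0.9)

num_props = len(kg.nodes["film"]) + len(kg.edges["film", "actor"])
```

Expressing the same insertion in Cypher is possible but more verbose, which may explain why the Python-generated graphs we observed carried richer annotations.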
Our analysis of failed tasks indicates that, in many cases, the KG contains the required data, but the graph query fails to extract it. In such scenarios, Direct Retrieval, where the entire KG is included in the model's context, performs significantly better by bypassing query composition issues. However, Direct Retrieval shows lower accuracy in cases requiring structured, multi-step reasoning.
We also found that Direct Retrieval excels at extracting dispersed information but struggles with structured queries, whereas graph queries are more effective for structured reasoning but can fail when the LLM generates incorrect query formulations. Although both Cypher and general-purpose queries are occasionally erroneous, Python scripts require more frequent corrections because they are often longer and more error-prone. Despite the higher number of corrections, however, the LLM fixes Python code more easily than Cypher queries, often succeeding after a single attempt. During retrieval, the LLM frequently embeds the necessary computations directly within the Python scripts while annotating its reasoning through comments, improving transparency and interpretability.
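Direct Retrieval can be sketched as serializing the whole KG into the prompt and letting the model answer from it, bypassing query composition entirely. This is an illustration under assumptions: `ask_llm` is a hypothetical stand-in for the model call, and the triple rendering is one of many possible serializations.

```python
def serialize_kg(triples):
    """Render the entire KG as plain-text triples for the model's context."""
    return "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in triples)

def direct_retrieval(question, triples, ask_llm):
    """Bypass query composition: put the full KG in context and ask directly."""
    prompt = (f"Knowledge graph:\n{serialize_kg(triples)}\n\n"
              f"Question: {question}\nAnswer using only the graph above.")
    return ask_llm(prompt)
```

This explains both findings above: with the full KG in context, dispersed facts are directly visible to the model, but multi-step structured reasoning must now happen implicitly in the model rather than explicitly in a query.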
5.3 Advantages on the GAIA Test Set
Table 1: Comparison of KGoT with other current state-of-the-art open-source agents on the full GAIA test set. The baseline data on the number of solved tasks, including for TapeAgent (Bahdanau et al., 2024), is obtained from the GAIA Leaderboard (Mialon et al., 2025). We highlight the best performing scheme in a given category in bold. Model: GPT-4o mini.
| Agents | All | L1 | L2 | L3 |
| --- | --- | --- | --- | --- |
| GPTSwarm | 33 | 15 | 15 | 3 |
| Magentic-One | 43 | 22 | 18 | 3 |
| TapeAgent | 66 | 28 | 35 | 3 |
| Hugging Face Agents | 68 | 30 | 34 | 4 |
| KGoT (fusion) | 73 | 33 | 36 | 4 |
Furthermore, our approach achieves state-of-the-art performance on the GAIA test set with the GPT-4o mini model. The results are shown in Table 1, underscoring its effectiveness across all evaluation levels. The test set consists of 301 tasks (93 level 1 tasks, 159 level 2 tasks, and 49 level 3 tasks).
5.4 Advantages beyond GAIA Benchmark
We also evaluate KGoT as well as HF Agents and GPTSwarm on a 10% sample (433 tasks) of the SimpleQA benchmark (detailed results are in Appendix D.1). KGoT performs best, solving 73.21% of the tasks, while HF Agents and GPTSwarm exhibit reduced accuracy (66.05% and 53.81%, respectively). KGoT incurs only $0.018 per solved task, less than a third of the HF Agents cost ($0.058), while being somewhat more expensive than GPTSwarm ($0.00093).
We further evaluate KGoT on the entire SimpleQA benchmark (due to the very high costs of running all SimpleQA questions, we limit the full benchmark evaluation to KGoT). We observe no degradation in performance, with a 70.34% accuracy rate. When compared against the official F1 scores of various OpenAI and Claude models (OpenAI, 2025), KGoT outperforms all the available results. Specifically, our design achieves a 71.06% F1 score, significantly surpassing the 49.4% outcome of the top-performing reasoning model and improving upon all mini-reasoning models by at least 3.5×. Furthermore, KGoT exceeds the performance of all standard OpenAI models, from GPT-4o's 40% F1 score to the best-scoring closed-source model, GPT-4.5, at 62.5%. More detailed results are available in Appendix D.1.
5.5 Ensuring Scalability and Mitigating Bottlenecks
The primary bottleneck in KGoT arises from I/O-bound and latency-sensitive LLM tool invocations (e.g., web browsing, text parsing), which account for 72% of the runtime; KGoT mitigates this through asynchronous execution and graph operation parallelism, as discussed in Section 3.4. A detailed breakdown of the runtime is reported in Appendix D.3. Figure 10 confirms KGoT's scalability: increasing the degree of parallelism consistently reduces the runtime. Moreover, due to the effective knowledge extraction process and the nature of the tasks considered, none of the tasks require large KGs. The maximum graph size that we observed was 522 nodes, orders of magnitude below any scalability concerns.
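The asynchronous execution of I/O-bound tool calls can be sketched with `asyncio`. This is a simplified illustration of the mitigation, not the actual KGoT code; the tool names and delays are made up, with `asyncio.sleep` modeling network or parsing latency.

```python
import asyncio

async def run_tool(name, delay):
    """Stand-in for an I/O-bound tool call (web browsing, text parsing, ...)."""
    await asyncio.sleep(delay)  # models network or parsing latency
    return f"{name}: done"

async def run_tools_concurrently(calls):
    """Issue all tool calls at once; the total latency is roughly that of
    the slowest call, not the sum of all calls as in sequential execution."""
    return await asyncio.gather(*(run_tool(name, d) for name, d in calls))

results = asyncio.run(run_tools_concurrently(
    [("browser", 0.02), ("text_inspector", 0.01), ("transcriber", 0.015)]))
```

`asyncio.gather` preserves the order of the submitted calls while overlapping their latencies, which is why issuing many independent tool invocations concurrently collapses the dominant 72% I/O share of the runtime.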
5.6 Impact from Various Design Decisions
<details>
<summary>x13.png Details</summary>

### Visual Description
Grouped bar chart comparing four schemes (GPTSwarm, HF Agents, KGoT (Neo4j + Query), and Zero-Shot) across various LLMs: Qwen2.5-32B, DeepSeek-R1-70B, GPT-4o mini, DeepSeek-R1-32B, QwQ-32B, DeepSeek-R1-7B, Qwen2.5-72B, Qwen2.5-7B, and Qwen2.5-1.5B. The y-axis shows the number of solved tasks (0-50, the higher the better). Performance varies considerably with the underlying model.
</details>
Figure 4: Performance on the GAIA validation set with KGoT (non-fusion) using various LLMs. For KGoT, we use Cypher queries for knowledge extraction from the Neo4j database.
<details>
<summary>x14.png Details</summary>

### Visual Description
Grouped bar chart showing the number of solved tasks (0-80, the higher the better) for different knowledge extraction configurations: the Neo4j and NetworkX backends, each with Direct Retrieve, Query, and Query + DR variants, their combination (Neo4j + NetworkX), and a "No KG" baseline (Single Run #1, Single Run #2, Fusion). Bars are broken down by task level (Level 1, Level 2, Level 3), with a grey "Max" bar per group.
</details>
Figure 5: The impact of harnessing knowledge graphs (KGs) with different knowledge extraction methods (graph queries with Neo4j and Cypher, and general-purpose languages with Python and NetworkX), versus using no KGs at all. DR stands for Direct Retrieval. Model: GPT-4o mini.
We also show the advantages of KGoT with different open models in Figure 4: it outperforms HF Agents and GPTSwarm for nearly all considered models (Yang et al., 2025; Guo et al., 2025). Interestingly, certain sizes of DeepSeek-R1 (Guo et al., 2025) offer high zero-shot performance that outperforms both KGoT and HF Agents, illustrating potential for further improvements specifically aimed at Reasoning Language Models (RLMs) (Besta et al., 2025a; c).
Finally, we investigate the impact on performance of harnessing KGs versus using no KGs at all (the "no KG" baseline), which we illustrate in Figure 5. Harnessing KGs has clear advantages, with a nearly 2$\times$ increase in the number of solved tasks. This confirms the positive impact of structuring the task-related knowledge into a graph format, and implies that our workflow generates high-quality graphs. To further confirm this, we additionally verified these graphs manually and discovered that the generated KGs do contain the actual solution (e.g., the solution can be found across the nodes/edges of a given KG by string matching). This illustrates that, in the majority of the solved tasks, the automatically generated KGs correctly represent the solution and directly enable solving a given task.
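The manual verification above can be sketched as a simple string-matching check. This is an illustrative reconstruction, not code from the KGoT repository: we assume a KG stored as (subject, predicate, object) triples, whereas KGoT itself keeps the graph in Neo4j or NetworkX; all names below are hypothetical.

```python
def solution_in_kg(triples, answer):
    """Return True if `answer` appears, via case-insensitive string
    matching, in any node (subject/object) or edge label (predicate)."""
    needle = answer.lower()
    return any(needle in str(part).lower()
               for triple in triples
               for part in triple)

# Toy KG for a question whose correct answer is "Lake Baikal".
kg = [
    ("Lake Baikal", "has_max_depth", "1642 m"),
    ("Lake Baikal", "located_in", "Siberia"),
]

print(solution_in_kg(kg, "Lake Baikal"))  # True
print(solution_in_kg(kg, "Caspian Sea"))  # False
```

If such a check succeeds, the answer is already encoded in the graph, so the final LLM call only needs to extract it rather than derive it.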
We offer further analyses in Appendix D, including studying the impact on performance of different tool sets, prompt formats, and fusion types.
6 Related Work
Our work is related to numerous LLM domains.
First, we use LangChain (LangChain Inc., 2025a) to facilitate the integration of the LLM agents with the rest of the KGoT system. Other such LLM integration frameworks, such as MiniChain (Rush, 2023) or AutoChain (Forethought, 2023), could be used instead.
Agent collaboration frameworks are systems such as Magentic-One and numerous others (Zhuge et al., 2024; Tang et al., 2024; Liu et al., 2024b; Li et al., 2024; Chu et al., 2024; Wu et al., 2024; Chen et al., 2024; Hong et al., 2024; Shinn et al., 2023; Zhu et al., 2024; Kagaya et al., 2024; Zhao et al., 2024a; Stengel-Eskin et al., 2024; Significant Gravitas, 2025; Zhu et al., 2025). The core KGoT idea that can be applied to enhance such frameworks is that a KG can also serve as a common shared task representation for multiple agents solving a task together; such a graph would then be updated by more than a single agent. This idea proves effective, as confirmed by the fact that KGoT outperforms highly competitive baselines (HF Agents, Magentic-One, GPTSwarm) on both the GAIA and SimpleQA benchmarks.
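The shared-representation idea above can be sketched as a KG that several agents update concurrently. This is a minimal hypothetical sketch, not the KGoT implementation: the agent functions, the triple-set representation, and the locking scheme are all illustrative assumptions.

```python
import threading

class SharedKG:
    """A KG as shared task state: a thread-safe set of triples."""
    def __init__(self):
        self._triples = set()
        self._lock = threading.Lock()

    def add(self, subj, pred, obj):
        with self._lock:
            self._triples.add((subj, pred, obj))

    def triples(self):
        with self._lock:
            return set(self._triples)

kg = SharedKG()

def web_agent(kg):   # stand-in for an agent adding facts from a web crawler
    kg.add("task", "mentions", "GAIA benchmark")

def math_agent(kg):  # stand-in for an agent adding results from a math solver
    kg.add("29%", "improvement_over", "HF Agents")

threads = [threading.Thread(target=a, args=(kg,)) for a in (web_agent, math_agent)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(kg.triples()))  # 2
```

Because every agent reads from and writes to the same graph, each one sees the partial knowledge contributed by the others, which is the collaboration benefit the paragraph describes.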
Some agent frameworks explicitly use graphs for more effective collaboration. Examples are GPTSwarm (Zhuge et al., 2024), MacNet (Qian et al., 2025), and AgentPrune (Zhang et al., 2025). These systems differ from KGoT in that they use a graph to model and manage multiple agents in a structured way, forming a hierarchy of tools. In contrast, KGoT uses KGs to represent the task itself, including its intermediate state. These two design choices are orthogonal and could be combined. Moreover, while KGoT relies only on in-context learning, both MacNet (Qian et al., 2025) and AgentPrune (Zhang et al., 2025) require additional training rounds, making their integration and deployment more challenging and expensive than KGoT.
Many works exist in the domain of general prompt engineering (Beurer-Kellner et al., 2024; Besta et al., 2025c; Yao et al., 2023a; Besta et al., 2024a; Wei et al., 2022; Yao et al., 2023b; Chen et al., 2023; Creswell et al., 2023; Wang et al., 2023a; Hu et al., 2024; Dua et al., 2022; Jung et al., 2022; Ye et al., 2023). One could use such schemes to further enhance respective parts of the KGoT workflow. While we already use prompts that are suited for encoding knowledge graphs, harnessing other ideas from this domain could possibly bring further benefits.
Task decomposition & planning increases the effectiveness of LLMs by dividing a task into subtasks. Examples include ADaPT (Prasad et al., 2024), ANPL (Huang et al., 2023), and others (Zhu et al., 2025; Shen et al., 2023). Overall, the whole KGoT workflow already harnesses recursive task decomposition: the input task is divided into numerous steps, and many of these steps are further decomposed into sub-steps by the LLM Graph Executor if necessary. For example, when solving a task based on the already constructed KG, the LLM Graph Executor may decide to decompose this step similarly to ADaPT. Other decomposition schemes could also be tried; we leave this as future work.
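The recursive "solve directly, otherwise split and recurse" pattern in the ADaPT style can be sketched as follows. This is a toy illustration under stated assumptions: `solve_directly` and `split` are hypothetical stand-ins for LLM calls (here, an "atomic" task is one without the word "and"), not KGoT's actual executor logic.

```python
def solve_directly(task):
    # Stand-in for an LLM attempt; succeeds only on "atomic" tasks.
    return task.upper() if " and " not in task else None

def split(task):
    # Stand-in for an LLM-generated plan: naively split on " and ".
    return task.split(" and ")

def solve(task, depth=0, max_depth=3):
    """Try the task directly; on failure, decompose and recurse."""
    result = solve_directly(task)
    if result is not None or depth >= max_depth:
        return result
    return [solve(sub, depth + 1, max_depth) for sub in split(task)]

print(solve("fetch the page and extract the table"))
# ['FETCH THE PAGE', 'EXTRACT THE TABLE']
```

The depth bound mirrors the practical need to cap recursion so that decomposition cannot loop indefinitely on tasks the solver never manages to complete.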
Retrieval-Augmented Generation (RAG) is an important part of the LLM ecosystem, with numerous designs being proposed (Edge et al., 2025; Gao et al., 2024; Besta et al., 2025b; Zhao et al., 2024b; Hu & Lu, 2025; Huang & Huang, 2024; Yu et al., 2024a; Mialon et al., 2023; Li et al., 2022; Abdallah & Jatowt, 2024; Delile et al., 2024; Manathunga & Illangasekara, 2023; Zeng et al., 2024; Wewer et al., 2021; Xu et al., 2024; Sarthi et al., 2024; Asai et al., 2024; Yu et al., 2024b; Gutiérrez et al., 2024). RAG has been used primarily to ensure data privacy and to reduce hallucinations. We illustrate that it has lower performance than KGoT when applied to AI assistant tasks.
Another increasingly important part of the LLM ecosystem is the usage of tools to augment the abilities of LLMs (Beurer-Kellner et al., 2023; Schick et al., 2023; Xie et al., 2024). For example, ToolNet (Liu et al., 2024a) uses a directed graph to model the application of multiple tools while solving a task, but it focuses specifically on the iterative usage of tools at scale. KGoT harnesses a flexible and adaptable hierarchy of various tools, which could easily be extended with designs such as ToolNet to solve a wider range of complex tasks.
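The extensibility claim above can be illustrated with a minimal tool registry: tools are plain callables looked up by name, so new ones plug in without changing the controller. This is a hypothetical sketch; the names (`register`, `call_tool`, the toy tools) are ours, not KGoT's or LangChain's API.

```python
TOOLS = {}

def register(name):
    """Decorator that adds a callable to the tool registry."""
    def deco(fn):
        TOOLS[name] = fn
        return fn
    return deco

@register("calculator")
def calculator(expr: str) -> str:
    # Toy math solver: evaluate an arithmetic expression, no builtins.
    return str(eval(expr, {"__builtins__": {}}))

@register("python_runner")
def python_runner(code: str) -> str:
    # Toy Python tool: run a snippet and return its `result` variable.
    scope = {}
    exec(code, scope)
    return str(scope.get("result"))

def call_tool(name: str, arg: str) -> str:
    # The controller (an LLM in KGoT) only ever dispatches by name.
    return TOOLS[name](arg)

print(call_tool("calculator", "2 + 3 * 4"))  # 14
```

Because the controller depends only on the name-to-callable mapping, wrapping a richer scheme such as ToolNet's tool graph would amount to replacing the lookup in `call_tool` with a graph traversal.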
While KGoT focuses on classical AI assistant tasks, it can be extended to other applications. Promising directions could include supporting multi-stage, cost-efficient reasoning, for example to enhance the capabilities of the recent reasoning models such as DeepSeek-R1. Extending KGoT to this and other domains may require new ways of KG construction via predictive graph models (Besta et al., 2023a; 2024c), integration with neural graph databases (Besta et al., 2022), or deployment over distributed-memory clusters for scalability. Further, refining its reasoning strategies through advanced task decomposition schemes could improve performance on very long-horizon tasks. These directions highlight both the generality of the framework and current boundaries in tool orchestration, reasoning depth, and scalability, which we aim to address in future work.
7 Conclusion
In this paper, we introduce Knowledge Graph of Thoughts (KGoT), an AI assistant architecture that enhances the reasoning capabilities of low-cost models while significantly reducing operational expenses. By dynamically constructing and evolving knowledge graphs (KGs) that encode the task and its resolution state, KGoT enables structured knowledge representation and retrieval, improving task success rates on benchmarks such as GAIA and SimpleQA. Our extensive evaluation demonstrates that KGoT outperforms existing LLM-based agent solutions, for example achieving a substantial increase in task-solving efficiency of 29% or more over the competitive Hugging Face Agents baseline, while ensuring over 36$\times$ lower costs. Thanks to its modular design, KGoT can be extended to new domains that require complex multi-step reasoning integrated with extensive interactions with the external compute environment, for example automated scientific discovery or software design.
Acknowledgments
We thank Chi Zhang and Muyang Du for their contributions to the framework. We thank Hussein Harake, Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault, Daint and Alps machines, and for their excellent technical support. We thank Timo Schneider for help with infrastructure at SPCL. This project received funding from the European Research Council (Project PSAP, No. 101002047), and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project was supported by the ETH Future Computing Laboratory (EFCL), financed by a donation from Huawei Technologies. This project received funding from the European Union's HE research and innovation programme under the grant agreement No. 101070141 (Project GLACIATION). We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017103.
References
- Abdallah & Jatowt (2024) Abdelrahman Abdallah and Adam Jatowt. Generator-Retriever-Generator Approach for Open-Domain Question Answering, March 2024. URL https://arxiv.org/abs/2307.11278. arXiv:2307.11278.
- Asai et al. (2024) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 9112–9141, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/25f7be9694d7b32d5cc670927b8091e1-Abstract-Conference.html.
- Bahdanau et al. (2024) Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, and Quaizar Vohra. TapeAgents: A Holistic Framework for Agent Development and Optimization, December 2024. URL https://arxiv.org/abs/2412.08445. arXiv:2412.08445.
- Ben Mahria et al. (2021) Bilal Ben Mahria, Ilham Chaker, and Azeddine Zahi. An Empirical Study on the Evaluation of the RDF Storage Systems. Journal of Big Data, 8(1):100:1–100:20, July 2021. ISSN 2196-1115. doi: 10.1186/s40537-021-00486-y. URL https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00486-y.
- Benedicic et al. (2019) Lucas Benedicic, Felipe A. Cruz, Alberto Madonna, and Kean Mariotti. Sarus: Highly Scalable Docker Containers for HPC Systems. In Michèle Weiland, Guido Juckeland, Sadaf Alam, and Heike Jagode (eds.), Proceedings of the International Conference on High Performance Computing (ICS '19), volume 11887 of Lecture Notes in Computer Science, pp. 46–60, Frankfurt, Germany, June 2019. Springer International Publishing. ISBN 978-3-030-34356-9. doi: 10.1007/978-3-030-34356-9_5. URL https://link.springer.com/chapter/10.1007/978-3-030-34356-9_5.
- Besta et al. (2018) Maciej Besta, Dimitri Stanojevic, Tijana Zivic, Jagpreet Singh, Maurice Hoerold, and Torsten Hoefler. Log(Graph): A Near-Optimal High-Performance Graph Representation. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT '18, pp. 7:1–7:13, Limassol, Cyprus, November 2018. Association for Computing Machinery. ISBN 9781450359863. doi: 10.1145/3243176.3243198. URL https://doi.org/10.1145/3243176.3243198.
- Besta et al. (2022) Maciej Besta, Patrick Iff, Florian Scheidl, Kazuki Osawa, Nikoli Dryden, Michal Podstawski, Tiancheng Chen, and Torsten Hoefler. Neural Graph Databases. In Bastian Rieck and Razvan Pascanu (eds.), Proceedings of the First Learning on Graphs Conference, volume 198 of Proceedings of Machine Learning Research, pp. 31:1–31:38, Virtual Event, December 2022. PMLR. URL https://proceedings.mlr.press/v198/besta22a.html.
- Besta et al. (2023a) Maciej Besta, Afonso Claudino Catarino, Lukas Gianinazzi, Nils Blach, Piotr Nyczyk, Hubert Niewiadomski, and Torsten Hoefler. HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers. In Soledad Villar and Benjamin Chamberlain (eds.), Proceedings of the Second Learning on Graphs Conference, volume 231 of Proceedings of Machine Learning Research, pp. 15:1–15:20, Virtual Event, November 2023a. PMLR. URL https://proceedings.mlr.press/v231/besta24a.html.
- Besta et al. (2023b) Maciej Besta, Robert Gerstenberger, Marc Fischer, Michal Podstawski, Nils Blach, Berke Egeli, Georgy Mitenkov, Wojciech Chlapek, Marek Michalewicz, Hubert Niewiadomski, Jürgen Müller, and Torsten Hoefler. The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '23, pp. 22:1–22:18, Denver, CO, USA, November 2023b. Association for Computing Machinery. ISBN 9798400701092. doi: 10.1145/3581784.3607068. URL https://doi.org/10.1145/3581784.3607068.
- Besta et al. (2023c) Maciej Besta, Robert Gerstenberger, Emanuel Peter, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Comput. Surv., 56(2):31:1–31:40, September 2023c. ISSN 0360-0300. doi: 10.1145/3604932. URL https://doi.org/10.1145/3604932.
- Besta et al. (2024a) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024a. doi: 10.1609/aaai.v38i16.29720. URL https://ojs.aaai.org/index.php/AAAI/article/view/29720.
- Besta et al. (2024b) Maciej Besta, Robert Gerstenberger, Patrick Iff, Pournima Sonawane, Juan GĂłmez Luna, Raghavendra Kanakagiri, Rui Min, Onur Mutlu, Torsten Hoefler, Raja Appuswamy, and Aidan O Mahony. Hardware Acceleration for Knowledge Graph Processing: Challenges & Recent Developments, November 2024b. URL https://arxiv.org/abs/2408.12173. arXiv:2408.12173.
- Besta et al. (2024c) Maciej Besta, Florian Scheidl, Lukas Gianinazzi, Grzegorz Kwaśniewski, Shachar Klaiman, Jürgen Müller, and Torsten Hoefler. Demystifying Higher-Order Graph Neural Networks, December 2024c. URL https://arxiv.org/abs/2406.12841. arXiv:2406.12841.
- Besta et al. (2025a) Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Zixuan Chen, Hubert Niewiadomski, and Torsten Hoefler. Reasoning Language Models: A Blueprint, June 2025a. URL https://arxiv.org/abs/2501.11223. arXiv:2501.11223.
- Besta et al. (2025b) Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michał Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, and Torsten Hoefler. Multi-Head RAG: Solving Multi-Aspect Problems with LLMs, July 2025b. URL https://arxiv.org/abs/2406.05085. arXiv:2406.05085.
- Besta et al. (2025c) Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O'Mahony, Onur Mutlu, and Torsten Hoefler. Demystifying Chains, Trees, and Graphs of Thoughts. IEEE Transactions on Pattern Analysis and Machine Intelligence, August 2025c. doi: 10.1109/TPAMI.2025.3598182. URL https://ieeexplore.ieee.org/document/11123142.
- Besta et al. (2025d) Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, and Torsten Hoefler. CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks, July 2025d. URL https://arxiv.org/abs/2406.02524. arXiv:2406.02524.
- Beurer-Kellner et al. (2023) Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Large Language Models are Zero-Shot Multi-Tool Users. In Proceedings of the ICML Workshop on Knowledge and Logical Reasoning in the Era of Data-Driven Learning, KLR '23, Honolulu, HI, USA, July 2023. URL https://files.sri.inf.ethz.ch/website/papers/lmql_actions.pdf.
- Beurer-Kellner et al. (2024) Luca Beurer-Kellner, Mark Niklas Müller, Marc Fischer, and Martin Vechev. Prompt Sketching for Large Language Models. In Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 3674–3706, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/beurer-kellner24b.html.
- Bhattacharjya et al. (2024) Debarun Bhattacharjya, Junkyu Lee, Don Joven Agravante, Balaji Ganesan, and Radu Marinescu. Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning, February 2024. URL https://arxiv.org/abs/2402.01602. arXiv:2402.01602.
- Chen et al. (2024) Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. AutoAgents: A Framework for Automatic Agent Generation. In Kate Larson (ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, pp. 22–30, Jeju, South Korea, August 2024. International Joint Conferences on Artificial Intelligence Organization. doi: 10.24963/ijcai.2024/3. URL https://www.ijcai.org/proceedings/2024/3.
- Chen et al. (2023) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research, November 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
- Chu et al. (2024) Zhixuan Chu, Yan Wang, Feng Zhu, Lu Yu, Longfei Li, and Jinjie Gu. Professional Agents – Evolving Large Language Models into Autonomous Experts with Human-Level Competencies, February 2024. URL https://arxiv.org/abs/2402.03628. arXiv:2402.03628.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR '23, Kigali, Rwanda, May 2023. OpenReview. URL https://openreview.net/forum?id=3Pf3Wg6o-A4.
- Delile et al. (2024) Julien Delile, Srayanta Mukherjee, Anton Van Pamel, and Leonid Zhukov. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge. In Proceedings of the Workshop ML for Life and Material Science: From Theory to Industry Applications, ML4LMS '24, Vienna, Austria, July 2024. OpenReview. URL https://openreview.net/forum?id=RUwfsPWrv3.
- Docker Inc. (2025) Docker Inc. Docker: Accelerated Container Applications. https://www.docker.com/, July 2025. Accessed: 2025-09-22.
- Dua et al. (2022) Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive Prompting for Decomposing Complex Questions. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP '22, pp. 1251–1265, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.81. URL https://aclanthology.org/2022.emnlp-main.81/.
- Eclipse Foundation (2025) Eclipse Foundation. RDF4J. https://rdf4j.org/, September 2025. Accessed: 2025-09-22.
- Edge et al. (2025) Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From Local to Global: A Graph RAG Approach to Query-Focused Summarization, February 2025. URL https://arxiv.org/abs/2404.16130. arXiv:2404.16130.
- Emonet et al. (2024) Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-Based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. In Reham Alharbi, Jacopo de Berardinis, Paul Groth, Albert Meroño Peñuela, Elena Simperl, and Valentina Tamma (eds.), Proceedings of the Special Session on Harmonising Generative AI and Semantic Web Technologies (HGAIS '24), volume 3953 of Workshop Proceedings, Baltimore, MD, USA, November 2024. CEUR. URL https://ceur-ws.org/Vol-3953/355.pdf.
- Forethought (2023) Forethought. AutoChain. https://autochain.forethought.ai/, 2023. Accessed: 2025-09-22.
- Fourney et al. (2024) Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks, November 2024. URL https://arxiv.org/abs/2411.04468. arXiv:2411.04468.
- Francis et al. (2018) Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the International Conference on Management of Data, SIGMOD '18, pp. 1433–1445, Houston, TX, USA, June 2018. Association for Computing Machinery. ISBN 9781450347037. doi: 10.1145/3183713.3190657. URL https://doi.org/10.1145/3183713.3190657.
- Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey, March 2024. URL https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
- Gu et al. (2025) Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A Survey on LLM-as-a-Judge, March 2025. URL https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, January 2025. URL https://arxiv.org/abs/2501.12948. arXiv:2501.12948.
- Guo et al. (2024) Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. In Kate Larson (ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, pp. 8048–8057, Jeju, South Korea, August 2024. International Joint Conferences on Artificial Intelligence Organization. doi: 10.24963/ijcai.2024/890. URL https://www.ijcai.org/proceedings/2024/890. Survey Track.
- Gutiérrez et al. (2024) Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS '24), volume 37 of Advances in Neural Information Processing Systems, pp. 59532–59569, Vancouver, Canada, December 2024. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/6ddc001d07ca4f319af96a3024f6dbd1-Abstract-Conference.html.
- Hong et al. (2024) Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR '24, pp. 23247–23275, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/6507b115562bb0a305f1958ccc87355a-Abstract-Conference.html.
- Hu et al. (2024) Hanxu Hu, Hongyuan Lu, Huajian Zhang, Wai Lam, and Yue Zhang. Chain-of-Symbol Prompting Elicits Planning in Large Langauge Models, August 2024. URL https://arxiv.org/abs/2305.10276. arXiv:2305.10276.
- Hu & Lu (2025) Yucheng Hu and Yuxing Lu. RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, June 2025. URL https://arxiv.org/abs/2404.19543. arXiv:2404.19543.
- Huang et al. (2023) Di Huang, Ziyuan Nan, Xing Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, and Yunji Chen. ANPL: Towards Natural Programming with Interactive Decomposition. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS '23), volume 36 of Advances in Neural Information Processing Systems, pp. 69404–69440, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/dba8fa689ede9e56cbcd4f719def38fb-Abstract-Conference.html.
- Huang & Huang (2024) Yizheng Huang and Jimmy Huang. A Survey on Retrieval-Augmented Text Generation for Large Language Models, August 2024. URL https://arxiv.org/abs/2404.10981. arXiv:2404.10981.
- Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP '22, pp. 1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.82. URL https://aclanthology.org/2022.emnlp-main.82/.
- Kaddour et al. (2023) Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and Applications of Large Language Models, July 2023. URL https://arxiv.org/abs/2307.10169. arXiv:2307.10169.
- Kagaya et al. (2024) Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, and Yang You. RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents. In Proceedings of the Workshop on Open-World Agents, OWA '24, Vancouver, Canada, December 2024. OpenReview. URL https://openreview.net/forum?id=Xf49Dpxuox.
- Kim et al. (2024) Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM Compiler for Parallel Function Calling. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 235 of Proceedings of Machine Learning Research, pp. 24370–24391, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/kim24y.html.
- LangChain Inc. (2025a) LangChain Inc. LangChain. https://www.langchain.com/, 2025a. Accessed: 2025-09-22.
- LangChain Inc. (2025b) LangChain Inc. Dealing with API Errors. https://js.langchain.com/v0.1/docs/modules/data_connection/text_embedding/api_errors/, 2025b. Accessed: 2025-09-22.
- LangChain Inc. (2025c) LangChain Inc. LangChain Core Tools: BaseTool. https://api.python.langchain.com/en/latest/tools/langchain_core.tools.BaseTool.html, 2025c. Accessed: 2025-09-22.
- LangChain Inc. (2025d) LangChain Inc. How to parse JSON output. https://python.langchain.com/docs/how_to/output_parser_json/, 2025d. Accessed: 2025-09-22.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Proceedings of the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS '20), volume 33 of Advances in Neural Information Processing Systems, pp. 9459–9474, Virtual Event, December 2020. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
- Li & Vasarhelyi (2024) Huaxia Li and Miklos A. Vasarhelyi. Applying Large Language Models in Accounting: A Comparative Analysis of Different Methodologies and Off-the-Shelf Examples. Journal of Emerging Technologies in Accounting, 21(2):133–152, October 2024. ISSN 1554-1908. doi: 10.2308/JETA-2023-065. URL https://publications.aaahq.org/jeta/article-abstract/21/2/133/12800/.
- Li et al. (2022) Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A Survey on Retrieval-Augmented Text Generation, February 2022. URL https://arxiv.org/abs/2202.01110. arXiv:2202.01110.
- Li et al. (2024) Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More Agents Is All You Need. Transactions on Machine Learning Research, October 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=bgzUSZ8aeg.
- Liu et al. (2024a) Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph, February 2024a. URL https://arxiv.org/abs/2403.00839. arXiv:2403.00839.
- Liu et al. (2024b) Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration. In Proceedings of the First Conference on Language Modeling, COLM '24, Philadelphia, PA, USA, October 2024b. OpenReview. URL https://openreview.net/forum?id=XII0Wp1XA9.
- Manathunga & Illangasekara (2023) S. S. Manathunga and Y. A. Illangasekara. Retrieval Augmented Generation and Representative Vector Summarization for Large Unstructured Textual Data in Medical Education, August 2023. URL https://arxiv.org/abs/2308.00479. arXiv:2308.00479.
- Mecharnia & d’Aquin (2025) Thamer Mecharnia and Mathieu d’Aquin. Performance and Limitations of Fine-Tuned LLMs in SPARQL Query Generation. In Genet Asefa Gesese, Harald Sack, Heiko Paulheim, Albert Merono-Penuela, and Lihu Chen (eds.), Proceedings of the Workshop on Generative AI and Knowledge Graphs, GenAIK ’25, pp. 69–77, Abu Dhabi, United Arab Emirates, January 2025. International Committee on Computational Linguistics. URL https://aclanthology.org/2025.genaik-1.8/.
- Mialon et al. (2023) Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented Language Models: A Survey. Transactions on Machine Learning Research, July 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=jh7wH2AzKK. Survey Certification.
- Mialon et al. (2024) Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR ’24, pp. 9025–9049, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/25ae35b5b1738d80f1f03a8713e405ec-Abstract-Conference.html.
- Mialon et al. (2025) Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA Leaderboard. https://huggingface.co/spaces/gaia-benchmark/leaderboard, September 2025. Accessed: 2025-09-25.
- NetworkX Developers (2025) NetworkX Developers. NetworkX Documentation. https://networkx.org/, May 2025. Accessed: 2025-09-22.
- OpenAI (2025) OpenAI. simple-evals. https://github.com/openai/simple-evals, July 2025. Accessed: 2025-09-22.
- Pérez et al. (2009) Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and Complexity of SPARQL. ACM Trans. Database Syst., 34(3):16:1–16:45, September 2009. ISSN 0362-5915. doi: 10.1145/1567274.1567278. URL https://doi.org/10.1145/1567274.1567278.
- Prasad et al. (2024) Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-Needed Decomposition and Planning with Language Models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp. 4226–4252, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.264. URL https://aclanthology.org/2024.findings-naacl.264/.
- Python Software Foundation (2025a) Python Software Foundation. codecs — Codec registry and base classes. https://docs.python.org/3/library/codecs.html, September 2025a. Accessed: 2025-09-22.
- Python Software Foundation (2025b) Python Software Foundation. asyncio — Asynchronous I/O. https://docs.python.org/3/library/asyncio.html, September 2025b. Accessed: 2025-09-22.
- Qian et al. (2025) Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling Large Language Model-Based Multi-Agent Collaboration. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), Proceedings of the Thirteenth International Conference on Learning Representations, ICLR ’25, pp. 41488–41505, Singapore, April 2025. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/66a026c0d17040889b50f0dfa650e5e0-Abstract-Conference.html.
- Robinson et al. (2015) Ian Robinson, Jim Webber, and Emil Eifrem. Graph Database Internals. In Graph Databases, chapter 7, pp. 149–170. O’Reilly, Sebastopol, CA, USA, 2nd edition, 2015. ISBN 9781491930892.
- Roucher & Petrov (2025) Aymeric Roucher and Sergei Petrov. Beating GAIA with Transformers Agents. https://github.com/aymeric-roucher/GAIA, February 2025. Accessed: 2025-09-22.
- Rush (2023) Alexander Rush. MiniChain: A Small Library for Coding with Large Language Models. In Yansong Feng and Els Lefever (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP ’23, pp. 311–317, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.27. URL https://aclanthology.org/2023.emnlp-demo.27.
- Sarthi et al. (2024) Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher Manning. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (eds.), Proceedings of the Twelfth International Conference on Learning Representations, ICLR ’24, pp. 32628–32649, Vienna, Austria, May 2024. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/8a2acd174940dbca361a6398a4f9df91-Abstract-Conference.html.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 68539–68551, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html.
- SerpApi LLM (2025) SerpApi LLM. SerpApi: Google Search API. https://serpapi.com/, 2025. Accessed: 2025-09-22.
- Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 38154–38180, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/77c33e6a367922d003ff102ffb92b658-Abstract-Conference.html.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 8634–8652, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html.
- Significant Gravitas (2025) Significant Gravitas. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT, September 2025. Accessed: 2025-09-22.
- Singhal (2012) Amit Singhal. Introducing the Knowledge Graph: things, not strings. https://www.blog.google/products/search/introducing-knowledge-graph-things-not/, May 2012. Accessed: 2025-09-22.
- Stengel-Eskin et al. (2024) Elias Stengel-Eskin, Archiki Prasad, and Mohit Bansal. ReGAL: Refactoring Programs to Discover Generalizable Abstractions. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML ’24), volume 235 of Proceedings of Machine Learning Research, pp. 46605–46624, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/stengel-eskin24a.html.
- Sumers et al. (2024) Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive Architectures for Language Agents. Transactions on Machine Learning Research, February 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=1i6ZCvflQJ. Survey Certification.
- Tang et al. (2024) Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawendé F. Bissyandé. CodeAgent: Autonomous Communicative Agents for Code Review. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP ’24, pp. 11279–11313, Miami, FL, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.632. URL https://aclanthology.org/2024.emnlp-main.632/.
- Tenacity Developers (2025a) Tenacity Developers. Tenacity: Retrying Library. https://github.com/jd/tenacity, April 2025a. Accessed: 2025-09-22.
- Tenacity Developers (2025b) Tenacity Developers. Tenacity Documentation. https://tenacity.readthedocs.io/en/latest/, 2025b. Accessed: 2025-09-22.
- Wang et al. (2023a) Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon’s Game of Thoughts: Battle Against Deception through Recursive Contemplation, October 2023a. URL https://arxiv.org/abs/2310.01320. arXiv:2310.01320.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR ’23, Kigali, Rwanda, May 2023b. OpenReview. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Wang et al. (2023c) Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian (Shawn) Ma, and Yitao Liang. Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 34153–34189, New Orleans, LA, USA, December 2023c. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/6b8dfb8c0c12e6fafc6c256cb08a5ca7-Abstract-Conference.html.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Proceedings of the Thirty-Sixth Annual Conference on Neural Information Processing Systems (NeurIPS ’22), volume 35 of Advances in Neural Information Processing Systems, pp. 24824–24837, New Orleans, LA, USA, December 2022. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
- Wei et al. (2024) Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring Short-Form Factuality in Large Language Models, November 2024. URL https://arxiv.org/abs/2411.04368. arXiv:2411.04368.
- Wewer et al. (2021) Christopher Wewer, Florian Lemmerich, and Michael Cochez. Updating Embeddings for Dynamic Knowledge Graphs, September 2021. URL https://arxiv.org/abs/2109.10896. arXiv:2109.10896.
- Wu et al. (2024) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In Proceedings of the First Conference on Language Modeling, COLM ’24, Philadelphia, PA, USA, October 2024. OpenReview. URL https://openreview.net/forum?id=BAakY1hNKS.
- Xie et al. (2024) Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Zeju Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. OpenAgents: An Open Platform for Language Agents in the Wild. In Proceedings of the First Conference on Language Modeling, COLM ’24, Philadelphia, PA, USA, October 2024. OpenReview. URL https://openreview.net/forum?id=sKATR2O1Y0.
- Xu et al. (2024) Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Chaojun Xiao, Zhiyuan Liu, Ge Yu, and Chenyan Xiong. ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents, October 2024. URL https://arxiv.org/abs/2402.13547. arXiv:2402.13547.
- Yang et al. (2025) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, et al. Qwen2.5 Technical Report, January 2025. URL https://arxiv.org/abs/2412.15115. arXiv:2412.15115.
- Yao et al. (2023a) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 11809–11822, New Orleans, LA, USA, December 2023a. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html.
- Yao et al. (2023b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR ’23, Kigali, Rwanda, May 2023b. OpenReview. URL https://openreview.net/forum?id=WE_vluYUL-X.
- Ye et al. (2023) Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large Language Models Are Versatile Decomposers: Decomposing Evidence and Questions for Table-Based Reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pp. 174–184, Taipei, Taiwan, July 2023. Association for Computing Machinery. ISBN 9781450394086. doi: 10.1145/3539618.3591708. URL https://doi.org/10.1145/3539618.3591708.
- Yu et al. (2024a) Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of Retrieval-Augmented Generation: A Survey. In Wenwu Zhu, Hui Xiong, Xiuzhen Cheng, Lizhen Cui, Zhicheng Dou, Junyu Dong, Shanchen Pang, Li Wang, Lanju Kong, and Zhenxiang Chen (eds.), Proceedings of the 12th CCF Conference, BigData, volume 2301 of Communications in Computer and Information Science (CCIS), pp. 102–120, Qingdao, China, August 2024a. Springer Nature. ISBN 978-981-96-1024-2. doi: 10.1007/978-981-96-1024-2_8. URL https://link.springer.com/chapter/10.1007/978-981-96-1024-2_8.
- Yu et al. (2024b) Wenhao Yu, Hongming Zhang, Xiaoman Pan, Peixin Cao, Kaixin Ma, Jian Li, Hongwei Wang, and Dong Yu. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP ’24, pp. 14672–14685, Miami, FL, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.813. URL https://aclanthology.org/2024.emnlp-main.813/.
- Zeng et al. (2024) Huimin Zeng, Zhenrui Yue, Qian Jiang, and Dong Wang. Federated Recommendation via Hybrid Retrieval Augmented Generation. In Wei Ding, Chang-Tien Lu, Fusheng Wang, Liping Di, Kesheng Wu, Jun Huan, Raghu Nambiar, Jundong Li, Filip Ilievski, Ricardo Baeza-Yates, and Xiaohua Hu (eds.), Proceedings of the IEEE International Conference on Big Data, BigData ’24, pp. 8078–8087, Washington, DC, USA, December 2024. IEEE Press. doi: 10.1109/BigData62323.2024.10825302. URL https://ieeexplore.ieee.org/document/10825302.
- Zhang et al. (2025) Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. Cut the Crap: An Economical Communication Pipeline for LLM-Based Multi-Agent Systems. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), Proceedings of the Thirteenth International Conference on Learning Representations, ICLR ’25, pp. 75389–75428, Singapore, April 2025. International Conference on Learning Representations. URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/bbc461518c59a2a8d64e70e2c38c4a0e-Abstract-Conference.html.
- Zhao et al. (2024a) Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632–19642, March 2024a. doi: 10.1609/aaai.v38i17.29936. URL https://ojs.aaai.org/index.php/AAAI/article/view/29936.
- Zhao et al. (2024b) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-Augmented Generation for AI-Generated Content: A Survey, June 2024b. URL https://arxiv.org/abs/2402.19473. arXiv:2402.19473.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Processing Systems, pp. 46595–46623, New Orleans, LA, USA, December 2023. Curran Associates. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
- Zhu et al. (2025) Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen, and Ningyu Zhang. KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3709–3732, Albuquerque, NM, USA, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.205/.
- Zhu et al. (2024) Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. Large Language Models Can Learn Rules, December 2024. URL https://arxiv.org/abs/2310.07064. arXiv:2310.07064.
- Zhuge et al. (2024) Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language Agents as Optimizable Graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML ’24), volume 235 of Proceedings of Machine Learning Research, pp. 62743–62767, Vienna, Austria, July 2024. PMLR. URL https://proceedings.mlr.press/v235/zhuge24a.html.
Appendix A Additional Examples of Knowledge Graph Representation of Tasks
We include selected snapshots of the KG representations of tasks, covering a wide range of graph structures from simple chains to trees and cyclic graphs. Each snapshot captures the current KG state in a JSON file, exported using a predefined query that retrieves all labeled nodes and edges. Regardless of the underlying graph backend, the consistent export format allows all snapshots to be visualized through Neo4j’s built-in web interface. In the following, we showcase illustrations of such snapshots together with task statements from the GAIA validation set. Note that the GAIA benchmark discourages making its tasks accessible to crawlers. To honor this, we replaced the names of entities with placeholders in the following examples, while keeping the overall structure intact.
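The export step described above can be sketched as a minimal, stdlib-only Python snippet. The field names (`nodes`, `edges`, `source`, `target`, `type`) are assumptions for illustration; the actual KGoT snapshot schema may differ.

```python
import json

# Serialize all labeled nodes and edges of the current KG state into a
# single JSON document that a backend-agnostic graph viewer can render.
# (Hypothetical schema; the real export query depends on the backend.)
def export_snapshot(nodes, edges):
    return json.dumps({
        "nodes": [{"id": nid, "label": label} for nid, label in nodes.items()],
        "edges": [{"source": s, "target": t, "type": r} for s, t, r in edges],
    }, indent=2)

# Tiny example state: a Date -> Word -> Quote chain.
nodes = {"date1": "Date", "word1": "Word", "quote1": "Quote"}
edges = [("date1", "word1", "HAS_DATE"), ("word1", "quote1", "HAS_QUOTE")]
snapshot = export_snapshot(nodes, edges)
```

Because every backend emits the same node/edge layout, a single visualization path suffices for all snapshots.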
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Task Resolution
### Overview
The image depicts a diagram illustrating KGoT (Knowledge Graph of Thoughts) task resolution for answering a specific question. The diagram shows a question input on the left and the corresponding enhanced knowledge graph representation on the right, connected by an arrow indicating the resolution process.
### Components/Axes
The diagram consists of two main sections:
1. **Question Input:** A rectangular box containing the question "What writer is quoted by Merriam-Webster for the Word of the Day from [date]?" along with a list of required tools: Web browser, Search engine, and Audio capability.
2. **Enhanced Knowledge Graph:** A larger, lavender-colored rectangular area representing the knowledge graph. This graph consists of nodes and edges representing entities and relationships.
### Detailed Analysis or Content Details
The knowledge graph contains the following nodes and relationships:
* **Date:** A black circular node.
* **Word:** A white rectangular node labeled "Word". An edge labeled "HAS DATE" connects "Date" to "Word".
* **Quote:** A white circular node labeled "Quote". An edge labeled "HAS QUOTE" connects "Word" to "Quote". The "Quote" node has an ellipsis (...) indicating further details.
* **[firstname lastname]:** A white rectangular node representing the author's name. An edge labeled "QUOTED BY" connects "Quote" to "[firstname lastname]".
* **[concept]:** A white rectangular node labeled "concept". An edge labeled "HAS QUOTE" connects "Quote" to "[concept]".
The arrow labeled "KGoT Task Resolution" points from the question input box to the enhanced knowledge graph, indicating the transformation of the question into a graph representation.
### Key Observations
The diagram illustrates how a natural language question is translated into a structured knowledge graph representation. The graph highlights the key entities (Date, Word, Quote, Author) and their relationships (HAS DATE, HAS QUOTE, QUOTED BY). The use of brackets around "firstname lastname" and "concept" suggests these are placeholders for specific values that would be filled in during the resolution process.
### Interpretation
This diagram demonstrates a method for question answering using knowledge graphs. The KGOT process converts a natural language query into a graph structure, enabling a system to reason about the relationships between entities and retrieve the answer. The required tools (Web browser, Search engine, Audio capability) suggest that the system may leverage external resources to populate the knowledge graph or refine the query. The diagram highlights the importance of representing knowledge in a structured format to facilitate automated reasoning and information retrieval. The ellipsis on the "Quote" node suggests that the system may need to access additional information to fully resolve the query. The diagram is a conceptual illustration of the process and does not contain specific data or numerical values.
</details>
Figure 6: Example of a chain structure. This task requires 7 intermediate steps and the usage of 3 tools. The expected solution is “[firstname lastname]”. KGoT invokes the Surfer agent to search for relevant pages, locate the relevant quote, and find the person who said it. All intermediate information is successfully retrieved and used for enhancing the dynamically constructed KG. The quote contains two properties, significance and text. “significance” stores the meaning of the quote, whereas “text” stores the actual quote.
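The chain structure of this example can be sketched with plain Python data. This is a simplification for illustration only: the real system stores the KG in a graph backend, and all names and property values below are placeholders.

```python
# Chain-shaped KG: Date -> Word -> Quote -> author. The Quote node
# carries the two properties named in the caption.
kg = {
    "nodes": {
        "date":   {"label": "Date"},
        "word":   {"label": "Word"},
        "quote":  {"label": "Quote",
                   "significance": "meaning of the quote",
                   "text": "the actual quote"},
        "author": {"label": "Person", "name": "[firstname lastname]"},
    },
    "edges": [
        ("date", "word", "HAS_DATE"),
        ("word", "quote", "HAS_QUOTE"),
        ("quote", "author", "QUOTED_BY"),
    ],
}

# Answering the task amounts to walking the chain until a node with no
# outgoing edges (the author) is reached.
def follow_chain(kg, start):
    node, path = start, [start]
    while True:
        successors = [t for s, t, _ in kg["edges"] if s == node]
        if not successors:
            return path
        node = successors[0]
        path.append(node)
```

Calling `follow_chain(kg, "date")` visits the four nodes in order and terminates at the author node.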
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Task Resolution
### Overview
The image depicts a diagram illustrating the resolution of a Knowledge Graph (KG) task, specifically question 51. The diagram shows a question presented on the left, and the resulting enhanced knowledge graph on the right, connected by an arrow labeled "KGoT Task Resolution". The knowledge graph represents relationships between entities like "Pope" and "Bishop" through the "CO-CONSECRATED" relation.
### Components/Axes
The diagram consists of two main sections:
1. **Question Section (Left):** Contains the question text and required tools.
2. **Enhanced Knowledge Graph Section (Right):** Displays the graph with nodes and edges representing entities and relationships.
The question section includes:
* **Question:** "The [museum name] has a portrait in its collection with an accession number of [number]. Of the consecrators and co-consecrators of this portrait's subject as a bishop, what is the name of the one who never became pope?"
* **Required Tools:**
* Web browser (icon of a globe with a spiderweb)
* Search engine (icon of a magnifying glass)
The knowledge graph section includes:
* **Nodes:** Represent entities like "Pope", "Bishop", and individuals represented by placeholders like "[firstname1 lastname1]", "[firstname2 lastname2]", and "[firstname3 lastname3]". Nodes are represented as black circles.
* **Edges:** Represent relationships between entities, labeled "CO-CONSECRATED". Edges are represented as arrows.
* **Title:** "Enhanced Knowledge Graph"
### Detailed Analysis or Content Details
The knowledge graph shows the following relationships:
* A node labeled "[firstname1 lastname1]" is connected to a "Pope" node via a "CO-CONSECRATED" edge.
* A "Bishop" node is connected to a node labeled "[firstname2 lastname2]" via a "CO-CONSECRATED" edge.
* The "Bishop" node is also connected to a "Pope" node labeled "[firstname3 lastname3]" via a "CO-CONSECRATED" edge.
* The "Pope" node labeled "[firstname3 lastname3]" is connected to another "Pope" node via a "CO-CONSECRATED" edge.
The question asks for the name of the individual who was a co-consecrator of a bishop but never became pope. Based on the graph, the possible candidates are "[firstname1 lastname1]" and "[firstname2 lastname2]". The question implies that only one of them did not become pope.
### Key Observations
The diagram illustrates how a complex question can be represented and potentially answered using a knowledge graph. The placeholders in the graph suggest that the specific entities are to be filled in based on the information retrieved using the required tools (web browser and search engine). The graph structure highlights the relationships between individuals and their roles (Bishop, Pope) and the "CO-CONSECRATED" relationship.
### Interpretation
The diagram demonstrates a process for answering a complex question using a knowledge graph. The question requires identifying individuals connected to a specific bishop through the "CO-CONSECRATED" relationship and then filtering those individuals based on whether they became a pope. The knowledge graph provides a visual representation of the relevant entities and relationships, making it easier to identify the answer. The use of placeholders indicates that the graph is a template that will be populated with specific data retrieved from external sources. The diagram suggests that the KGoT Task Resolution process transforms the initial question into a structured knowledge graph representation, facilitating the answer retrieval process. The diagram does not provide the actual answer, but rather illustrates the method for finding it.
</details>
Figure 7: Example of a tree structure. This task requires 6 intermediate steps and the usage of 2 tools. The expected solution is “[firstname1 lastname1]”. The Surfer agent is also invoked for this task. In this KG representation of the task, [popename] is identified as the consecrator, while [firstname1 lastname1], [firstname2 lastname2] and [firstname3 lastname3] are all co-consecrators. Subsequently, KGoT obtains the correct answer from the KG by correctly identifying [firstname1 lastname1] as the one without any labels.
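The final filtering step can be illustrated with a small, hypothetical sketch (names are placeholders, as in the figure): among the co-consecrators stored in the KG, the answer is the single node that never carries the "Pope" label.

```python
# Placeholder data mirroring the KG of Figure 7: each co-consecrator is
# mapped to the set of labels attached to its node.
co_consecrators = {
    "[firstname1 lastname1]": set(),        # no labels -> the answer
    "[firstname2 lastname2]": {"Pope"},
    "[firstname3 lastname3]": {"Pope"},
}

# The question asks for the co-consecrator who never became pope.
never_pope = [name for name, labels in co_consecrators.items()
              if "Pope" not in labels]
```

The structured representation turns the question into a simple label check over the candidate nodes.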
<details>
<summary>x17.png Details</summary>

### Visual Description
## Diagram: Knowledge Graph Task Resolution
### Overview
The image depicts a diagram illustrating the resolution of a knowledge graph task. The task involves answering the question: "How many studio albums were published by [firstname lastname] between [year] and [year] (included)? You can use the latest 2022 version of English Wikipedia." The diagram shows the transformation from the initial question to an "Enhanced Knowledge Graph" representation.
### Components/Axes
The diagram consists of three main sections:
1. **Question & Tools:** A light-grey box containing the question and the required tools (Web browser and Search engine).
2. **Transformation:** A pair of grey arrows indicating the task resolution process.
3. **Enhanced Knowledge Graph:** A lavender-colored box containing the knowledge graph representation.
The knowledge graph itself consists of nodes and edges:
* **Nodes:** Represent entities like "YEAR", "[album name 1]", "[album name 2]", "[album name 3]", "[album name 4]", and "[firstname lastname]". These are depicted as black circles.
* **Edges:** Represent relationships between entities, labeled "RELEASED". These are depicted as black arrows.
### Detailed Analysis or Content Details
The question section states:
* **Question:** "How many studio albums were published by [firstname lastname] between [year] and [year] (included)? You can use the latest 2022 version of English Wikipedia."
* **Required Tools:**
* Web browser (icon of a globe)
* Search engine (icon of a magnifying glass)
The Enhanced Knowledge Graph shows the following relationships:
* "[album name 1]" is "RELEASED" in "YEAR".
* "[album name 2]" is "RELEASED" in "YEAR".
* "[album name 3]" is "RELEASED" in "YEAR".
* "[album name 4]" is "RELEASED" in "YEAR".
* "[firstname lastname]" is "RELEASED" "[album name 1]", "[album name 2]", "[album name 3]", and "[album name 4]".
The graph structure suggests that the task is resolved by identifying albums released by the specified artist within the given year range. The graph represents the relationships needed to answer the question.
### Key Observations
* The diagram uses bracketed placeholders ([...]) for specific values (artist name, years, album names). This indicates a generalized representation of the task.
* The graph is directed, showing the direction of the "RELEASED" relationship.
* The graph structure is relatively simple, suggesting a straightforward task resolution process.
### Interpretation
The diagram illustrates how a natural language question can be transformed into a structured knowledge graph representation. This representation facilitates the retrieval of information from a knowledge base (like Wikipedia) to answer the question. The graph shows the key entities and relationships needed to determine the number of studio albums published by a given artist within a specified time frame. The use of a knowledge graph allows for a more precise and efficient search compared to simply querying text. The diagram highlights the importance of representing knowledge in a structured format for automated reasoning and question answering. The placeholders suggest that the system is designed to handle various artist names, years, and album titles.
</details>
Figure 8: Example of a tree structure. This task requires 4 intermediate steps and the usage of 2 tools. The expected solution is “4”. This is a trap question where only the studio albums should be taken into account. In addition to years, the type of the albums is also stored as a property in the KG. Please note that the original GAIA task has a different solution, which we do not want to reveal.
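The property-based filtering that makes this a trap question can be sketched as follows. All data below is invented for illustration; the point is that both the type property and the inclusive year range must be checked.

```python
# Placeholder album records mirroring the node properties in the KG.
albums = [
    {"name": "[album name 1]", "type": "studio", "year": 2001},
    {"name": "[album name 2]", "type": "studio", "year": 2003},
    {"name": "[album name 3]", "type": "studio", "year": 2005},
    {"name": "[album name 4]", "type": "studio", "year": 2007},
    {"name": "[album name 5]", "type": "live",   "year": 2004},  # the trap
]
lo, hi = 2000, 2009  # inclusive range from the task statement (placeholder years)

# Count only studio albums released within the inclusive range.
count = sum(1 for a in albums
            if a["type"] == "studio" and lo <= a["year"] <= hi)
```

Storing the album type as a node property is what lets the filter exclude the live album and reach the expected count of 4.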
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: Enhanced Knowledge Graph for KGOT Task Resolution
### Overview
The image presents a directed graph illustrating the process of KGoT (Knowledge Graph of Thoughts) task resolution. The graph depicts how a script generates a URL, which then leads to source code. This source code is processed into an array, sorted, and ultimately results in the sum of specific integers within the sorted array. The diagram is split into two sections: a text description of a task on the left, and the knowledge graph on the right.
### Components/Axes
The diagram consists of nodes (black circles) representing entities and edges (arrows) representing relationships between them. The nodes are labeled with terms like "Script", "URL", "SourceCode", "Array", "SortedArray", "Integer", and "KGOT Task Resolution". The edges are labeled with relationship types such as "LEADS_TO", "GENERATES", "PROCESSES", "SORTS_TO", "HAS_INTEGER", and "SUMS_WITH".
### Detailed Analysis or Content Details
The graph can be traced as follows:
1. **KGOT Task Resolution** `GENERATES` **Script**.
2. **Script** `GENERATES` **URL**.
3. **URL** `LEADS_TO` **SourceCode**.
4. **SourceCode** `PROCESSES` **Array**.
5. **Array** `SORTS_TO` **SortedArray**.
6. **SortedArray** `HAS_INTEGER` **42**.
7. **SortedArray** `HAS_INTEGER` **23**.
8. **SortedArray** `HAS_INTEGER` **65**.
9. **42** and **23** `RESULTS_IN` **65**.
10. **65** is an **Integer**.
11. **42** is an **Integer**.
12. **23** is an **Integer**.
13. **42** and **65** `SUMS_WITH` **Integer**.
The text on the left side of the image describes the task: "The attached image contains a Python script. Run the Python code against an array of strings, listed below. The output of the Python script will be a URL containing C++ source code. Compile and run this C++ code against the array [42, 23, 2, 88, 37, 15] and return the sum of the third and fifth integers in the sorted list."
The array is: `arr = ['URL', 'ele', 'me', 'nts', 'as', 'sho', 'str', 'ings']`
The required tools are indicated by icons at the bottom left:
1. Web browser
2. Search engine
3. File handling
4. Computer vision
5. OCR
6. Code execution
7. Calculator
### Key Observations
The diagram visually represents the flow of data and operations involved in the KGOT task resolution. The final result is the integer 65, which is the sum of the third and fifth integers in the sorted array [2, 15, 23, 37, 42, 88]. The diagram highlights the dependencies between different components and the relationships between them.
### Interpretation
The diagram illustrates the KGoT task-resolution pipeline. The process begins with a task resolution, which generates a script. This script then produces a URL, leading to source code. The source code is executed on an array of data, resulting in a sorted array and ultimately a numerical result. The diagram emphasizes the importance of each step in the process and how they contribute to the final outcome. The inclusion of the array and the final sum (65) suggests that the diagram is a simplified representation of a more complex process, focusing on the key steps and their relationships. The required tools indicate the need for a combination of software and potentially manual intervention (OCR) to complete the task. The diagram is a high-level overview, abstracting away the details of the Python script and C++ code.
</details>
Figure 9: Example of a cyclic graph structure. This task requires 7 intermediate steps and the usage of 6 tools. The expected solution is "65". Here, the Array node has the property "values" with $[42,23,2,88,37,15]$, while SortedArray contains the correctly sorted values $[2,15,23,37,42,88]$. The final solution "65" is correctly retrieved and parsed as the KGoT response. Please note that we used different array values than in the original GAIA task.
A.1 Graph Storage Representation of Knowledge Graph Examples
We now illustrate two examples of knowledge graphs and how they are represented in Neo4j and NetworkX, respectively, as well as the queries used to extract the final solution. Please note again that we either replaced the values with placeholders (first question) or used different values (second question) in order not to leak the GAIA benchmark questions.
We start with GAIA question 59, which is illustrated in Figure 6. The knowledge graph stored in Neo4j after the first iteration is shown in the code snippet below.
Neo4j KG representation while processing question 59.
```text
Nodes:
Label: Writer {neo4j_id:0, properties:{'name': '[firstname lastname]'}}
Label: WordOfTheDay {neo4j_id:1, properties:{'pronunciation': '[concept]', 'definition': 'textual definition', 'counter': 1, 'origin': 'some war between year-year', 'word': '[concept]', 'date': '[date1]'}}
Label: Quote {neo4j_id:2, properties:{'text': '[quote]', 'source': '[newspaper name]', 'date': '[date2]'}}
Relationships:
Label: QUOTED_FOR {source: {neo4j_id: 0, label: Writer}, target: {neo4j_id: 1, label: WordOfTheDay}, properties: {}}
Label: QUOTED_IN {source: {neo4j_id: 0, label: Writer}, target: {neo4j_id: 2, label: Quote}, properties: {}}
```
The Cypher query used to extract the solution was the following:
Cypher query to extract the solution for question 59.
```cypher
MATCH (w:Writer)-[:QUOTED_FOR]->(wod:WordOfTheDay {date: '[date1]'})
RETURN w.name AS writer_name
```
To illustrate the use of NetworkX, we use a knowledge graph for question 106 (shown in Figure 9) from the GAIA benchmark after the second iteration.
NetworkX KG representation while processing question 106.
```text
Existing Nodes:
Label: Function [{id:A1, properties:{'name': 'image_inspector'}}, {id:call_X2CcPnp5acMUPAp1Qx3OTvKx, properties:{'name': 'image_inspector', 'args': {'question': 'What Python script is depicted in the attached image?', 'full_path_to_image': '[filepath].png'}}}]
Label: Script [{id:A2, properties:{'description': 'Python script to construct a URL by combining a base URL with specific indices from an array'}}]
Label: Array [{id:A3, properties:{'content': "['URL', 'ele', 'me', 'nts', 'as', 'sho', 'rt', 'str', 'ings']"}}]
Label: URL [{id:A4, properties:{'base': '[base URL]', 'indices': [some indices]}}]
Existing Relationships:
Label: uses [{source: {id: A1}, target: {id: A2}, properties: {}}]
Label: contains [{source: {id: A2}, target: {id: A3}, properties: {}}]
Label: constructs [{source: {id: A2}, target: {id: A4}, properties: {}}]
Label: None [{source: {id: call_X2CcPnp5acMUPAp1Qx3OTvKx}, target: {id: A2}, properties: {}}]
```
The following Python code was used to extract the final solution:
Python code to extract the solution for question 106.
```python
# Retrieve the base URL and indices to construct the final URL
base_url = self.G.nodes['A4']['base']
indices = self.G.nodes['A4']['indices']
# Retrieve the array content
arr = eval(self.G.nodes['A3']['content'])
# Construct the URL using the specified indices
constructed_url = base_url + ''.join(arr[i] for i in indices)
# The next step would be to compile and run the C++ code from the constructed
# URL, but since we cannot execute external code, we will simulate the sorting
# and summing process in Python.
# Simulating the C++ code execution with the given array
sorted_arr = sorted([2, 15, 23, 37, 42, 88])
# Sum of the third and fifth integers in the sorted list
result = sorted_arr[2] + sorted_arr[4]
```
After the code execution, the correct solution of 65 is obtained.
Appendix B Additional Details on System Design & Implementation
B.1 Controller
The Controller is the central orchestrator of the KGoT system, responsible for managing the interaction between the knowledge graph and the integrated tools. When a user submits a query, the Controller initiates the reasoning process by interpreting the task and coordinating the steps required for its resolution.
To offer fine-grained control over the KGoT control logic, the following parameters can be configured:
- num_next_steps_decision: Number of times to prompt an LLM on how to proceed (Solve/Enhance). Defaults to 5.
- max_retrieve_query_retry: Maximum retries for a Solve query when the initial attempt fails. Defaults to 3.
- max_cypher_fixing_retry: Maximum retries for fixing a Cypher query that encounters errors. Defaults to 3.
- max_final_solution_parsing: Maximum retries for parsing the final solution from the output of the Solve query. Defaults to 3.
- max_tool_retries: Maximum number of retries when a tool invocation fails. Defaults to 6.
Controller classes derived from the ControllerInterface abstract class embed these parameters with class-specific default values. Users can also experiment with custom values. We discuss how the choice of these parameters impacts system robustness in Appendix B.2.
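For illustration, such a parameter set could be grouped into a single configuration object. The following is a hypothetical sketch (the class name is ours; the field names mirror the options listed above, with the stated defaults):

```python
from dataclasses import dataclass

# Illustrative sketch only: field names mirror the Controller options listed
# above, and the defaults match the values stated in the text.
@dataclass
class ControllerConfig:
    num_next_steps_decision: int = 5     # votes for the Solve/Enhance decision
    max_retrieve_query_retry: int = 3    # retries for a failed Solve query
    max_cypher_fixing_retry: int = 3     # retries for fixing a Cypher query
    max_final_solution_parsing: int = 3  # retries for parsing the solution
    max_tool_retries: int = 6            # retries for a failed tool invocation

# Users can override individual parameters while keeping the other defaults.
cfg = ControllerConfig(max_tool_retries=3)
```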
B.1.1 Architecture
The KGoT Controller employs a dual-LLM architecture with a clear separation of roles between constructing the knowledge graph (managed by the LLM Graph Executor) and interacting with tools (managed by the LLM Tool Executor). The following discussion provides additional specifics to the workflow description in Section 4.
The LLM Graph Executor is responsible for decision making and orchestrating the knowledge graph-based task resolution workflow, leading to different pathways (Solve or Enhance).
- define_next_step: Determine the next step. This function is invoked up to num_next_steps_decision times to collect replies from an LLM, which are subsequently used with a majority vote to decide whether to retrieve information from the knowledge graph for solving the task (Solve) or insert new information (Enhance).
- _insert_logic: Run Enhance. Once we have successfully executed tool calls and gathered new information, the system generates the Enhance query or queries to modify the knowledge graph accordingly. Each Enhance query is executed and its output is validated.
- _retrieve_logic: Run Solve. If the majority vote directs the system to the Solve pathway, a predefined solution technique (direct or query-based retrieve) is used for the solution generation.
- _get_math_response: Apply additional mathematical processing (optional).
- parse_solution_with_llm: Parse the final solution into a suitable format and prepare it as the KGoT response.
The LLM Tool Executor decides which tools to use and handles the interaction with these tools.
- define_tool_calls: Define tool calls. The system orchestrates the appropriate tool calls based on the knowledge graph state.
- _invoke_tools_after_llm_response, _invoke_tool_with_retry: Run tool calls with or without retry.
B.2 Enhancing System Robustness
Given the non-deterministic nature of LLMs and their potential for generating hallucinations (Kaddour et al., 2023), the robustness of KGoT has been a fundamental focus throughout its design and implementation. Ensuring that the system consistently delivers accurate and reliable results across various scenarios is paramount. One of the key strategies employed to enhance robustness is the use of majority voting, also known as Self-Consistency (Wang et al., 2023b). In KGoT, majority voting is implemented by querying the LLM multiple times (by default 5 times) when deciding the next step, whether to insert more data into the knowledge graph or retrieve existing data. This approach reduces the impact of single-instance errors or inconsistencies, ensuring that the decisions made reflect the LLM's most consistent reasoning paths.
The choice of defaulting to five iterations for majority voting is a strategic balance between reliability and cost management, and was based on the work by Wang et al. (2023b), which showed diminishing returns beyond this point.
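The voting step itself reduces to picking the most frequent reply. A minimal sketch, with a hypothetical `majority_vote` helper rather than KGoT's actual implementation:

```python
from collections import Counter

def majority_vote(replies):
    """Return the most frequent decision among the collected LLM replies."""
    decision, _count = Counter(replies).most_common(1)[0]
    return decision

# With five replies, up to two inconsistent answers cannot flip the outcome.
votes = ["Enhance", "Solve", "Enhance", "Enhance", "Solve"]
decision = majority_vote(votes)  # -> "Enhance"
```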
In addition, KGoT uses a separate default iteration count of seven for executing its full range of functions during problem-solving. These seven iterations correspond to the typical number of tool calls required to thoroughly explore the problem space, including multiple interactions with tools like the Surfer agent and the external LLM. Unlike the five iterations of majority voting, which ensure robustness, this strategy ensures the system leverages its resources effectively across multiple tool invocations before concluding with a "No Solution" response if the problem remains unresolved.
Layered Error-Checking: KGoT integrates multiple error-checking mechanisms to safeguard against potential issues. The system continuously monitors for syntax errors and failures in API calls. These mechanisms are complemented by custom parsers and retry protocols. The parsers, customized from LangChain (LangChain Inc., 2025d), are designed to extract the required information from the LLM's responses, eliminating the need for manual parsing. In cases where errors persist despite initial correction attempts, the system employs retry mechanisms, in which the LLM rephrases the Cypher queries and tries them again. The Controller's design includes a limit on the number of retries for generating Cypher queries and invoking tools, balancing the need for error resolution with the practical constraints of time and computational resources. More information can be found in the subsequent section.
B.3 Error Management Techniques
B.3.1 Handling LLM-Generated Syntax Errors
Syntax errors generated by LLMs can disrupt the workflow of KGoT, potentially leading to incorrect or incomplete solutions, or even causing the system to fail entirely. To manage these errors, KGoT includes LangChainâs JSON parsers (LangChain Inc., 2025d) that detect syntax issues.
When a syntax error is detected, the system first attempts to correct it by adjusting the problematic syntax using different encoders, such as "unicode_escape" (Python Software Foundation, 2025a). If the issue persists, KGoT employs a retry mechanism that uses the LLM to rephrase the query/command and attempts to regenerate its output. This retry mechanism is designed to handle up to three attempts, after which the system logs the error for further analysis, bypasses the problematic query, and continues with other iterations in the hope that another tool or LLM call will still be able to resolve the problem.
A significant issue encountered with LLM-generated responses is managing escape characters, especially when returning a Cypher query inside the standard JSON structure expected by the LangChain parser. The combination of retries using different encoders and parsers has mitigated the problem, though not entirely resolved it. Manual parsing and the use of regular expressions have also been attempted, but with limited success.
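The layered recovery described above can be sketched with the standard library alone; `parse_llm_json` below is a simplified stand-in for the LangChain-based parsers and retry logic, not the actual implementation:

```python
import json

def parse_llm_json(raw: str, max_retries: int = 3):
    """Try to parse an LLM reply as JSON; on failure, re-interpret escape
    sequences (e.g. via "unicode_escape") and retry, up to max_retries."""
    for _attempt in range(max_retries):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Correction attempt: re-decode escape characters, which often
            # repairs over-escaped quotes around embedded Cypher queries.
            raw = raw.encode().decode("unicode_escape")
    # Persistent failure: the caller logs the error and skips this query.
    return None
```

For example, an over-escaped reply such as `{\"query\": \"MATCH (n) RETURN n\"}` fails to parse directly but succeeds after one `unicode_escape` pass.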
B.3.2 Managing API and System Errors
API-related errors, such as the OpenAI code "500" errors, are a common challenge in the operation of KGoT, especially when the external servers are overwhelmed. To manage these errors, the primary strategy employed is exponential backoff, a technique where the system waits for progressively longer intervals before retrying a failed API call, reducing the likelihood of repeated failures due to temporary server issues or rate limits (Tenacity Developers, 2025b). In KGoT, this approach is implemented using the tenacity library, with a retry policy that waits for random intervals ranging from 1 to 60 seconds and allows for up to six retry attempts (wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6)).
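A standard-library sketch of this retry policy may help clarify it; KGoT itself uses the tenacity decorators quoted above, and `call_with_backoff` is merely an illustrative equivalent:

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, min_wait=1, max_wait=60,
                      sleep=time.sleep):
    """Retry fn() with random exponential backoff: wait a random interval
    (capped at max_wait seconds) before each retry, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the API error to the caller
            # Random exponential jitter, mirroring wait_random_exponential.
            sleep(min(max_wait, random.uniform(0, min_wait * 2 ** attempt)))
```

The injectable `sleep` parameter is an illustration-only convenience that makes the sketch testable without real waiting.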
Additionally, KGoT includes comprehensive logging systems as part of its error management framework. These systems track the errors encountered during system operation, providing valuable data that can be easily parsed and analyzed (e.g. snapshots of the knowledge graphs or responses from third-party APIs). This data can then be used to refine the systemâs error-handling protocols and improve overall reliability.
It is also important to note that the system's error management strategies are built on top of the existing error-handling mechanisms provided by external tools, such as the LangChain interface for OpenAI, which already implements a default exponential backoff strategy with up to six retries (LangChain Inc., 2025b). These built-in mechanisms complement KGoT's own error-handling strategies, creating a multi-layered defense against potential failures and ensuring high levels of system reliability.
B.4 Detailed Tool Description
Tools are a fundamental component of the KGoT framework, enabling seamless interaction with external resources such as the web and various file formats. KGoT currently supports the following tools:
- Python Code Tool: Executes code snippets provided by the LLM in a secure Python environment hosted within a Docker (or Sarus) container. This ensures that any potential security risks from executing untrusted code are mitigated. Besides running code, this tool is also utilized for mathematical computations.
- Large Language Model (LLM) Tool: Allows the LLM Tool Executor to request data generation from another instance of the same LLM. It is primarily employed for simple, objective tasks where no other tool is applicable.
- Surfer Agent: This web browser agent leverages SerpAPI to perform efficient Google searches and extract relevant webpage data. Built on Hugging Face Agents (Roucher & Petrov, 2025), this tool combines their capabilities with our WebCrawler and Wikipedia tools while adding support for JavaScript-rendered pages. It uses viewport segmentation to prevent the "lost in the middle" effect and incorporates additional navigation functionalities, such as search and page traversal.
- ExtractZip Tool: Extracts data from compressed files (e.g., ZIP archives). It was enhanced through integration with the TextInspector Tool, enabling seamless analysis of extracted files without requiring additional iterations to process the data.
- TextInspector Tool: A versatile tool for extracting data from multiple file types, including PDFs, spreadsheets, MP3s, and YouTube videos. It organizes extracted content in Markdown format, enhancing readability and integration into the Knowledge Graph. The tool was augmented with the best components from our original MultiModal Tool and the Hugging Face Agents TextInspector Tool. It can directly process questions about extracted content without returning the raw data to the LLM.
- Image Tool: Extracts information from images, such as text or objects, and returns it in a structured format. This tool is crucial for tasks requiring image processing and analysis. We selected the best prompts from our original tool set as well as Hugging Face Agents to optimize data extraction and analysis.
Tool integration within the KGoT framework is crucial for extending the system's problem-solving capabilities beyond what is achievable by LLMs alone. The strategy is designed to be modular, scalable, and efficient, enabling the system to leverage a diverse array of external tools for tasks such as data retrieval, complex computations, document processing, and more.
B.4.1 Modular Tool Architecture
All tools integrated into the KGoT system are built upon the BaseTool abstraction provided by the LangChain framework (LangChain Inc., 2025c). This standardized approach ensures consistency and interoperability among different tools, facilitating seamless integration and management of new tools. Each tool implementation adheres to the following structure:
- tool_name: A unique identifier for the tool, used by the system to reference and invoke the appropriate functionality.
- description: A detailed explanation of the tool's purpose, capabilities, and appropriate usage scenarios. This description assists the LLM Tool Executor in selecting the right tool for specific tasks. Including few-shot examples is recommended, though the description must adhere to the 1024-character limit imposed by BaseTool.
- args_schema: A schema defining the expected input arguments for the tool, including their types and descriptions. This schema ensures that the LLM Tool Executor provides correctly formatted and valid inputs when invoking the tool.
This structured definition enables the LLM Tool Executor to dynamically understand and interact with a wide array of tools, promoting flexibility and extensibility within the KGoT system.
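As an illustration, the three elements of a tool definition could look as follows. This is a plain-Python stand-in, not the actual LangChain BaseTool API, and the calculator tool with its schema is hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for a tool definition with the three elements
# described above (not the LangChain BaseTool API).
@dataclass
class ToolSpec:
    tool_name: str        # unique identifier used to invoke the tool
    description: str      # usage guidance; must stay under 1024 characters
    args_schema: dict = field(default_factory=dict)  # expected inputs

calculator = ToolSpec(
    tool_name="calculator",
    description=("Evaluates arithmetic expressions. "
                 "Example input: {'expression': '2 + 2'}."),
    args_schema={"expression": {"type": "string",
                                "description": "Expression to evaluate"}},
)
assert len(calculator.description) <= 1024  # BaseTool's description limit
```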
B.4.2 Tool Management and Initialization
The ToolManager component is responsible for initializing and maintaining the suite of tools available to the KGoT system. It handles tasks such as loading tool configurations, setting up necessary environment variables (e.g., API keys), and conducting initial tests to verify tool readiness, such as checking whether the RunPythonCodeTool's Docker container is running. The ToolManager ensures that all tools are properly configured and available for use during the system's operation.
Simplified example of ToolManager initialization.
```python
class ToolManager:
    def __init__(self):
        self.set_env_keys()
        self.tools = [
            LLM_tool(...),
            image_question_tool(...),
            textInspectorTool(...),
            search_tool(...),
            run_python_tool(...),
            extract_zip_tool(...),
            # Additional tools can be added here
        ]
        self.test_tools()

    def get_tools(self):
        return self.tools
```
This modular setup allows for the easy addition or removal of tools, enabling the system to adapt to evolving requirements and incorporate new functionalities as needed.
B.4.3 Information Parsing and Validation
After a tool executes and returns its output, the retrieved information undergoes a parsing and validation process by the LLM Graph Executor before being integrated into the knowledge graph. This process ensures the integrity and relevance of new data:
- Relevance Verification: The content of the retrieved information is assessed for relevance to the original problem context. This step may involve cross-referencing with existing knowledge, checking for logical consistency, and filtering out extraneous or irrelevant details. The LLM Graph Executor handles this during Cypher query generation.
- Integration into Knowledge Graph: Validated and appropriately formatted information is then seamlessly integrated into the knowledge graph by executing each Cypher query (with the required error management as described in Section B.3.1), enriching the system's understanding and enabling more informed reasoning in future iterations.
B.4.4 Benefits
This structured and systematic approach to tool integration and selection offers several key benefits:
- Enhanced Capability: By leveraging specialized tools, KGoT can handle a wide range of complex tasks that go beyond the inherent capabilities of LLMs, providing more comprehensive and accurate solutions.
- Scalability: The modular architecture allows for easy expansion of the tool set, enabling the system to adapt to new domains and problem types with minimal reconfiguration.
- Flexibility: The system's ability to adaptively select and coordinate multiple tools in response to dynamic problem contexts ensures robust and versatile problem-solving capabilities.
B.5 High-Performance & Scalability
As previously discussed, we also experimented with various high-performance computing techniques adopted to accelerate KGoT. This section outlines additional design details.
The acceleration strategies can be classified into two categories: those targeting the speedup of a single task, and those aimed at accelerating the execution of KGoT on a batch of tasks such as the GAIA benchmark.
Optimizations in the first category are:
- Asynchronous Execution: Profiling of the KGoT workflow reveals that a substantial portion of runtime is spent on LLM model calls and tool invocations. As this represents a typical I/O-intensive workload, Python multi-threading is sufficient to address the bottleneck. KGoT dynamically schedules independent I/O operations (based on the current graph state and execution logic) using asyncio to achieve full concurrency.
- Graph Operation Parallelism: KGoT maintains a graph storage backend for managing the knowledge graph. When new knowledge is obtained from the tools, KGoT generates a list of queries, which represent a sequence of graph operations to add or modify nodes, properties, and edges. However, executing these operations sequentially in the graph storage backend can be time-consuming. A key observation is that many of these operations exhibit potential independence. We leveraged this potential parallelism to accelerate these graph storage operations. Our solution involves having KGoT request an LLM to analyze dependencies within the operations and return multiple independent chains of graph storage operations. These chains are then executed concurrently using the asynchronous method proposed earlier, enabling parallel execution of queries on the graph storage. This approach effectively harnesses the inherent parallelism to significantly improve processing speed.
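A minimal asyncio sketch of this pattern follows; the graph-store query is mocked with `asyncio.sleep`, and the chain contents are illustrative:

```python
import asyncio

async def run_chain(chain):
    """Execute one independent chain of graph operations in order."""
    results = []
    for op in chain:
        await asyncio.sleep(0)  # stands in for an I/O-bound graph-store query
        results.append(f"executed {op}")
    return results

async def run_all(chains):
    # Independent chains run concurrently; operations within a chain keep
    # their order, mirroring the LLM-derived dependency analysis.
    return await asyncio.gather(*(run_chain(c) for c in chains))

chains = [["CREATE (a:Writer)", "SET a.name"], ["CREATE (q:Quote)"]]
results = asyncio.run(run_all(chains))
```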
The applied optimizations result in an overall speedup of 2.30× compared to the sequential baseline for a single KGoT task.
The second category focuses on accelerating a batch of tasks, for which MPI-based distributed processing is employed. Additional optimizations have also been implemented to further enhance performance.
- Work Stealing: The work-stealing algorithm operates by allowing idle processors to âstealâ tasks from the queues of busy processors, ensuring balanced workload distribution. Each processor maintains its task queue, prioritizing local execution, while stealing occurs only when its queue is empty. This approach reduces idle time and enhances parallel efficiency. Our implementation of the work-stealing algorithm for KGoT adopts a novel approach tailored for distributed atomic task execution in an MPI environment. Each question is treated as an atomic task, initially distributed evenly across all ranks to ensure balanced workload allocation. When a rank completes all its assigned tasks, it enters a work-stealing phase, prioritizing the rank with the largest queue of remaining tasks. Operating in a peer-to-peer mode without a designated master rank, each rank maintains a work-stealing monitor to handle task redistribution. This monitor tracks incoming requests and facilitates the transfer of the last available task to the requesting rank whenever feasible. The system ensures continuous work-stealing, dynamically redistributing tasks to idle ranks, thus minimizing idle time and maximizing computational efficiency across all ranks. This decentralized and adaptive strategy significantly enhances the parallel processing capabilities of KGoT.
- Container Pool: The container pool implementation for KGoT ensures modular and independent execution of each task on separate ranks by running essential modules, such as Neo4j and the Python tool, within isolated containers, with one container assigned per rank. We use a Kubernetes-like container orchestration tool specifically designed for KGoT running with MPI. The container pool supports Docker and Sarus to be compatible with local and cluster environments. Our design guarantees that each task operates independently without interfering with the others, while minimizing latency between the KGoT controller and the containers.
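The work-stealing policy described above can be illustrated with a toy single-process model; deques stand in for per-rank task queues, and the MPI messaging is omitted:

```python
from collections import deque

def run_with_stealing(queues, execute):
    """Each rank drains its own queue first; once its queue is empty, it
    steals a task from the rank with the largest remaining queue."""
    done = [[] for _ in queues]
    while any(queues):
        for rank, queue in enumerate(queues):
            if queue:
                done[rank].append(execute(queue.popleft()))
            else:
                # Idle rank: pick the busiest victim and steal its last task.
                victim = max(range(len(queues)), key=lambda r: len(queues[r]))
                if queues[victim]:
                    done[rank].append(execute(queues[victim].pop()))
    return done

# Rank 1 finishes its single task early and steals from rank 0.
queues = [deque(["q1", "q2", "q3", "q4"]), deque(["q5"])]
done = run_with_stealing(queues, lambda task: task)
```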
Ultimately, our experiments achieved a 12.74× speedup over the sequential baseline on the GAIA benchmark when executed with 8 MPI ranks, as illustrated in Figure 10. This demonstrates the significant performance improvement of the KGoT system achieved on a consumer-grade platform.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Chart: Speedup vs. Number of Processing Elements
### Overview
This chart depicts the speedup achieved by two different approaches, Work Stealing and Non-Work Stealing, as the number of processing elements (p) increases. The speedup is measured using the Message Passing Interface (MPI). The chart provides performance data for a specific hardware configuration: Apple M3 Pro chip with 12 cores and 16GB of memory, across 30 questions and 2 measurements.
### Components/Axes
* **X-axis:** Number of Processing Elements (p) in Message Passing Interface (MPI). Scale ranges from 1 to 10, with markers at each integer value.
* **Y-axis:** Speedup, defined as S = T<sub>1</sub>/T<sub>p</sub>, where T<sub>1</sub> is the sequential execution time and T<sub>p</sub> is the parallel execution time with p processors. Scale ranges from 2 to 12, with markers at integer values.
* **Legend:** Located at the top-right of the chart.
* Red line with circular markers: Work Stealing
* Black line with cross markers: Non-Work Stealing
* **Metadata:** Located at the top-left of the chart.
* # of Questions = 30
* # of Measurement = 2
* Chip: Apple M3 Pro @ 4.056GHz (12 cores)
* Memory: 16GB
* **Peak Performance Indicator:** Located at the top-right of the chart.
* Peak: 12.74x at p = 8
### Detailed Analysis
**Work Stealing (Red Line):**
The Work Stealing line starts at approximately 1.7 speedup with 1 processing element. It exhibits a generally upward trend, with some fluctuations.
* p = 1: Speedup ≈ 1.7
* p = 2: Speedup ≈ 3.3
* p = 4: Speedup ≈ 5.0
* p = 6: Speedup ≈ 9.2
* p = 8: Speedup ≈ 12.7 (Peak)
* p = 10: Speedup ≈ 11.2
**Non-Work Stealing (Black Line):**
The Non-Work Stealing line also starts at approximately 1.7 speedup with 1 processing element. It shows a more gradual and consistent upward trend compared to Work Stealing.
* p = 1: Speedup ≈ 1.7
* p = 2: Speedup ≈ 3.0
* p = 4: Speedup ≈ 4.5
* p = 6: Speedup ≈ 7.5
* p = 8: Speedup ≈ 8.5
* p = 10: Speedup ≈ 9.5
### Key Observations
* Work Stealing consistently outperforms Non-Work Stealing across all tested numbers of processing elements.
* Work Stealing reaches its peak speedup of approximately 12.7x at 8 processing elements, after which it slightly decreases.
* Non-Work Stealing shows a steady increase in speedup as the number of processing elements increases, without reaching a clear peak within the tested range.
* Both methods start with a similar speedup at 1 processing element.
### Interpretation
The data suggests that Work Stealing is a more effective parallelization strategy than Non-Work Stealing for this specific workload (30 questions) on the given hardware (Apple M3 Pro). The significant speedup achieved by Work Stealing indicates that it efficiently distributes and manages tasks across multiple processing elements. The peak performance at 8 processing elements suggests an optimal balance between parallelization overhead and task distribution. The slight decrease in speedup beyond 8 processing elements could be due to increased communication overhead or diminishing returns from adding more processors. The consistent, but lower, performance of Non-Work Stealing suggests it may be less adaptable to dynamic task loads or less efficient in utilizing available processing resources. The fact that both start at the same speedup indicates that the initial overhead is similar, but the work stealing method is able to scale better.
</details>
Figure 10: Measured parallel speedup of KGoT task execution across varying numbers of MPI processes, under two scheduling strategies: with and without work stealing. Each task corresponds to a GAIA benchmark question, and each data point represents the average of 2 measurements on an Apple M3 Pro (12 cores @ 4.056GHz) with 18GB of memory. The dashed grey line indicates the expected theoretical speedup curve ($S = 2.2985 \times p$) based on the asynchronous optimizations applied to individual tasks. As previously discussed, acceleration strategies are categorized into (1) single-task optimizations, including asynchronous I/O scheduling and graph operation parallelism, and (2) batch-level parallelism using MPI-based distributed processing. The work-stealing variant consistently outperforms the non-stealing baseline by minimizing idle time and dynamically redistributing atomic question tasks across ranks. These combined strategies result in a 12.74× speedup over the sequential baseline when using 8 processes.
B.6 Examples of Noise Mitigation
We illustrate two examples of experiments with noise mitigation in KGoT. As before, we have replaced the specific values with placeholders to prevent the leakage of the GAIA benchmark tasks.
B.6.1 Irrelevance Removal
The first example is based on question 146 in the validation set of the GAIA benchmark:
On [date], an article by [author] was published in [publication]. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by [researcher] supported by?
The example KG has been populated with data directly related to the answer as well as information that is relevant to the question but not necessary for answering it. Removing this extraneous data makes it easier for KGoT to reason about the KG content and extract data relevant to the answer. The data to be removed is marked in red.
Question 146: Initial state of the knowledge graph.
Nodes: Label: Funding {neo4j_id:0, properties:{"award_number": "[award_number]"}} Label: Researcher {neo4j_id:13, properties:{"name": "[researcher]"}} Label: Article {neo4j_id:11, properties:{"author": "[author]", "title": "[title]", "source": "[publication]", "publication_date": "[date]"}} Label: Paper {neo4j_id:12, properties:{"title": "[paper]"}} Relationships: Label: SUPPORTED_BY {source: {neo4j_id: 13, label: Researcher}, target: {neo4j_id: 0, label: Funding}, properties: {}} Label: LINKED_TO {source: {neo4j_id: 11, label: Article}, target: {neo4j_id: 12, label: Paper}, properties: {}} Label: INVOLVES {source: {neo4j_id: 12, label: Paper}, target: {neo4j_id: 13, label: Researcher}, properties: {}}
Question 146: Denoised knowledge graph.
Nodes: Label: Funding {neo4j_id:0, properties:{"award_number": "[award_number]"}} Label: Researcher {neo4j_id:13, properties:{"name": "[researcher]"}} Relationships: Label: SUPPORTED_BY {source: {neo4j_id: 13, label: Researcher}, target: {neo4j_id: 0, label: Funding}, properties: {}}
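A removal like the one above can, in principle, be expressed as a single Cypher statement against the Neo4j backend. The helper below is a hypothetical sketch (in KGoT the LLM itself decides what to remove via the denoising prompt); it merely builds a `DETACH DELETE` statement for a list of node labels deemed irrelevant:

```python
def build_removal_query(labels_to_remove):
    """Build a Cypher statement that detaches and deletes every node
    carrying one of the given labels (hypothetical helper; KGoT's
    denoising step lets the LLM choose what is extraneous)."""
    # DETACH DELETE removes the nodes together with their relationships,
    # so dangling edges such as LINKED_TO and INVOLVES disappear as well.
    match_clause = " OR ".join(f"n:{label}" for label in labels_to_remove)
    return f"MATCH (n) WHERE {match_clause} DETACH DELETE n"
```

For the example above, `build_removal_query(["Article", "Paper"])` would produce a query that removes the Article and Paper nodes along with their LINKED_TO and INVOLVES relationships, leaving only the Funding and Researcher nodes.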
B.6.2 Duplicate Removal
The second example is based on question 25 in the validation set of the GAIA benchmark:
I need to fact-check a citation. This is the citation from the bibliography: [citation1] And this is the in-line citation: Our relationship with the authors of the works we read can often be "[quote]" ([citation2]). Does the quoted text match what is actually in the article? If Yes, answer Yes, otherwise, give me the word in my citation that does not match with the correct one (without any article).
In the example, the knowledge graph has been populated by two nearly identical nodes. The nodes and relationships marked for removal are shown in red.
Question 25: Initial state of the knowledge graph.
Nodes: Label: Quote {neo4j_id:22, properties:{"text": "[quote]"}} {neo4j_id:0, properties:{"text": "[near_identical_quote]"}} Label: Article {neo4j_id:3, properties:{"journal": "[journal]", "page_start": [page_start], "author": "[author]", "page_end": [page_end], "title": "[title]", "issue": [issue], "volume": [volume], "year": [year], "doi": "[doi]"}} Relationships: Label: CONTAINS {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 22, label: Quote}, properties: {}} {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 0, label: Quote}, properties: {}}
Question 25: Denoised knowledge graph.
Nodes: Label: Quote {neo4j_id:22, properties:{"text": "[quote]"}} Label: Article {neo4j_id:3, properties:{"journal": "[journal]", "page_start": [page_start], "author": "[author]", "page_end": [page_end], "title": "[title]", "issue": [issue], "volume": [volume], "year": [year], "doi": "[doi]"}} Relationships: Label: CONTAINS {source: {neo4j_id: 3, label: Article}, target: {neo4j_id: 22, label: Quote}, properties: {}}
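Detecting near-identical nodes such as the two Quote nodes above can also be illustrated programmatically. The following sketch is a hypothetical helper (in KGoT, this judgment is made by the LLM via the denoising prompt) that keeps a single representative of each group of nodes whose `text` properties are almost identical:

```python
from difflib import SequenceMatcher

def drop_near_duplicates(nodes, threshold=0.9):
    """Keep only one representative of each group of near-identical nodes,
    comparing their 'text' property with a string-similarity ratio
    (hypothetical helper; KGoT delegates this decision to the LLM)."""
    kept = []
    for node in nodes:
        is_dup = any(
            SequenceMatcher(None, node["text"], k["text"]).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(node)  # first occurrence becomes the representative
    return kept
```

A threshold close to 1.0 only collapses almost-verbatim duplicates, which matches the fact-checking use case above, where a one-word difference between the two quotes must survive deduplication long enough to be reported.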
Appendix C Additional Details on Prompt Engineering
The primary objectives in our prompt design include improving decision-making processes, effectively managing complex scenarios, and allowing the LLM to adapt to diverse problem domains while maintaining high accuracy and efficiency. To achieve this, we leverage prompt engineering techniques, particularly the use of generic few-shot examples embedded in prompt templates. These examples guide the LLM in following instructions step by step (chain-of-thought) and reducing errors in generating graph queries with complex syntax.
C.1 Prompt for Majority Voting
At the beginning of each iteration, the LLM Graph Executor uses the following prompt to decide whether the task can be solved with the current KG or whether more information is needed. For system robustness, it is run multiple times with varying reasoning paths, and a majority vote (Self-Consistency) is applied to the responses. The prompt also explicitly instructs the model to decide on either the Solve or the Enhance pathway. By requiring the model to output an indicator (query_type = "RETRIEVE" or "INSERT"), we can programmatically branch the workflow, giving us control over the reasoning pathway.
Graph Executor: Determine the next step
<task> You are a problem solver using a Neo4j database as a knowledge graph to solve a given problem. Note that the database may be incomplete. </task> <instructions> Understand the initial problem, the initial problem nuances, *ALL the existing data* in the database and the tools already called. Can you solve the initial problem using the existing data in the database?
• If you can solve the initial problem with the existing data currently in the database, return the final answer and set the query_type to RETRIEVE. Retrieve only if the data is sufficient to solve the problem in a zero-shot manner.
• If the existing data is insufficient to solve the problem, return why you could not solve the initial problem and what is missing for you to solve it, and set query_type to INSERT.
• Remember that if you don't have ALL the information requested, but only partial (e.g., there are still some calculations needed), you should continue to INSERT more data. </instructions> <examples> <examples_retrieve> <!-- In-context few-shot examples --> </examples_retrieve> <examples_insert> <!-- In-context few-shot examples --> </examples_insert> </examples> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <tool_calls_made> {tool_calls_made} </tool_calls_made>
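The Self-Consistency vote over this prompt's responses can be sketched in a few lines. This is an illustrative stub, assuming `ask_llm` is any callable that returns the model's query_type for one sampled reasoning path; it is not KGoT's actual control loop:

```python
from collections import Counter

def decide_pathway(ask_llm, prompt, num_votes=5):
    """Run the pathway-selection prompt several times and majority-vote
    over the returned query_type (Self-Consistency). `ask_llm` is a
    caller-supplied callable returning "RETRIEVE" or "INSERT"; here it
    stands in for an actual LLM call with nonzero temperature."""
    votes = Counter(ask_llm(prompt) for _ in range(num_votes))
    query_type, _ = votes.most_common(1)[0]
    return query_type  # "RETRIEVE" -> Solve pathway, "INSERT" -> Enhance pathway
```

Because each sample follows a different reasoning path, the majority vote filters out occasional misjudgments of whether the current KG suffices.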
C.2 Prompts for Enhance Pathway
If the majority vote deems the current knowledge base "insufficient", we enter the Enhance pathway. To identify the knowledge gap, the LLM Graph Executor synthesizes the list of reasons why the task is not solvable, and what information is missing, into a single, consistent description.
Graph Executor: Identify missing information
<task> You are a logic expert, your task is to determine why a given problem cannot be solved using the existing data in a Neo4j database. </task> <instructions> You are provided with a list of reasons. Your job is to combine these reasons into a single, coherent paragraph, ensuring that there are no duplicates.
• Carefully review and understand each reason provided.
• Synthesize the reasons into one unified text. </instructions> <list_of_reasons> {list_of_reasons} </list_of_reasons>
By providing both the current graph state and the identified missing information, the LLM Tool Executor defines context-aware tool calls to bridge the knowledge gap identified by the LLM Graph Executor.
Tool Executor: Define tool calls
<task> You are an information retriever tasked with populating a Neo4j database with the necessary information to solve the given initial problem. </task> <instructions> <!-- In-context few-shot examples covering the following aspects: 1. **Understand Requirements** 2. **Gather Information** 3. **Detailed Usage** 4. **Utilize Existing Data** 5. **Avoid Redundant Calls** 6. **Ensure Uniqueness of Tool Calls** 7. **Default Tool** 8. **Do Not Hallucinate** --> </instructions> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <missing_information> {missing_information} </missing_information> <tool_calls_made> {tool_calls_made} </tool_calls_made>
Afterwards, specialized tools such as a web browser or code executor are invoked to retrieve data from external resources. The newly acquired information is then used to enhance the KG: the LLM Graph Executor is asked to analyze the retrieved information in the context of the initial user query and the current state of the KG. The following prompt is carefully designed, with concrete examples, to guide the LLM to generate semantically correct and context-aware Cypher queries.
Graph Executor: Create Cypher for data ingestion
<task> You are a problem solver tasked with updating an incomplete Neo4j database used as a knowledge graph. You have just acquired new information that needs to be integrated into the database. </task> <instructions> <!-- In-context few-shot examples covering the following aspects: 0. **Understand the Context** 1. **Use Provided New Information Only** 2. **No Calculations** 3. **Avoid Duplicates** 4. **Combine Operations with WITH Clauses** 5. **Group Related Queries** 6. **Omit RETURN Statements** 7. **Omit ID Usage** 8. **Merge Existing Nodes** 9. **Correct Syntax and Semantics** 10. **Use Correct Relationships** 11. **Escape Characters** --> </instructions> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <missing_information> {missing_information} </missing_information> <new_information> {new_information} </new_information>
C.3 Prompts for Solve Pathway
If the majority vote confirms that the KG is sufficiently populated, or the maximum iteration count has been reached, the system proceeds to the Solve pathway. The iteratively refined KG serves as a reliable information source for LLMs to solve the initial query. To provide a robust response, we introduce two approaches to knowledge extraction: a query-based approach and Direct Retrieval.
C.3.1 Graph Query Language for Knowledge Extraction
The query-based approach uses an LLM to formulate a read query, given the entire graph state and other relevant information such as the initial problem. The LLM-generated query is then executed on the graph database to return the final solution. Note that KGoT iteratively executes the solve operations collected from the majority voting.
In-context few-shot examples for query-based knowledge extraction
<examples_retrieve> <example_retrieve_1> Initial problem: Retrieve all books written by "J.K. Rowling". Existing entities: Author: [{{name: "J.K. Rowling", author_id: "A1"}, {{name: "George R.R. Martin", author_id: "A2"}}], Book: [{{title: "Harry Potter and the Philosopher's Stone", book_id: "B1"}, {{title: "Harry Potter and the Chamber of Secrets", book_id: "B2"}, {{title: "A Game of Thrones", book_id: "B3"}}] Existing relationships: (A1)-[:WROTE]->(B1), (A1)-[:WROTE]->(B2), (A2)-[:WROTE]->(B3) Solution: query: 'MATCH (a:Author {{name: "J.K. Rowling"}})-[:WROTE]->(b:Book) RETURN b.title AS book_title' query_type: RETRIEVE </example_retrieve_1> <example_retrieve_2> Initial problem: List all colleagues of "Bob". Existing entities: Employee: [{{name: "Alice", employee_id: "E1"}, {{name: "Bob", employee_id: "E2"}, {{name: "Charlie", employee_id: "E3"}}], Department: [{{name: "HR", department_id: "D1"}, {{name: "Engineering", department_id: "D2"}}] Existing relationships: (E1)-[:WORKS_IN]->(D1), (E2)-[:WORKS_IN]->(D1), (E3)-[:WORKS_IN]->(D2) Solution: query: 'MATCH (e:Employee {name: "Bob"})-[:WORKS_IN]->(d:Department)<-[:WORKS_IN]-(colleague:Employee) WHERE colleague.name <> "Bob" RETURN colleague.name AS colleague_name' query_type: RETRIEVE </example_retrieve_2> </examples_retrieve>
If the attempt to fix a previously generated query fails, or the query does not return any results, KGoT tries to regenerate the query from scratch by providing the initial problem statement, the existing data, and additionally the incorrect query.
Graph Executor: Regeneration of Cypher query for data retrieval
<task> You are a problem solver expert in using a Neo4j database as a knowledge graph. Your task is to solve a given problem by generating a correct Cypher query. You will be provided with the initial problem, existing data in the database, and a previous incorrect Cypher query that returned an empty result. Your goal is to create a new Cypher query that returns the correct results. </task> <instructions>
1. Understand the initial problem, the problem nuances and the existing data in the database.
2. Analyze the provided incorrect query to identify why it returned an empty result.
3. Write a new Cypher query to retrieve the necessary data from the database to solve the initial problem. You can use ALL Cypher/Neo4j functionalities.
4. Ensure the new query is accurate and follows correct Cypher syntax and semantics. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <existing_data> {existing_entities_and_relationships} </existing_data> <wrong_query> {wrong_query} </wrong_query>
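The regenerate-on-failure control flow around this prompt can be sketched as follows. This is an illustrative skeleton, assuming caller-supplied `generate_query` (an LLM call that receives the previous failing query, or None on the first attempt) and `run_query` (execution against the graph database); the names and the retry budget are hypothetical:

```python
def solve_with_retries(generate_query, run_query, max_attempts=3):
    """Execute an LLM-generated read query; if it errors out or returns
    an empty result, feed the failing query back to the LLM and
    regenerate from scratch (sketch of the control flow, not KGoT's
    exact implementation)."""
    wrong_query = None
    for _ in range(max_attempts):
        query = generate_query(wrong_query)  # None on the first attempt
        try:
            rows = run_query(query)
        except Exception:
            rows = []  # treat a failing query like an empty result here
        if rows:
            return rows
        wrong_query = query  # pass the failing query into the next prompt
    return []
```

Passing the failing query into the `<wrong_query>` slot gives the model a concrete negative example to reason about, rather than asking it to guess blindly what went wrong.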
C.3.2 Direct Retrieval for Knowledge Extraction
Direct Retrieval refers to directly asking the LLM to formulate the final solution, given the entire graph state, without executing any LLM-generated read queries on the graph storage.
In-context few-shot examples for DR-based knowledge extraction
<examples_retrieve> <example_retrieve_1> Initial problem: Retrieve all books written by "J.K. Rowling". Existing entities: Author: [{{name: "J.K. Rowling", author_id: "A1"}, {{name: "George R.R. Martin", author_id: "A2"}}], Book: [{{title: "Harry Potter and the Philosopher's Stone", book_id: "B1"}, {{title: "Harry Potter and the Chamber of Secrets", book_id: "B2"}, {{title: "A Game of Thrones", book_id: "B3"}}] Existing relationships: (A1)-[:WROTE]->(B1), (A1)-[:WROTE]->(B2), (A2)-[:WROTE]->(B3) Solution: query: "Harry Potter and the Philosopher's Stone, Harry Potter and the Chamber of Secrets" query_type: RETRIEVE </example_retrieve_1> <example_retrieve_2> Initial problem: List all colleagues of "Bob". Existing entities: Employee: [{{name: "Alice", employee_id: "E1"}, {{name: "Bob", employee_id: "E2"}, {{name: "Charlie", employee_id: "E3"}}], Department: [{{name: "HR", department_id: "D1"}, {{name: "Engineering", department_id: "D2"}}] Existing relationships: (E1)-[:WORKS_IN]->(D1), (E2)-[:WORKS_IN]->(D1), (E3)-[:WORKS_IN]->(D2) Solution: query: "Alice" query_type: RETRIEVE </example_retrieve_2> </examples_retrieve>
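Direct Retrieval requires serializing the entire graph state into the prompt's `<existing_data>` slot, in a textual form like the "Existing entities / Existing relationships" layout used in the examples above. The helper below is a hypothetical sketch of such a serialization; the exact format KGoT uses may differ:

```python
def serialize_graph(entities, relationships):
    """Render the whole graph state as plain text for the <existing_data>
    slot of the Direct Retrieval prompt (illustrative formatting).
    `entities` maps a label to a list of property dicts; `relationships`
    is a list of (source_id, relation, target_id) triples."""
    lines = ["Existing entities:"]
    for label, nodes in entities.items():
        lines.append(f"  {label}: {nodes}")
    lines.append("Existing relationships:")
    for src, rel, dst in relationships:
        lines.append(f"  ({src})-[:{rel}]->({dst})")
    return "\n".join(lines)
```

Since the LLM never executes a query in this mode, the fidelity of this serialization is what grounds the model's answer: everything it may cite must appear in the rendered text.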
C.3.3 Formatting Final Solution
After successful knowledge extraction from the KG, we obtain a partial answer to our initial query. Next, we examine if further post-processing, such as intermediate calculation or formatting, needs to be performed. In the following prompt, we first detect if any unresolved calculation is required.
Solution formatting: Examine need for mathematical processing
<task> You are an expert in identifying the need for mathematical or probabilistic calculations in problem-solving scenarios. Given an initial query and a partial solution, your task is to determine whether the partial solution requires further mathematical or probabilistic calculations to arrive at a complete solution. You will return a boolean value: True if additional calculations are needed and False if they are not. </task> <instructions>
• Analyze the initial query and the provided partial solution.
• Identify any elements in the query and partial solution that suggest the further need for numerical analysis, calculations, or probabilistic reasoning.
• Consider if the partial solution includes all necessary numerical results or if there are unresolved numerical aspects.
• Return True if the completion of the solution requires more calculations, otherwise return False.
• Focus on the necessity for calculations rather than the nature of the math or probability involved. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <partial_solution> {partial_solution} </partial_solution>
If any further mathematical processing is needed, the Python Code Tool is invoked to refine the current partial solution by executing an LLM-generated Python script. This ensures accuracy by leveraging the strength of LLMs in scripting. Moreover, it effectively avoids hallucinations by grounding outputs through verifiable and deterministic code computation.
Solution formatting: Apply additional mathematical processing
<task> You are a math and Python expert tasked with solving a mathematical problem. </task> <instructions> To complete this task, follow these steps:
1. **Understand the Problem**:
• Carefully read and understand the initial problem and the partial solution.
• Elaborate on any mathematical calculations from the partial solution that are required to solve the initial problem.
2. **Perform Calculations**:
• Use the run_python_code Tool to perform any necessary mathematical calculations.
• Craft Python code that accurately calculates the required values based on the partial solution and the initial problem.
• Remember to add print statements to display the reasoning behind the calculations.
• **ALWAYS** add a print statement for the final answer.
3. **Do Not Hallucinate**:
• **Do not invent information** that is not provided in the initial problem or the partial solution.
• **Do not perform calculations manually**; use the run_python_code Tool for all mathematical operations. </instructions> <initial_problem> {initial_query} </initial_problem> <partial_solution> {current_solution} </partial_solution>
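Because the prompt requires the generated script to print its final answer, the executing side only needs to run the snippet and capture standard output. A minimal sketch of such an execution step is shown below; the real Python Code Tool additionally installs declared dependencies and isolates execution, which this illustration omits:

```python
import contextlib
import io

def run_python_code(code):
    """Execute an LLM-generated snippet and capture what it prints
    (minimal sketch; the actual tool also handles required packages
    and sandboxing). The prompt above mandates print() for the answer,
    so captured stdout carries the result."""
    buffer = io.StringIO()
    namespace = {}  # fresh namespace so snippets do not leak state
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue().strip()
```

Grounding the arithmetic in executed code rather than in the LLM's token-level "mental math" is what makes the result verifiable and deterministic.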
To produce a single, consistent answer and format the final solution to the initial user query, we guide the LLM with a dedicated prompt.
Solution formatting: Parse the final solution
<task> You are a formatter and extractor. Your task is to combine the partial solutions from a database and format them according to the initial problem statement. </task> <instructions>
1. Understand the initial problem, the problem nuances, the desired output, and the desired output format.
2. Review the provided partial solution.
3. Integrate and elaborate on the various pieces of information from the partial solution to produce a complete solution to the initial problem. Do not invent any new information.
4. Your final answer should be a number OR as few words as possible OR a comma separated list of numbers and/or strings.
5. ADDITIONALLY, your final answer MUST adhere to any formatting instructions specified in the original question (e.g., alphabetization, sequencing, units, rounding, decimal places, etc.).
6. If you are asked for a number, express it numerically (i.e., with digits rather than words), don't use commas, do not round the number unless directly specified, and DO NOT INCLUDE UNITS such as $ or USD or percent signs unless specified otherwise.
7. If you are asked for a string, don't use articles or abbreviations (e.g. for cities), unless specified otherwise. Don't output any final sentence punctuation such as ".", "!", or "?".
8. If you are asked for a comma separated list, apply the above rules depending on whether the elements are numbers or strings. </instructions> <examples> <!-- In-context few-shot examples --> </examples> <initial_problem> {initial_query} </initial_problem> <given_partial_solution> {partial_solution} </given_partial_solution>
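The number and string rules above (items 6 and 7) are mechanical enough that a deterministic post-processor could enforce them. The helper below is purely illustrative (KGoT delegates this formatting to the LLM prompt, and this sketch covers only a subset of the rules):

```python
def normalize_final_answer(answer, kind):
    """Apply a subset of the formatting rules to a raw answer string
    (illustrative helper, not part of KGoT). `kind` is "number" or
    "string"."""
    answer = answer.strip()
    if kind == "number":
        # digits only: strip thousands separators, currency and percent signs
        return answer.replace(",", "").replace("$", "").replace("%", "").strip()
    # strings: drop a leading article and final sentence punctuation
    for article in ("the ", "a ", "an "):
        if answer.lower().startswith(article):
            answer = answer[len(article):]
            break
    return answer.rstrip(".!?")
```

Such normalization matters for GAIA, whose scoring is an exact string match: "1,234" versus "1234" or "The Hague." versus "Hague" is the difference between a failed and a solved task.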
C.4 Prompt for LLM-Generated Syntax Error
To handle LLM-generated syntax errors, a retry mechanism uses the LLM to reformulate the graph query or code snippet, guided by specialized prompts tailored to the execution context. For Python code, the prompt guides the model to fix the code and, if needed, update the dependencies, ensuring successful execution.
Error handling: Fix invalid Python code
<task> You are an expert Python programmer. You will be provided with a block of Python code, a list of required packages, and an error message that occurred during code execution. Your task is to fix the code so that it runs successfully and provide an updated list of required packages if necessary. </task> <instructions>
1. Carefully analyze the provided Python code and the error message.
2. Identify the root cause of the error.
3. Modify the code to resolve the error.
4. Update the list of required packages if any additional packages are needed.
5. Ensure that the fixed code adheres to best practices where possible. </instructions> <rules>
• You must return both the fixed Python code and the updated list of required packages.
• Ensure the code and package list are in proper format. </rules> <examples> <!-- In-context few-shot examples --> </examples> <code> {code} </code> <required_modules> {required_modules} </required_modules> <error> {error} </error>
For Cypher queries, the prompt helps the model diagnose syntax or escaping issues based on the error log and returns a corrected version.
Error handling: Fix invalid Cypher query
<task> You are a Cypher expert, and you need to fix the syntax and semantics of a given incorrect Cypher query. </task> <instructions> Given the incorrect Cypher and the error log:
1. Understand the source of the error (especially look out for wrongly escaped/not escaped characters).
2. Correct the Cypher query.
3. Return the corrected Cypher query. </instructions> <wrong_cypher> {cypher_to_fix} </wrong_cypher> <error_log> {error_log} </error_log>
Both prompts are reusable across pathways and enforce minimal, well-scoped corrections grounded in the provided error context.
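The shared shape of both error-handling paths is a small retry loop: execute, and on failure hand the snippet plus the error message to the appropriate fixing prompt. The following is an illustrative skeleton with caller-supplied callables (`execute` runs the Cypher or Python snippet, `fix_with_llm` wraps one of the prompts above); the names and retry budget are hypothetical:

```python
def execute_with_llm_fix(snippet, execute, fix_with_llm, max_retries=3):
    """Generic retry loop shared by both error-handling prompts: run the
    snippet, and on failure hand the snippet plus the error message to
    the LLM for a minimal, well-scoped correction (sketch only)."""
    for _ in range(max_retries):
        try:
            return execute(snippet)
        except Exception as err:
            # the specialized prompt receives the failing snippet + error log
            snippet = fix_with_llm(snippet, str(err))
    return execute(snippet)  # final attempt; let any remaining error propagate
```

Keeping the loop generic means the same control flow serves both execution contexts; only the fixing prompt differs.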
Appendix D Additional Results
We also plot the results from Figure 3 as a Pareto front in Figure 11.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Scatter Plot: Performance Comparison of Knowledge Graph Enhanced LLMs
### Overview
This scatter plot compares the performance of various Large Language Models (LLMs) and knowledge graph integration techniques. The x-axis represents the total cost in dollars, while the y-axis represents the number of failed tasks. Lower values on both axes indicate better performance. Different marker shapes and colors are used to distinguish between different categories of models and approaches.
### Components/Axes
* **X-axis:** Total Cost ($) - (the lower the better). Scale ranges from approximately 0.00 to 10.00, with increments of 2.00.
* **Y-axis:** Number of Failed Tasks (the lower the better). Scale ranges from approximately 90.00 to 150.00, with increments of 10.00.
* **Legend:** Located in the bottom-left corner.
* **Star (★):** KGoT (fusion)
* **Star (★):** KGoT
* **Circle (O):** Baselines
* **Diamond (◆):** Zero-Shot
### Detailed Analysis
The plot displays data points representing the performance of different models. Here's a breakdown of the approximate coordinates for each data point, cross-referenced with the legend:
* **GPT-4o mini (Diamond):** Approximately (0.8, 142).
* **GPTSwarm (Diamond):** Approximately (1.2, 140).
* **GPT-4o (Diamond):** Approximately (1.8, 144).
* **RDF4J (Query) (Star):** Approximately (3.6, 132).
* **Neo4j (Query) (Star):** Approximately (4.0, 125).
* **KGoT (fusion) (Star):** Approximately (4.4, 108).
* **KGoT (Star):** Approximately (5.0, 118).
* **Simple RAG (Circle):** Approximately (5.6, 135).
* **Neo4j (DR) (Circle):** Approximately (5.8, 128).
* **NetworkX (DR) (Circle):** Approximately (6.0, 122).
* **NetworkX (Query) (Circle):** Approximately (6.2, 120).
* **Neo4j (Query + DR) (Star):** Approximately (6.2, 112).
* **NetworkX (Query + DR) (Circle):** Approximately (6.6, 110).
* **Neo4j + NetworkX (Query + DR) (Circle):** Approximately (9.6, 95).
* **HF Agents (GPT-4o mini) (Circle):** Approximately (8.6, 138).
* **GraphRAG (Circle):** Approximately (7.6, 142).
**Trends:**
* The "Zero-Shot" models (diamonds) generally exhibit higher numbers of failed tasks for relatively low costs.
* The "KGoT" models (stars) show a trend of lower failed tasks with increasing cost.
* The "Baseline" models (circles) are spread across the cost and failed task spectrum.
* The combination of Neo4j and NetworkX (Query + DR) appears to achieve the lowest number of failed tasks, but at a higher cost.
### Key Observations
* **Neo4j + NetworkX (Query + DR)** stands out as the best performer, achieving the lowest number of failed tasks (approximately 95) at a cost of around $9.6.
* **GPT-4o mini** and **GPTSwarm** are relatively inexpensive but have a higher number of failed tasks (around 142 and 140 respectively).
* There's a noticeable cluster of models around the $5-7 cost range with varying numbers of failed tasks.
* The spread of data points suggests a trade-off between cost and performance.
### Interpretation
The data suggests that integrating knowledge graphs with LLMs can significantly improve performance (reduce failed tasks), but often at a higher cost. The combination of Neo4j and NetworkX (Query + DR) appears to be the most effective approach, indicating that leveraging both query-based and Direct Retrieval (DR) techniques yields the best results. The "Zero-Shot" models, while inexpensive, are less reliable. The plot highlights the importance of considering the cost-benefit trade-off when selecting an LLM and knowledge graph integration strategy. The outliers, such as Neo4j + NetworkX, suggest that specific combinations of techniques can lead to substantial performance gains. The data also implies that simply adding a knowledge graph isn't enough; the *way* it's integrated (e.g., query vs. DR) matters significantly.
</details>
Figure 11: Pareto front plot of cost and error counts. We report results for answering 165 GAIA validation questions across different comparison targets, using the GPT-4o mini model with each baseline. For the Zero-Shot inference, we also include results for GPT-4o for comparison. Please note that we omit the results for Magentic-One and HF Agents (GPT-4o) as their high costs would heavily disturb the plot. DR means Direct Retrieval.
We also plot the relative improvements of KGoT over Hugging Face Agents and GPTSwarm respectively in Figure 12, which is based on the results shown in Figure 5.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Bar Chart: Tasks Improved with KGoT Compared to HF Agents
### Overview
This is a vertical bar chart comparing the number of tasks improved with KGoT (Knowledge Graph of Thoughts) compared to HF (Hugging Face) Agents, across several different models. The y-axis represents the number of tasks improved, and the x-axis lists the model names. Each bar is labeled with the numerical improvement. A horizontal dashed line indicates the arithmetic mean of the improvements.
### Components/Axes
* **Y-axis Title:** "Tasks Improved with KGoT (compared to HF Agents)" - Scale ranges from 0 to 8, with increments of 1.
* **X-axis Labels:** Model names: "Qwen2.5-32B", "DeepSeek-R1-70B", "GPT-4o mini", "DeepSeek-R1-32B", "QwQ-32B", "DeepSeek-R1-7B", "DeepSeek-R1-1.5B", "Qwen2.5-72B", "Qwen2.5-7B", "Qwen2.5-1.5B".
* **Horizontal Line:** "Arithmetic Mean: +3.3" - A dashed grey line at approximately y = 3.3.
* **Bar Colors:** The bars are predominantly a shade of green, with the last three bars being a lighter grey.
* **Bar Labels:** Each bar is labeled with a numerical value indicating the improvement.
### Detailed Analysis
The chart displays the following data points:
* **Qwen2.5-32B:** +7 tasks improved. (Dark Green)
* **DeepSeek-R1-70B:** +6 tasks improved. (Dark Green)
* **GPT-4o mini:** +5 tasks improved. (Dark Green)
* **DeepSeek-R1-32B:** +4 tasks improved. (Dark Green)
* **QwQ-32B:** +4 tasks improved. (Dark Green)
* **DeepSeek-R1-7B:** +3 tasks improved. (Dark Green)
* **DeepSeek-R1-1.5B:** +2 tasks improved. (Light Grey)
* **Qwen2.5-72B:** +1 task improved. (Light Grey)
* **Qwen2.5-7B:** +1 task improved. (Light Grey)
* **Qwen2.5-1.5B:** 0 tasks improved. (Light Grey)
The bars generally decrease in height from left to right, with a noticeable shift in color from dark green to light grey around the "DeepSeek-R1-1.5B" model. The trend is a decreasing number of tasks improved as you move from left to right across the models.
### Key Observations
* The models Qwen2.5-32B, DeepSeek-R1-70B, and GPT-4o mini show the highest improvements with KGoT.
* The models Qwen2.5-1.5B, Qwen2.5-7B, and Qwen2.5-72B show minimal or no improvement with KGoT.
* The arithmetic mean of +3.3 provides a baseline for comparison. Most models outperform this mean, while the last three underperform.
* There is a clear distinction between the models that benefit significantly from KGoT (dark green) and those that do not (light grey).
### Interpretation
The data suggests that KGoT is more effective for certain models than others. Larger models (Qwen2.5-32B, DeepSeek-R1-70B) appear to benefit the most from KGoT, while the remaining Qwen2.5 variants (Qwen2.5-1.5B, Qwen2.5-7B, Qwen2.5-72B) show little to no improvement. This could indicate that KGoT is particularly useful for models with a larger capacity to leverage the knowledge graph information. The shift in bar color likely signifies a threshold or categorization of model performance with KGoT. The arithmetic mean provides a useful reference point, highlighting which models are above or below average in terms of improvement. The data implies that KGoT is not a universally beneficial technique and its effectiveness is dependent on the underlying model architecture and size.
</details>
(a) Hugging Face Agents
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: Tasks Improved with KGoT (compared to GPTSwarm)
### Overview
This bar chart visualizes the improvement in the number of tasks completed when using KGoT compared to GPTSwarm, for various language models. The y-axis represents the number of tasks improved, with positive values indicating improvement and negative values indicating a decrease. The x-axis lists the different language models being compared. A horizontal dashed line indicates the arithmetic mean of the improvements.
### Components/Axes
* **Y-axis Title:** "Tasks Improved with KGoT (compared to GPTSwarm)" - Scale ranges from approximately -5 to 22.
* **X-axis Labels:** "Qwen2.5-32B", "DeepSeek-R1-70B", "GPT-4o mini", "DeepSeek-R1-32B", "QwQ-32B", "DeepSeek-R1-7B", "DeepSeek-R1-1.5B", "Qwen2.5-72B", "Qwen2.5-7B", "Qwen2.5-1.5B".
* **Horizontal Line:** "Arithmetic Mean: +7.5" - A dashed grey line at approximately y = 7.5.
* **Bar Colors:** Green bars indicate positive improvement, while red bars indicate a decrease in tasks improved.
### Detailed Analysis
The chart displays the following data points:
* **Qwen2.5-32B:** -3 (Red bar)
* **DeepSeek-R1-70B:** +12 (Green bar)
* **GPT-4o mini:** +14 (Green bar)
* **DeepSeek-R1-32B:** +15 (Green bar)
* **QwQ-32B:** +20 (Green bar)
* **DeepSeek-R1-7B:** +4 (Green bar)
* **DeepSeek-R1-1.5B:** +2 (Green bar)
* **Qwen2.5-72B:** +12 (Green bar)
* **Qwen2.5-7B:** 0 (Grey bar)
* **Qwen2.5-1.5B:** -1 (Red bar)
**Trends:**
* The majority of the language models show a positive improvement in tasks completed when using KGoT compared to GPTSwarm.
* The improvements range from a decrease of 3 tasks (Qwen2.5-32B) to an increase of 20 tasks (QwQ-32B).
* The DeepSeek-R1 models consistently show significant improvements.
* Qwen2.5-7B shows no improvement (0).
* Qwen2.5-32B and Qwen2.5-1.5B show a decrease in tasks improved.
### Key Observations
* QwQ-32B demonstrates the largest improvement (+20 tasks).
* Qwen2.5-32B shows the largest decrease (-3 tasks).
* The arithmetic mean of +7.5 suggests that, on average, KGoT improves task completion across these models.
* The spread of the data points is relatively wide, indicating varying degrees of benefit from KGoT depending on the model.
### Interpretation
The data suggests that KGOT is generally beneficial for improving task completion when compared to GPT-Swarm, as evidenced by the positive arithmetic mean and the prevalence of green bars. However, the effectiveness of KGOT varies significantly across different language models. The substantial improvement observed with QwQ-32B suggests that KGOT may be particularly well-suited for this model's architecture or training data. Conversely, the decrease in performance with Qwen2.5-32B and Qwen2.5-1.5B indicates that KGOT may not be universally applicable and could even be detrimental in certain cases. Further investigation is needed to understand the factors that contribute to these variations and to optimize the use of KGOT for different language models. The fact that the DeepSeek-R1 models consistently perform well with KGOT suggests a potential synergy between the two technologies.
</details>
(b) GPTSwarm
Figure 12: Relative improvement of KGoT over Hugging Face Agents (left) and GPTSwarm (right) on the GAIA validation set using various LLMs.
Table 2: Comparison of KGoT with other current state-of-the-art open-source agents on the GAIA benchmark. We provide both the absolute (number of solved tasks) and relative (percentage) results. The baseline data on the test set is obtained through the leaderboard. We highlight the best performing scheme in a given category in bold. The validation set consists of 165 tasks in total (53 in level 1, 86 in level 2 and 26 in level 3), whereas the test set contains 301 tasks (93 in level 1, 159 in level 2 and 49 in level 3). DR stands for Direct Retrieval.
| | | Absolute | | | | Relative | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Agents | Model | All | L1 | L2 | L3 | All | L1 | L2 | L3 |
| Test Set | | | | | | | | | |
| GPTSwarm | GPT-4o mini | 33 | 15 | 15 | 3 | 10.96 | 16.13 | 9.43 | 6.12 |
| Magentic-One | GPT-4o mini | 43 | 22 | 18 | 3 | 14.29 | 23.66 | 11.32 | 6.12 |
| TapeAgent | GPT-4o mini | 66 | 28 | 35 | 3 | 21.93 | 30.11 | 22.01 | 6.12 |
| Hugging Face Agents | GPT-4o mini | 68 | 30 | 34 | 4 | 22.59 | 32.26 | 21.38 | 8.16 |
| KGoT (fusion) | GPT-4o mini | 73 | 33 | 36 | 4 | 24.25 | 35.48 | 22.64 | 8.16 |
| Validation Set | | | | | | | | | |
| Simple RAG | GPT-4o mini | 35 | 18 | 15 | 2 | 21.21 | 33.96 | 17.44 | 7.69 |
| GraphRAG | GPT-4o mini | 23 | 10 | 13 | 0 | 13.94 | 18.87 | 15.12 | 0.00 |
| Magentic-One | GPT-4o mini | 31 | 13 | 18 | 0 | 18.79 | 24.53 | 20.93 | 0.00 |
| No KG (Single Run #1) | GPT-4o mini | 30 | 14 | 14 | 2 | 18.18 | 26.42 | 16.28 | 7.69 |
| No KG (Single Run #2) | GPT-4o mini | 33 | 17 | 16 | 0 | 20.00 | 32.08 | 18.60 | 0.00 |
| No KG (Fusion) | GPT-4o mini | 40 | 18 | 20 | 2 | 24.24 | 33.96 | 23.26 | 7.69 |
| KGoT (Neo4j + DR) | GPT-4o mini | 40 | 21 | 16 | 3 | 24.24 | 39.62 | 18.60 | 11.54 |
| KGoT (NetworkX + Query) | GPT-4o mini | 44 | 21 | 21 | 2 | 26.67 | 39.62 | 24.42 | 7.69 |
| KGoT (NetworkX + DR) | GPT-4o mini | 40 | 20 | 18 | 2 | 24.24 | 37.74 | 20.93 | 7.69 |
| KGoT (RDF4J + Query) | GPT-4o mini | 36 | 20 | 15 | 1 | 21.82 | 37.74 | 17.44 | 3.85 |
| KGoT (fusion) (Neo4j; Query + DR) | GPT-4o mini | 57 | 29 | 24 | 4 | 34.55 | 54.72 | 27.91 | 15.38 |
| KGoT (fusion) (NetworkX; Query + DR) | GPT-4o mini | 57 | 27 | 28 | 2 | 34.55 | 50.94 | 32.56 | 7.69 |
| KGoT (fusion) (Neo4j + NetworkX; Query + DR) | GPT-4o mini | 71 | 34 | 33 | 4 | 43.03 | 64.15 | 38.37 | 15.38 |
| Zero-Shot | GPT-4o mini | 17 | 4 | 13 | 0 | 10.30 | 7.55 | 15.12 | 0.00 |
| Zero-Shot | GPT-4o | 29 | 10 | 17 | 2 | 17.58 | 18.87 | 19.77 | 7.69 |
| Zero-Shot | Qwen2.5-1.5B | 3 | 2 | 1 | 0 | 1.81 | 3.77 | 1.16 | 0.00 |
| Zero-Shot | Qwen2.5-7B | 9 | 4 | 5 | 0 | 5.45 | 7.55 | 5.81 | 0.00 |
| Zero-Shot | Qwen2.5-32B | 15 | 7 | 8 | 0 | 9.09 | 13.21 | 9.30 | 0.00 |
| Zero-Shot | Qwen2.5-72B | 19 | 6 | 13 | 0 | 11.52 | 11.32 | 15.12 | 0.00 |
| Zero-Shot | QwQ-32B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Zero-Shot | DeepSeek-R1-1.5B | 5 | 3 | 2 | 0 | 3.03 | 5.66 | 2.33 | 0.00 |
| Zero-Shot | DeepSeek-R1-7B | 13 | 8 | 5 | 0 | 7.88 | 15.09 | 5.81 | 0.00 |
| Zero-Shot | DeepSeek-R1-32B | 14 | 8 | 6 | 0 | 8.48 | 15.09 | 6.98 | 0.00 |
| Zero-Shot | DeepSeek-R1-70B | 20 | 9 | 10 | 1 | 12.12 | 16.98 | 11.63 | 3.85 |
| GPTSwarm | GPT-4o mini | 26 | 13 | 13 | 0 | 15.76 | 24.53 | 15.12 | 0.00 |
| GPTSwarm | Qwen2.5-1.5B | 5 | 4 | 1 | 0 | 3.03 | 7.55 | 1.16 | 0.00 |
| GPTSwarm | Qwen2.5-7B | 12 | 8 | 4 | 0 | 7.27 | 15.09 | 4.65 | 0.00 |
| GPTSwarm | Qwen2.5-32B | 29 | 15 | 14 | 0 | 17.58 | 28.30 | 16.28 | 0.00 |
| GPTSwarm | Qwen2.5-72B | 27 | 13 | 14 | 0 | 16.36 | 24.53 | 16.28 | 0.00 |
| GPTSwarm | QwQ-32B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPTSwarm | DeepSeek-R1-1.5B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPTSwarm | DeepSeek-R1-7B | 2 | 0 | 2 | 0 | 1.21 | 0.00 | 2.33 | 0.00 |
| GPTSwarm | DeepSeek-R1-32B | 6 | 3 | 3 | 0 | 3.64 | 5.66 | 3.49 | 0.00 |
| GPTSwarm | DeepSeek-R1-70B | 10 | 5 | 5 | 0 | 6.06 | 9.43 | 5.81 | 0.00 |
| Hugging Face Agents | GPT-4o mini | 35 | 14 | 20 | 1 | 21.21 | 26.42 | 23.26 | 3.85 |
| Hugging Face Agents | GPT-4o | 55 | 22 | 31 | 2 | 33.33 | 41.51 | 36.05 | 7.69 |
| Hugging Face Agents | Qwen2.5-1.5B | 4 | 2 | 2 | 0 | 2.42 | 3.77 | 2.33 | 0.00 |
| Hugging Face Agents | Qwen2.5-7B | 11 | 7 | 4 | 0 | 6.66 | 13.21 | 4.65 | 0.00 |
| Hugging Face Agents | Qwen2.5-32B | 19 | 10 | 9 | 0 | 11.52 | 18.87 | 11.63 | 0.00 |
| Hugging Face Agents | Qwen2.5-72B | 38 | 16 | 22 | 0 | 23.03 | 30.19 | 25.58 | 0.00 |
| Hugging Face Agents | QwQ-32B | 16 | 9 | 7 | 0 | 9.70 | 16.98 | 8.14 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-1.5B | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-7B | 3 | 2 | 1 | 0 | 1.81 | 3.77 | 1.16 | 0.00 |
| Hugging Face Agents | DeepSeek-R1-32B | 17 | 9 | 7 | 1 | 10.30 | 16.98 | 8.14 | 3.85 |
| Hugging Face Agents | DeepSeek-R1-70B | 16 | 9 | 6 | 1 | 9.70 | 16.98 | 6.98 | 3.85 |
| KGoT (Neo4j + Query) | GPT-4o mini | 40 | 21 | 18 | 1 | 24.24 | 39.62 | 20.93 | 3.85 |
| KGoT (Neo4j + Query) | Qwen2.5-1.5B | 4 | 3 | 1 | 0 | 2.42 | 5.66 | 1.16 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-7B | 12 | 7 | 5 | 0 | 7.27 | 13.21 | 5.81 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-32B | 26 | 12 | 14 | 0 | 15.76 | 22.64 | 16.28 | 0.00 |
| KGoT (Neo4j + Query) | Qwen2.5-72B | 39 | 18 | 21 | 0 | 23.64 | 33.96 | 24.42 | 0.00 |
| KGoT (Neo4j + Query) | QwQ-32B | 20 | 11 | 9 | 0 | 12.12 | 20.75 | 10.47 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-1.5B | 2 | 1 | 1 | 0 | 1.21 | 1.89 | 1.16 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-7B | 6 | 3 | 3 | 0 | 3.64 | 5.66 | 3.49 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-32B | 21 | 12 | 9 | 0 | 12.73 | 22.64 | 10.47 | 0.00 |
| KGoT (Neo4j + Query) | DeepSeek-R1-70B | 22 | 11 | 10 | 1 | 13.33 | 20.75 | 11.63 | 3.85 |
D.1 SimpleQA Results
Table 3: Comparison of KGoT, HF Agents and GPTSwarm on a subset of SimpleQA, as well as the results for KGoT on the full benchmark. We highlight the best performing scheme in a given category in bold. Model: GPT-4o mini.
| Framework | Correct (%) | Not attempted (%) | Incorrect (%) | Correct given attempted (%) | F-score | Total cost ($) | Cost per solved task ($) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPTSwarm | 53.8106 | 6.2356 | 39.9538 | 57.3892 | 55.5 | 0.2159 | 0.00092660 |
| HF Agents | 66.0508 | 18.0139 | 15.9353 | 80.5634 | 72.6 | 16.7117 | 0.05843265 |
| KGoT | 73.2102 | 1.6166 | 25.1732 | 74.4131 | 73.8 | 5.6432 | 0.01780182 |
| KGoT (Full) | 70.3421 | 2.0342 | 27.8548 | 71.8027 | 71.1 | 59.1538 | 0.01943931 |
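The F-score values in Table 3 are consistent with the harmonic mean of the "Correct" and "Correct given attempted" percentages, which is how the SimpleQA F-score is commonly computed. A minimal sketch (the function name is ours):

```python
def simpleqa_f_score(correct_pct: float, correct_given_attempted_pct: float) -> float:
    """Harmonic mean of overall accuracy and accuracy restricted to
    attempted questions, matching the F-score column in Table 3."""
    return (2 * correct_pct * correct_given_attempted_pct
            / (correct_pct + correct_given_attempted_pct))
```

For example, the KGoT row yields `simpleqa_f_score(73.2102, 74.4131)`, which is approximately 73.8, matching the table.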
Table 4: F1-score comparison of KGoT, OpenAI and Claude models on SimpleQA. OpenAI and Claude results were taken from the official repository (OpenAI, 2025). Model for KGoT: GPT-4o mini.
| Reasoning Models | F1-score | Assistant Models | F1-score |
| --- | --- | --- | --- |
| o1 | 42.6 | gpt-4.1-2025-04-14 | 41.6 |
| o1-preview | 42.4 | gpt-4.1-mini-2025-04-14 | 16.8 |
| o3-high | 48.6 | gpt-4.1-nano-2025-04-14 | 7.6 |
| o3 | 49.4 | gpt-4o-2024-11-20 | 38.8 |
| o3-low | 49.4 | gpt-4o-2024-08-06 | 40.1 |
| o1-mini | 7.6 | gpt-4o-2024-05-13 | 39.0 |
| o3-mini-high | 13.8 | gpt-4o-mini-2024-07-18 | 9.5 |
| o3-mini | 13.4 | gpt-4.5-preview-2025-02-27 | 62.5 |
| o3-mini-low | 13.0 | gpt-4-turbo-2024-04-09 | 24.2 |
| o4-mini-high | 19.3 | Claude 3.5 Sonnet | 28.9 |
| o4-mini | 20.2 | Claude 3 Opus | 23.5 |
| o4-mini-low | 20.2 | | |
| KGoT | 71.1 | | |
D.2 Impact from Various Design Decisions
Table 5: Analysis of different design decisions and tool sets in KGoT. "ST" stands for the type of the solve operation and pathway ("GQ": graph query, "DR": Direct Retrieval), "PF" for the prompt format ("MD": Markdown) and "merged" stands for a combination of the original KGoT tools and the Hugging Face Agents tools.
| Configuration | Metrics | | | | |
| --- | --- | --- | --- | --- | --- |
| Tools | ST | PF | Solved | Time (h) | Cost |
| HF | DR | XML | 37 | 11.87 | $7.84 |
| HF | GQ | MD | 33 | 9.70 | $4.28 |
| merged | GQ | XML | 31 | 10.62 | $5.43 |
| HF | GQ | XML | 30 | 13.02 | $4.90 |
| original KGoT | GQ | XML | 27 | 27.57 | $6.85 |
We explored different tool sets, with selected results presented in Table 5. Initially, we examined the limitations of our original tools and subsequently integrated the complete Hugging Face Agents tool set into the KGoT framework, which led to improvements in accuracy, runtime, and cost efficiency. A detailed analysis allowed us to merge the most effective components from both tool sets into an optimized hybrid tool set, further enhancing accuracy and runtime while only moderately increasing costs. Key improvements include a tighter integration between the ExtractZip tool and the Text Inspector tool, which now supports Markdown, as well as enhancements to the Surfer Agent, incorporating a Wikipedia tool and augmenting viewpoint segmentation with full-page summarization. This optimized tool set was used for all subsequent experiments.
We further evaluated different prompt formats in the initial iterations of KGoT. While our primary format was XML-based, we conducted additional tests using Markdown. Initial experiments with the Hugging Face Agents tool set (see Table 5) combined with Markdown and GPT-4o mini yielded improved accuracy, reduced runtime, and lower costs. However, these results were not consistently reproducible with GPT-4o. Moreover, Markdown-based prompts interfered with optimizations such as Direct Retrieval, ultimately leading us to retain the XML-based format.
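To illustrate the two prompt formats compared above, the same tool-call instruction can be rendered as XML or Markdown. This is a hedged sketch; the tag and heading names are our own illustration, not the exact KGoT prompt templates:

```python
def render_tool_prompt(tool: str, args: dict, fmt: str = "xml") -> str:
    """Render a tool-call instruction in XML or Markdown.
    Illustrative only: tag and heading names are assumptions."""
    if fmt == "xml":
        arg_lines = "\n".join(
            f'  <arg name="{k}">{v}</arg>' for k, v in args.items()
        )
        return f"<tool_call>\n  <name>{tool}</name>\n{arg_lines}\n</tool_call>"
    # Markdown rendering of the same call
    arg_lines = "\n".join(f"- **{k}**: {v}" for k, v in args.items())
    return f"### Tool call: {tool}\n{arg_lines}"
```

Both renderings carry identical information; the observed accuracy and cost differences stem from how reliably a given model parses and reproduces each format.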
<details>
<summary>x23.png Details</summary>

### Visual Description
## Stacked Bar Chart: Task Solving Performance Comparison
### Overview
This image presents a stacked bar chart comparing the number of solved tasks across different combinations of technologies: Neo4j, NetworkX, and their respective approaches (Query + DR, Query only, DR only). The chart categorizes the solved tasks into three levels (Level 1, Level 2, and Level 3).
### Components/Axes
* **X-axis:** Represents the different technology combinations:
* Neo4j (Query + DR)
* NetworkX (Query + DR)
* NetworkX + Neo4j (with Query only)
* NetworkX + Neo4j (with DR only)
* Neo4j + NetworkX (Query + DR)
* **Y-axis:** Represents the "Number of Solved Tasks", ranging from 0 to 80.
* **Legend:** Located at the top-left corner, defines the color coding for the task levels:
* Level 1: Light Green (#90EE90)
* Level 2: Blue (#6495ED)
* Level 3: Purple (#A020F0)
### Detailed Analysis
The chart consists of five stacked bars, one for each technology combination. The height of each bar represents the total number of solved tasks. Each bar is divided into three colored segments representing the number of tasks solved at each level.
* **Neo4j (Query + DR):**
* Level 1: Approximately 29 tasks
* Level 2: Approximately 24 tasks
* Level 3: Approximately 4 tasks
* Total: Approximately 57 tasks
* **NetworkX (Query + DR):**
* Level 1: Approximately 27 tasks
* Level 2: Approximately 28 tasks
* Level 3: Approximately 2 tasks
* Total: Approximately 57 tasks
* **NetworkX + Neo4j (with Query only):**
* Level 1: Approximately 28 tasks
* Level 2: Approximately 25 tasks
* Level 3: Approximately 3 tasks
* Total: Approximately 56 tasks
* **NetworkX + Neo4j (with DR only):**
* Level 1: Approximately 26 tasks
* Level 2: Approximately 24 tasks
* Level 3: Approximately 3 tasks
* Total: Approximately 53 tasks
* **Neo4j + NetworkX (Query + DR):**
* Level 1: Approximately 34 tasks
* Level 2: Approximately 33 tasks
* Level 3: Approximately 4 tasks
* Total: Approximately 71 tasks
### Key Observations
* The "Neo4j + NetworkX (Query + DR)" combination consistently outperforms all other combinations in terms of the total number of solved tasks.
* The "NetworkX + Neo4j (with DR only)" combination has the lowest total number of solved tasks.
* Level 1 tasks are generally solved more frequently than Level 2 and Level 3 tasks across all combinations.
* The difference in performance between "Neo4j (Query + DR)" and "NetworkX (Query + DR)" is minimal.
### Interpretation
The data suggests that combining Neo4j and NetworkX with both Query and DR (Direct Retrieval) approaches yields the best results in solving tasks, significantly outperforming other combinations. This indicates a synergistic effect when both technologies are utilized together with both Query and DR methods. The consistent dominance of Level 1 tasks suggests that the task difficulty plays a role, with simpler tasks being solved more readily. The lower performance of the "NetworkX + Neo4j (with DR only)" combination suggests that the Query component is crucial for effective task solving in this context. The chart provides a clear comparison of the effectiveness of different technology combinations and approaches for task solving, highlighting the benefits of a combined strategy.
</details>
Figure 13: Comparison of different fusion types with respect to the task solve operation as well as the graph backend type. We report results for answering 165 GAIA validation questions across different comparison targets. DR stands for Direct Retrieval. Model: GPT-4o mini.
**Graph Backend vs. Task Solve Operation** We provide more detailed results in Figure 13, studying the performance of the following configurations: NetworkX + Neo4j (with query only) and NetworkX + Neo4j (with DR only), as well as Neo4j (query + DR) and NetworkX (query + DR). Overall, fusing backends with DR only offers smaller advantages than the other types of fusion. This indicates that different graph query languages have different strengths, and that fusing them yields the largest combined advantage.
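The fusion used throughout these experiments can be approximated as a per-task majority vote over the answers of the individual configurations. The following is a hedged sketch of the idea, not the exact KGoT implementation; the tie-breaking and the handling of empty answers are our assumptions:

```python
from collections import Counter

def fuse_answers(per_config_answers: list[list[str]]) -> list[str]:
    """Combine answers from several configurations by per-task majority
    vote. Empty answers are ignored; ties resolve to the answer first
    produced by the earliest configuration."""
    fused = []
    for task_answers in zip(*per_config_answers):
        votes = Counter(a for a in task_answers if a)
        if votes:
            fused.append(votes.most_common(1)[0][0])
        else:
            fused.append("")  # no configuration answered this task
    return fused
```

Such a vote explains why fusing complementary configurations (e.g., different backends and solve operations) solves more tasks than any single configuration alone.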
D.3 Runtime
We provide a runtime overview of running KGoT on the validation set of the GAIA benchmark with GPT-4o mini, Neo4j and query-based retrieval in Figure 14. The right part follows the categorization in Appendix C. We provide a more detailed analysis of the runtime in Figure 17.
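The per-component shares shown in Figure 14 can be derived from the logged wall-clock times as simple percentages of the total. A minimal sketch (the component names mirror the figure; the timing values in the test are illustrative):

```python
def runtime_shares(timings: dict[str, float]) -> dict[str, float]:
    """Convert per-component runtimes (in seconds) into percentages of
    the total runtime, as visualized in the donut charts of Figure 14."""
    total = sum(timings.values())
    return {name: 100.0 * t / total for name, t in timings.items()}
```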
<details>
<summary>x24.png Details</summary>

### Visual Description
## Donut Chart: KGoT Runtime Distribution
### Overview
This image presents a donut chart illustrating the distribution of runtime for different components within a system called KGoT. The chart displays the percentage of total runtime allocated to each component, along with the total runtime in seconds.
### Components/Axes
* **Title:** KGoT Runtime Distribution
* **Center Text:** Total Runtime: 35817.29 s
* **Legend:** Located around the donut chart, with labels and corresponding colors:
* tools: Light Blue
* Neo4j: Teal
* control logic: Light Green
* postprocessing: Pale Green
* **Categories:** tools, Neo4j, control logic, postprocessing.
### Detailed Analysis
The donut chart segments represent the runtime contribution of each component. The percentages are as follows:
* **tools:** 71.5% - Represented by a large light blue segment, occupying the majority of the donut.
* **Neo4j:** 11.2% - Represented by a teal segment.
* **control logic:** 11.1% - Represented by a light green segment.
* **postprocessing:** 6.07% - Represented by a pale green segment, the smallest segment in the chart.
The total runtime is stated as 35817.29 seconds.
### Key Observations
* The "tools" component dominates the runtime, consuming approximately 71.5% of the total time.
* The "Neo4j" and "control logic" components contribute roughly the same amount to the runtime, around 11% each.
* "postprocessing" has the smallest runtime contribution, at just over 6%.
### Interpretation
The chart indicates that the "tools" component is the primary performance bottleneck in the KGoT system. Optimizing this component would likely yield the most significant performance improvements. The relatively small runtime contribution of "postprocessing" suggests that optimizing this component would have a limited impact on overall performance. The similar contributions of "Neo4j" and "control logic" suggest that improvements in either of these areas could provide moderate performance gains. The total runtime of approximately 35817 seconds provides a baseline for measuring the impact of any optimizations. The chart effectively visualizes where the system spends its time, allowing for targeted performance analysis and improvement efforts.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
## Donut Chart: KGOT Runtime Distribution
### Overview
This image presents a donut chart illustrating the distribution of runtime for various components within the KGOT system. The chart displays the percentage of total runtime allocated to each component, with the total runtime explicitly stated in seconds.
### Components/Axes
* **Title:** KGOT Runtime Distribution
* **Center Text:** Total Runtime: 35817.29 s
* **Categories (Segments):**
* tool invocations
* tool executor
* solution formatting
* graph executor
* system robustness
* **Percentage Labels:** Each segment is labeled with its corresponding percentage of the total runtime.
* **Color Scheme:**
* tool invocations: Blue
* tool executor: Gray
* solution formatting: Light Green
* graph executor: Teal
* system robustness: Dark Blue
### Detailed Analysis
The donut chart shows a clear dominance of "tool invocations" in terms of runtime consumption. Let's analyze each segment:
1. **tool invocations:** This segment occupies the largest portion of the chart, representing approximately 71.5% of the total runtime.
2. **tool executor:** This segment is relatively small, accounting for approximately 1.76% of the total runtime.
3. **solution formatting:** This segment represents approximately 6.07% of the total runtime.
4. **graph executor:** This segment accounts for approximately 7.06% of the total runtime.
5. **system robustness:** This segment represents approximately 13.6% of the total runtime.
### Key Observations
* The vast majority of the runtime (71.5%) is spent on "tool invocations." This suggests that the process of calling and managing external tools is the most time-consuming aspect of the KGOT system.
* "tool executor" consumes a very small percentage of the runtime, indicating that the execution of individual tools is relatively fast.
* "solution formatting" and "graph executor" contribute moderate amounts to the total runtime.
* "system robustness" accounts for a significant portion (13.6%) of the runtime, suggesting that ensuring the system's stability and reliability is a substantial undertaking.
### Interpretation
The data suggests that optimizing the "tool invocations" process would yield the most significant performance improvements for the KGOT system. This could involve streamlining the tool calling mechanism, reducing the number of tool invocations, or improving the efficiency of the tools themselves. The relatively low runtime of the "tool executor" indicates that the tools themselves are not the primary bottleneck. The substantial runtime dedicated to "system robustness" highlights the importance of maintaining a stable and reliable system, potentially through extensive testing and error handling. The chart provides a clear breakdown of where the system's time is spent, allowing developers to focus their optimization efforts on the most impactful areas. The total runtime of 35817.29 seconds provides a baseline for measuring the effectiveness of any performance improvements.
</details>
Figure 14: Different runtime categorizations of the same data. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
D.4 Compute Resources
Because of the long runtime, we executed most experiments using the OpenAI API as an external resource on server compute nodes containing an AMD EPYC 7742 CPU with 128 cores running at 2.25 GHz and a total memory of 256 GB. However, when the LLM is called as an external resource, KGoT is able to run on commodity hardware with minimal effects on runtime.
Our experiments with locally run LLMs were executed on compute nodes containing four NVIDIA GH200 chips, each with 96 GB of GPU memory, and a total memory of 896 GB. In these cases, the minimum hardware requirements are dictated by the resources needed to run each LLM locally.
High-performance & scalability experiments were performed on an Apple M3 Pro with 12 cores at 4.056GHz and a total memory of 18GB.
D.5 GAIA Result Visualizations
We also implemented automatic scripts that plot various aspects of a GAIA run once it has finished. In the following, we provide example plots for Neo4j with query retrieval.
We provide a breakdown for each level of the GAIA benchmark into the categories that KGoT's answers for the tasks fall into in Figure 15. We measure the runtime and costs of the various components of KGoT and illustrate them in Figure 17. We also provide insights into the tool usage, starting with the number of tasks for which a specific tool is used and whether that task was successful or not (see Figure 16). A more detailed analysis of the tool selection is provided in the plots of Figures 18 and 19, as well as the number of times the tools are used in Figure 20.
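The per-tool success counts behind Figure 16 can be tallied from per-task records. This is a hedged sketch; the record layout is an assumption for illustration, not KGoT's actual log format:

```python
from collections import defaultdict

def tally_tool_success(tasks: list[dict]) -> dict[str, tuple[int, int]]:
    """For each tool, count the (successful, failed) tasks that used it.
    Each task record holds the list of tools used and a success flag."""
    counts = defaultdict(lambda: [0, 0])
    for task in tasks:
        idx = 0 if task["solved"] else 1  # index 0 = success, 1 = failure
        for tool in task["tools"]:
            counts[tool][idx] += 1
    return {tool: (s, f) for tool, (s, f) in counts.items()}
```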
We now provide a brief explanation of the more opaque function names listed in Figure 17.
- Any function marked as not logged refers to function or tool calls that do not incur an LLM-related cost or where usage costs are logged within the tool itself.
- WebSurfer.forward submits a query to SerpApi.
- Define Cypher query given new information constructs a Cypher insert query based on newly gathered information.
- Fix JSON corrects malformed or invalid JSON for services like Neo4j.
- Define forced retrieve queries generates a Cypher retrieval query when the maximum number of iterations is reached.
- Generate forced solution generates a solution based on the state of the knowledge graph if no viable solution has been parsed after a Cypher retrieve, or if the forced retrieval fails after exhausting all iterations.
<details>
<summary>figures/all_plot_all_stats.png Details</summary>

### Visual Description
## Stacked Bar Chart: Error Rate by Level
### Overview
This is a stacked bar chart visualizing the rate of different error types across three levels. The chart displays the percentage of each error type for each level, with the total height of each bar representing 100%. The error types are categorized as "Correct", "Correct forced", "Close call", "Wrong forced", "Other error", and "Wrong".
### Components/Axes
* **X-axis:** "Level" with markers 1, 2, and 3.
* **Y-axis:** "Rate (%)" ranging from 0 to 100.
* **Legend:** Located in the top-left corner, defining the color-coding for each error type:
* Green: Correct
* Light Green: Correct forced
* Blue: Close call
* Orange: Wrong forced
* Yellow: Other error
* Red: Wrong
* Each bar is segmented to represent the proportion of each error type within that level. The number of errors/total attempts is displayed within each segment.
### Detailed Analysis
**Level 1:**
* **Correct (Green):** 37% (20/53). The green segment occupies the lower portion of the bar.
* **Correct forced (Light Green):** 1% (1/53). A very small light green segment at the base.
* **Close call (Blue):** 0% (0/53). No blue segment.
* **Wrong forced (Orange):** 3% (2/53). A small orange segment.
* **Other error (Yellow):** 0% (0/53). No yellow segment.
* **Wrong (Red):** 54% (29/53). The largest segment, occupying the upper portion of the bar.
**Level 2:**
* **Correct (Green):** 20% (18/86).
* **Correct forced (Light Green):** 0% (0/86).
* **Close call (Blue):** 0% (0/86).
* **Wrong forced (Orange):** 5% (5/86).
* **Other error (Yellow):** 0% (0/86).
* **Wrong (Red):** 73% (63/86).
**Level 3:**
* **Correct (Green):** 3% (1/26).
* **Correct forced (Light Green):** 0% (0/26).
* **Close call (Blue):** 3% (1/26).
* **Wrong forced (Orange):** 0% (0/26).
* **Other error (Yellow):** 0% (0/26).
* **Wrong (Red):** 92% (24/26).
### Key Observations
* The proportion of "Wrong" errors increases significantly as the level increases.
* Level 1 has the lowest proportion of "Wrong" answers (54%) and the highest proportion of "Correct" answers (37%).
* Level 3 has the highest proportion of "Wrong" errors (92%) and the lowest proportion of "Correct" errors (3%).
* "Correct forced", "Close call", "Wrong forced", and "Other error" contribute minimally to the overall error rates, especially at levels 2 and 3.
### Interpretation
The data suggests a clear trend: as the level of difficulty increases, the rate of "Wrong" errors dramatically increases, while the rate of "Correct" errors decreases. This indicates that the task becomes significantly more challenging at higher levels. The consistent dominance of the "Wrong" error category at levels 2 and 3 suggests a fundamental difficulty in performing the task correctly at those levels. The low occurrence of "Correct forced", "Close call", "Wrong forced", and "Other error" suggests these categories are less relevant to the overall performance trend. The data could be used to identify areas for improvement in training or task design to reduce the error rate at higher levels. The inclusion of the number of errors/total attempts (e.g., 20/53) provides a sense of the sample size and the reliability of the percentages.
</details>
Figure 15: Number of tasks per level that succeeded or fall into a given error category. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_category_success.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Question Success by GAIA Categories
### Overview
This horizontal bar chart visualizes the success rate of questions categorized by GAIA tools. Each bar represents a tool category, with segments indicating the number of successful and failed questions. The total number of questions is 165.
### Components/Axes
* **Title:** "Question Success by GAIA Categories"
* **Subtitle:** "Total Questions: 165" (positioned below the title)
* **Y-axis:** Lists the GAIA tool categories:
* search\_information\_tools
* calculator
* image\_recognition\_processing\_tools
* pdf\_tools
* spreadsheet\_tools
* text\_processing\_analysis\_tools
* video\_tools
* programming\_code\_tools
* audio\_tools
* document\_access\_tools
* specialized\_tools
* search\_location\_tools
* general\_utilities
* **X-axis:** "Number of Questions" (ranging from 0 to 120)
* **Legend:** Located in the top-right corner:
* Green: "Successful"
* Red: "Failed"
### Detailed Analysis
The chart displays the number of successful and failed questions for each category. The bars are arranged vertically, with the category names on the left.
* **search\_information\_tools:** 98 Successful, 23 Failed
* **calculator:** 36 Successful, 7 Failed
* **image\_recognition\_processing\_tools:** 28 Successful, 2 Failed
* **pdf\_tools:** 10 Successful, 6 Failed
* **spreadsheet\_tools:** 9 Successful, 5 Failed
* **text\_processing\_analysis\_tools:** 8 Successful, 2 Failed
* **video\_tools:** 7 Successful, 2 Failed
* **programming\_code\_tools:** 6 Successful, 1 Failed
* **audio\_tools:** 3 Successful, 0 Failed
* **document\_access\_tools:** 4 Successful, 1 Failed
* **specialized\_tools:** 1 Successful, 0 Failed
* **search\_location\_tools:** 2 Successful, 0 Failed
* **general\_utilities:** 2 Successful, 0 Failed
The bars generally show a clear dominance of successful questions over failed questions in most categories.
### Key Observations
* **Highest Success:** "search\_information\_tools" has the highest number of successful questions (98) and the highest total number of questions (121).
* **Lowest Success:** "specialized\_tools" has the lowest number of successful questions (1).
* **High Failure Rate:** "pdf\_tools" has a relatively high number of failed questions (6) compared to its successful questions (10).
* **Zero Failures:** Several categories ("audio\_tools", "specialized\_tools", "search\_location\_tools", "general\_utilities") have zero failed questions.
### Interpretation
The data suggests that the GAIA system performs exceptionally well in "search\_information\_tools," indicating a strong capability in information retrieval. Categories like "calculator" and "image\_recognition\_processing\_tools" also demonstrate good success rates. However, "pdf\_tools" appears to be an area needing improvement, as it has a noticeable number of failures. The categories with very few questions overall ("specialized\_tools", "search\_location\_tools", "general\_utilities") may not have sufficient data to draw firm conclusions.
The relationship between the categories and their success rates likely reflects the complexity of the tasks involved. Simpler tasks, like basic calculations, may have higher success rates than more complex ones, like processing PDFs. The overall high success rate (165 total questions, with a clear majority being successful) indicates that the GAIA system is generally effective. The data could be used to prioritize development efforts, focusing on improving the performance of categories with lower success rates, such as "pdf\_tools".
</details>
Figure 16: Overview over how many tasks use a given tool and whether they are successful or not. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_cost_summary_cost.png Details</summary>

### Visual Description
## Bar Chart: Tool Execution Times
### Overview
This is a vertical bar chart displaying the execution time of various tools. The x-axis represents the tool name, and the y-axis represents the execution time, measured in an unspecified unit (likely seconds). The chart shows a significant variation in execution times across different tools, with some tools taking considerably longer than others.
### Components/Axes
* **X-axis Label:** Tool Name
* **Y-axis Label:** Execution Time
* **Y-axis Scale:** Ranges from approximately 0.0 to 2.5.
* **Max Value:** $2.41e+00$ (approximately 2.41)
* **Min Value:** $6.63e-04$ (approximately 0.000663)
* **Arithmetic Mean:** $1.86e-01$ (approximately 0.186)
* **Tools (X-axis Categories):**
* `define_cypher_query`
* `SurferTool`
* `define_next_step`
* `parse_solution_with_llm`
* `Wikipedia.get_page_content`
* `define_need_for_math_before_parsing`
* `fix_cypher`
* `define_math_tool_call`
* `WebSurfer.forward`
* `define_tool_calls`
* `merge_reasons_to_insert`
* `define_final_solution`
* `define_retrieve_query`
* `TextInspector`
* `define_forced_to_explore`
* `retrieve_queries`
* `ImageQuestion.run`
* `define_forced_solution`
* `LLMTool.run`
* `generatePythonCodeTool.fix_code`
* `RunPythonCodeTool.fix_json`
* `Wikipedia.ask_LLM_which_article_to_explore`
### Detailed Analysis
The tallest bar corresponds to `define_cypher_query`, with an execution time of approximately 2.41. The shortest bar corresponds to `RunPythonCodeTool.fix_json`, with an execution time of approximately 0.000663.
Here's a breakdown of approximate execution times for each tool, reading from left to right:
* `define_cypher_query`: 2.41
* `SurferTool`: 2.15
* `define_next_step`: 1.85
* `parse_solution_with_llm`: 1.65
* `Wikipedia.get_page_content`: 1.45
* `define_need_for_math_before_parsing`: 1.25
* `fix_cypher`: 1.05
* `define_math_tool_call`: 0.85
* `WebSurfer.forward`: 0.75
* `define_tool_calls`: 0.65
* `merge_reasons_to_insert`: 0.55
* `define_final_solution`: 0.45
* `define_retrieve_query`: 0.35
* `TextInspector`: 0.25
* `define_forced_to_explore`: 0.15
* `retrieve_queries`: 0.12
* `ImageQuestion.run`: 0.09
* `define_forced_solution`: 0.07
* `LLMTool.run`: 0.05
* `generatePythonCodeTool.fix_code`: 0.03
* `RunPythonCodeTool.fix_json`: 0.000663
* `Wikipedia.ask_LLM_which_article_to_explore`: 0.02
The bars generally decrease in height as you move from left to right, although there are some fluctuations.
### Key Observations
* `define_cypher_query` and `SurferTool` are significantly more expensive than all other tools.
* `RunPythonCodeTool.fix_json` is exceptionally cheap compared to the others.
* The costs appear to be somewhat clustered, with a group of tools falling between approximately $0.45 and $1.85.
* There is a large range in costs, suggesting varying levels of complexity or token usage for each tool.
### Interpretation
The chart characterizes the cost profile of a suite of tools, likely within a larger system. The significant differences in cost suggest that some tools are more token-intensive or rely on more expensive model interactions (e.g., summarizing content fetched from the web or Wikipedia). The tools `define_cypher_query` and `SurferTool` dominate the cost and are natural targets for optimization, whereas the negligible cost of `RunPythonCodeTool.fix_json` points to a trivial or highly optimized operation.
The data could be used to identify areas for cost optimization; for example, efforts could be focused on improving the efficiency of `define_cypher_query` and `SurferTool`. The mean cost provides a baseline for evaluating individual tools, and the large gap between the maximum and minimum values indicates a highly skewed distribution, meaning that a few tools dominate the overall cost.
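The skew noted above can be verified from the annotated summary statistics alone; a minimal sketch in Python, using the Max, Arithmetic Mean, and Min values read off the chart:

```python
# Annotated summary statistics from the chart: Max, Arithmetic Mean, Min.
max_val, mean_val, min_val = 2.41, 1.86e-1, 6.63e-4

# A maximum far above the mean is the signature of a right-skewed distribution.
print(round(max_val / mean_val, 1))  # 13.0 -> the maximum is ~13x the mean
print(round(max_val / min_val))      # 3635 -> spread between the two extremes
```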
</details>
(a) Cost in dollar.
<details>
<summary>figures/all_cost_summary_number_of_calls.png Details</summary>

### Visual Description
## Bar Chart: Tool Call Frequency
### Overview
The image presents a bar chart illustrating the frequency of calls to various tools. The x-axis lists the tool names, and the y-axis represents the number of times each tool was called. The chart displays a significant variation in tool call frequencies, with some tools being called much more often than others.
### Components/Axes
* **X-axis Label:** Tool Name
* **Y-axis Label:** Frequency (Implied, no explicit label)
* **Y-axis Scale:** 0 to 2160, with increments of 250.
* **Data Series:** Single series representing the frequency of each tool call.
* **Annotations:**
* "Max: 2160", located at the top-right corner.
* "Arithmetic Mean: 339", located at the top-right corner.
* "Min: 3", located at the top-right corner.
* **Tool Categories (X-axis):**
* `define_next_step`
* `define_need_for_math_before_llm`
* `parse_solution_parsing`
* `define_cypher_query_given_new_fix_cypher`
* `merge_reason_for_tool_calls`
* `define_math_tool_call`
* `run_python_code_NOT_LOGGED`
* `ask_search_agent_NOT_LOGGED`
* `define_final_solution`
* `Wikipedia_det_pace_query`
* `define_retrieve_content`
* `Wikipedia_ask_LLM_generate_article`
* `WebSurfer_forward`
* `TextInspector_solution`
* `llm_query_NOT_LOGGED`
* `LLMTool_run`
* `image_inspect_question_NOT_LOGGED`
* `extract_zincQuestion_run`
* `RunPythonCodeTool_fix_code`
* `AudioTranscriptionLoader_transcribe_audio`
### Detailed Analysis
The chart shows a decreasing trend in tool call frequency as we move from left to right.
* `define_next_step`: Approximately 2100 calls. (Purple bar, tallest)
* `define_need_for_math_before_llm`: Approximately 2050 calls. (Purple bar, second tallest)
* `parse_solution_parsing`: Approximately 1900 calls. (Purple bar)
* `define_cypher_query_given_new_fix_cypher`: Approximately 1750 calls. (Purple bar)
* `merge_reason_for_tool_calls`: Approximately 1600 calls. (Purple bar)
* `define_math_tool_call`: Approximately 1400 calls. (Purple bar)
* `run_python_code_NOT_LOGGED`: Approximately 1200 calls. (Purple bar)
* `ask_search_agent_NOT_LOGGED`: Approximately 1100 calls. (Purple bar)
* `define_final_solution`: Approximately 950 calls. (Purple bar)
* `Wikipedia_det_pace_query`: Approximately 800 calls. (Purple bar)
* `define_retrieve_content`: Approximately 700 calls. (Purple bar)
* `Wikipedia_ask_LLM_generate_article`: Approximately 600 calls. (Purple bar)
* `WebSurfer_forward`: Approximately 500 calls. (Purple bar)
* `TextInspector_solution`: Approximately 400 calls. (Purple bar)
* `llm_query_NOT_LOGGED`: Approximately 350 calls. (Purple bar)
* `LLMTool_run`: Approximately 300 calls. (Purple bar)
* `image_inspect_question_NOT_LOGGED`: Approximately 200 calls. (Purple bar)
* `extract_zincQuestion_run`: Approximately 100 calls. (Purple bar)
* `RunPythonCodeTool_fix_code`: Approximately 50 calls. (Purple bar)
* `AudioTranscriptionLoader_transcribe_audio`: Approximately 3 calls. (Purple bar, shortest)
### Key Observations
* The tool `define_next_step` is called significantly more often than any other tool.
* The tool `AudioTranscriptionLoader_transcribe_audio` is called very rarely.
* The majority of tools are called between 500 and 1500 times.
* The arithmetic mean (339) is far below the maximum (2160), indicating a right-skewed distribution in which a few tools account for a large share of all calls.
### Interpretation
The chart demonstrates a clear hierarchy in tool usage. The tools `define_next_step` and `define_need_for_math_before_llm` appear to be fundamental to the process being monitored, as they are called far more frequently than others. The low call count for `AudioTranscriptionLoader_transcribe_audio` suggests that audio transcription is a relatively infrequent requirement in this workflow. The "NOT_LOGGED" tools may indicate that these calls are not being tracked or are occurring outside of the monitored system. The large difference between the mean and the maximum value indicates that a small number of tools account for a large proportion of the total tool calls. This data could be used to optimize the system by focusing on the most frequently used tools or investigating why certain tools are rarely used. The chart provides insights into the operational characteristics of a system that relies on a suite of tools, highlighting which tools are critical and which may be candidates for improvement or removal.
</details>
(b) Number of calls.
<details>
<summary>figures/all_cost_summary_duration.png Details</summary>

### Visual Description
## Bar Chart: Execution Time of Various Functions
### Overview
This image presents a bar chart visualizing the execution time of a series of functions, likely within a software or system. The x-axis lists the function names, and the y-axis represents the execution time in seconds. The chart displays a significant variation in execution times across different functions, with some functions taking orders of magnitude longer than others.
### Components/Axes
* **X-axis:** Function Names (categorical)
* **Y-axis:** Execution Time (seconds) - Scale ranges from 0 to 12000, with increments of 2000.
* **Title:** Not explicitly present, but the chart represents "Execution Time of Various Functions".
* **Legend:** Not present. The bars are directly labeled with function names.
* **Annotations:**
* "Max: 12237.19 s" - Located in the top-right corner.
* "Arithmetic Mean: 1279.19 s" - Located in the center-right.
* "Min: 0.01 s" - Located in the center-right.
### Detailed Analysis
The chart displays 24 functions with their corresponding execution times. The functions are listed along the x-axis in the following order:
1. `ask_search_agent_NOT_LOGGED`: ~12000 s
2. `SurferTool`: ~9000 s
3. `define_next_step`: ~8000 s
4. `define_math_tool_call`: ~7000 s
5. `define_new_information`: ~6000 s
6. `parse_solution_with_llm`: ~5000 s
7. `define_tool_calls`: ~4000 s
8. `merge_reasons_to_insert`: ~3000 s
9. `WebSurfer.forward`: ~2500 s
10. `inspect_file_as_text_NOT_LOGGED`: ~2000 s
11. `Wikipedia.get_text`: ~1800 s
12. `define_final_content`: ~1600 s
13. `image_inspector.image_query`: ~1400 s
14. `define_page_content`: ~1200 s
15. `run_python_codeQuestion.run`: ~1000 s
16. `Wikipedia.ask_llm_which_article_forced`: ~800 s
17. `RunPythonCodeTool.run`: ~600 s
18. `llm_query_NOT_LOGGED`: ~400 s
19. `tImTool.fix_code`: ~300 s
20. `generate_forced_solution`: ~200 s
21. `AudioTranscriptionLoader.transcribe_audio`: ~100 s
22. `extract_zip_NOT_LOGGED`: ~50 s
23. `define_cypher_query`: ~10 s
24. `cypher_NOT_LOGGED`: ~0.01 s
The tallest bar corresponds to `ask_search_agent_NOT_LOGGED`, with an execution time of approximately 12000 seconds, matching the annotated maximum of 12237.19 s. The shortest bar corresponds to `cypher_NOT_LOGGED` with an execution time of approximately 0.01 seconds. The execution times generally decrease from left to right, though there are fluctuations.
### Key Observations
* The functions `ask_search_agent_NOT_LOGGED` and `SurferTool` are significant outliers, taking substantially longer to execute than the other functions.
* The execution times are highly variable, spanning roughly six orders of magnitude (from 0.01 s to 12237.19 s).
* The arithmetic mean (1279.19 s) is heavily influenced by the few functions with very long execution times.
* The distribution of execution times is right-skewed, with a long tail of functions that take longer to execute.
### Interpretation
The chart suggests that the execution time of these functions varies dramatically. The `Surfertool` function appears to be a performance bottleneck, potentially due to its complexity or the resources it requires. The large difference in execution times indicates that optimizing the slower functions could significantly improve the overall system performance. The right-skewed distribution suggests that a small number of functions are responsible for the majority of the execution time. Further investigation into the `Surfertool` function and other slow functions is warranted to identify opportunities for optimization. The presence of "NOT_LOGGED" in some function names suggests that logging may not be enabled for those functions, which could hinder performance analysis. The chart provides a clear visual representation of the performance characteristics of these functions, enabling developers to prioritize optimization efforts effectively.
</details>
(c) Duration in seconds.
<details>
<summary>figures/all_cost_summary_cost_token.png Details</summary>

### Visual Description
## Bar Chart: Tool Call Cost
### Overview
This is a vertical bar chart displaying the cost associated with different tool calls. The y-axis represents the cost, scaled by a factor of 10<sup>-7</sup>, and the x-axis lists the names of various tool calls. The chart visually compares the relative cost of each tool call, with taller bars indicating higher costs.
### Components/Axes
* **Y-axis Title:** "x10<sup>-7</sup>" (indicating the cost scale)
* **Y-axis Range:** Approximately 0 to 4.5
* **X-axis Title:** Tool Call Names (listed below)
* **Maximum Value Label:** "Max: $4.75e-07" (located at the top-right)
* **Minimum Value Label:** "Min: $1.02e-07" (located at the bottom-right)
* **Tool Call Categories (X-axis):**
* `LLMTool_run`
* `define_math_tool_call`
* `RunPythonCodeTool_run`
* `stageQuestion_fix_code`
* `fix_json`
* `fix_cypher`
* `TextInspector`
* `merge_reasons_to_insert`
* `generate_forced_solution`
* `define_final_solution`
* `WebSurferForward`
* `define_math_before_parsing`
* `parse_solution_with_llm`
* `Wikipedia_get_page_content`
* `define_need_for_reasoning`
* `Wikipedia_ask_LLM_which_article_to_explore`
* `define_forced_retrieve_queries`
* `define_retrieve_query`
* `SurferTool`
* `define_next_step`
* `define_tool_calls`
### Detailed Analysis
The bars are arranged in descending order of cost along the x-axis. The height of each bar represents the cost of the corresponding tool call, scaled by 10<sup>-7</sup>.
* **`LLMTool_run`:** Approximately 4.15 x 10<sup>-7</sup>
* **`define_math_tool_call`:** Approximately 3.85 x 10<sup>-7</sup>
* **`RunPythonCodeTool_run`:** Approximately 3.6 x 10<sup>-7</sup>
* **`stageQuestion_fix_code`:** Approximately 3.2 x 10<sup>-7</sup>
* **`fix_json`:** Approximately 2.9 x 10<sup>-7</sup>
* **`fix_cypher`:** Approximately 2.7 x 10<sup>-7</sup>
* **`TextInspector`:** Approximately 2.5 x 10<sup>-7</sup>
* **`merge_reasons_to_insert`:** Approximately 2.3 x 10<sup>-7</sup>
* **`generate_forced_solution`:** Approximately 2.1 x 10<sup>-7</sup>
* **`define_final_solution`:** Approximately 1.9 x 10<sup>-7</sup>
* **`WebSurferForward`:** Approximately 1.7 x 10<sup>-7</sup>
* **`define_math_before_parsing`:** Approximately 1.6 x 10<sup>-7</sup>
* **`parse_solution_with_llm`:** Approximately 1.4 x 10<sup>-7</sup>
* **`Wikipedia_get_page_content`:** Approximately 1.3 x 10<sup>-7</sup>
* **`define_need_for_reasoning`:** Approximately 1.2 x 10<sup>-7</sup>
* **`Wikipedia_ask_LLM_which_article_to_explore`:** Approximately 1.15 x 10<sup>-7</sup>
* **`define_forced_retrieve_queries`:** Approximately 1.1 x 10<sup>-7</sup>
* **`define_retrieve_query`:** Approximately 1.08 x 10<sup>-7</sup>
* **`SurferTool`:** Approximately 1.05 x 10<sup>-7</sup>
* **`define_next_step`:** Approximately 1.03 x 10<sup>-7</sup>
* **`define_tool_calls`:** Approximately 1.02 x 10<sup>-7</sup>
### Key Observations
* `LLMTool_run` has the highest cost, significantly exceeding other tool calls.
* `define_tool_calls` has the lowest cost.
* The costs are relatively clustered between approximately 1.0 x 10<sup>-7</sup> and 3.0 x 10<sup>-7</sup> for most tool calls.
* There is a noticeable drop in cost from the top three tool calls to the rest.
### Interpretation
The chart demonstrates the varying computational cost associated with different tool calls within a system. The substantial cost of `LLMTool_run` suggests that this tool call is particularly resource-intensive, potentially due to the complexity of the underlying Large Language Model operations. The lower costs of other tool calls indicate they are relatively efficient. This information is valuable for optimizing system performance and resource allocation. For example, if the system is cost-sensitive, developers might explore ways to reduce the usage of `LLMTool_run` or optimize its implementation. The wide range of costs suggests that different tool calls serve different purposes and have varying levels of complexity. The minimum and maximum values provide a clear range of cost expectations for tool calls.
</details>
(d) Cost per token in dollar.
<details>
<summary>figures/all_cost_summary_cost_second.png Details</summary>

### Visual Description
## Bar Chart: Tool Cost per Second
### Overview
This is a vertical bar chart displaying the cost per second (in dollars per second, scaled by 10^-4) incurred by various tools, likely within a Large Language Model (LLM) context. The y-axis represents the cost per second, and the x-axis lists the tool names. The chart shows a significant variation across tools, with some tools incurring a much higher cost rate than others.
### Components/Axes
* **X-axis Label:** Tool Name
* **Y-axis Label:** Cost per Second ($/s, x10^-4)
* **Y-axis Scale:** Ranges from approximately 0 to 3.8 x 10^-4, with increments of 0.5 x 10^-4.
* **Maximum Value Indicator:** "Max: 3.79e-04" positioned near the top-right corner.
* **Minimum Value Indicator:** "Min: 3.26e-05" positioned near the bottom-right corner.
* **Tools (X-axis Categories):**
* Wikipedia.ask\_LLM\_which\_article\_to\_explore
* Wikipedia.get\_page\_content
* SurferTool
* WebSurferTool
* generate\_forced\_solution
* parse\_solution\_with\_llm
* define\_next\_step
* define\_final\_solution
* define\_retrieve\_queries
* define\_retrieve\_calls
* define\_tool\_calls
* TextInspector
* define\_new\_information
* fix\_json
* merge\_reasons\_to\_insert
* fix\_code
* ImageQuestion.run
* define\_math\_tool\_call
* LLMTool.run
* RunPythonCodeTool.run
### Detailed Analysis
The bars represent the cost per second incurred by each tool. The trend is a steep decline in cost rate as you move from left to right across the chart.
* **Wikipedia.ask\_LLM\_which\_article\_to\_explore:** Approximately 3.75 x 10^-4
* **Wikipedia.get\_page\_content:** Approximately 3.4 x 10^-4
* **SurferTool:** Approximately 2.9 x 10^-4
* **WebSurferTool:** Approximately 2.5 x 10^-4
* **generate\_forced\_solution:** Approximately 2.2 x 10^-4
* **parse\_solution\_with\_llm:** Approximately 1.9 x 10^-4
* **define\_next\_step:** Approximately 1.6 x 10^-4
* **define\_final\_solution:** Approximately 1.4 x 10^-4
* **define\_retrieve\_queries:** Approximately 1.2 x 10^-4
* **define\_retrieve\_calls:** Approximately 1.0 x 10^-4
* **define\_tool\_calls:** Approximately 0.9 x 10^-4
* **TextInspector:** Approximately 0.8 x 10^-4
* **define\_new\_information:** Approximately 0.7 x 10^-4
* **fix\_json:** Approximately 0.6 x 10^-4
* **merge\_reasons\_to\_insert:** Approximately 0.5 x 10^-4
* **fix\_code:** Approximately 0.4 x 10^-4
* **ImageQuestion.run:** Approximately 0.3 x 10^-4
* **define\_math\_tool\_call:** Approximately 0.2 x 10^-4
* **LLMTool.run:** Approximately 0.1 x 10^-4
* **RunPythonCodeTool.run:** Approximately 0.03 x 10^-4 (3.26 x 10^-5)
### Key Observations
* The "Wikipedia.ask\_LLM\_which\_article\_to\_explore" tool incurs a significantly higher cost per second than any other tool.
* The cost rate declines rapidly after the first few tools.
* "RunPythonCodeTool.run" has the lowest cost per second by a considerable margin.
* The difference between the highest and lowest rates is substantial, indicating a highly skewed distribution of cost per second across tools.
### Interpretation
The data suggests that the Wikipedia-related tools are the most expensive per unit of execution time, which is consistent with them issuing dense sequences of LLM calls while they run. Tools further down the list, such as "LLMTool.run" and "RunPythonCodeTool.run", spend comparatively little per second, either because their runtime is dominated by non-LLM work (e.g., local code execution) or because their LLM calls are short. The steep decline across the chart indicates that a small number of tools account for most of the spending rate, making them the natural starting points for cost optimization.
</details>
(e) Cost per time in dollar/s.
<details>
<summary>figures/all_cost_summary_tokens_per_second.png Details</summary>

### Visual Description
## Bar Chart: Tool Token Throughput
### Overview
The image presents a bar chart displaying the token throughput (in tokens per second) for various tools. The chart visually compares the tools, with the height of each bar representing the number of tokens processed per second. The x-axis lists the tool names, and the y-axis represents the throughput in tokens per second.
### Components/Axes
* **X-axis Label:** Tool Name
* **Y-axis Label:** Tokens per Second
* **Y-axis Scale:** 0 to 2750 tokens/s, with increments of 500.
* **Maximum Value:** 2731.51 tokens/s (displayed at the top-right of the chart)
* **Minimum Value:** 68.70 tokens/s (displayed at the bottom-right of the chart)
* **Bar Color:** Green (all bars are the same color)
* **Tool Names (X-axis):**
* Wikipedia_ask_LLM_which_article_to_explore
* Wikipedia_get_page_content
* WebSurferTool
* WebSurfer_forward
* define_need_for_math_before_parsing
* generate_forced_solution
* parse_solution_with_llm
* define_next_step
* define_tool_calls
* define_forced_queries
* define_retrieve_query
* define_final_solution
* define_reasons_to_insert
* merge_reasons_new_information
* TextInspector
* RunPythonCodeTool
* fix_code
* fix_json
* ImageQuestion_run
* define_cypher
* define_math_tool_call
* LLMTool_run
### Detailed Analysis
The chart displays the token throughput for 22 different tools. The bars are arranged in descending order of throughput from left to right, with some minor variations.
* **Wikipedia\_ask\_LLM\_which\_article\_to\_explore:** Approximately 2700 tokens/s.
* **Wikipedia\_get\_page\_content:** Approximately 2600 tokens/s.
* **WebSurferTool:** Approximately 2400 tokens/s.
* **WebSurfer\_forward:** Approximately 2300 tokens/s.
* **define\_need\_for\_math\_before\_parsing:** Approximately 2200 tokens/s.
* **generate\_forced\_solution:** Approximately 2100 tokens/s.
* **parse\_solution\_with\_llm:** Approximately 1900 tokens/s.
* **define\_next\_step:** Approximately 1700 tokens/s.
* **define\_tool\_calls:** Approximately 1600 tokens/s.
* **define\_forced\_queries:** Approximately 1400 tokens/s.
* **define\_retrieve\_query:** Approximately 1300 tokens/s.
* **define\_final\_solution:** Approximately 1100 tokens/s.
* **define\_reasons\_to\_insert:** Approximately 900 tokens/s.
* **merge\_reasons\_new\_information:** Approximately 700 tokens/s.
* **TextInspector:** Approximately 500 tokens/s.
* **RunPythonCodeTool:** Approximately 400 tokens/s.
* **fix\_code:** Approximately 300 tokens/s.
* **fix\_json:** Approximately 250 tokens/s.
* **ImageQuestion\_run:** Approximately 200 tokens/s.
* **define\_cypher:** Approximately 150 tokens/s.
* **define\_math\_tool\_call:** Approximately 100 tokens/s.
* **LLMTool\_run:** Approximately 70 tokens/s.
The trend is generally decreasing from left to right, indicating that the tools listed later in the sequence have lower throughput.
### Key Observations
* The tool "Wikipedia\_ask\_LLM\_which\_article\_to\_explore" has the highest token throughput, significantly exceeding the others.
* "LLMTool\_run" has the lowest throughput.
* There's a large disparity in throughput, with the highest rate nearly 40 times the lowest.
* The first seven tools all exceed 1500 tokens/s.
* The last five tools all fall below 300 tokens/s.
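The disparity between the extremes can be checked directly from the annotated Max and Min values; a minimal sketch:

```python
# Annotated extremes read off the chart.
max_val, min_val = 2731.51, 68.70

# Ratio between the tallest and shortest bars.
print(round(max_val / min_val, 1))  # 39.8 -> the tallest bar is nearly 40x the shortest
```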
### Interpretation
The chart demonstrates a significant variation in token throughput across the tools. The Wikipedia and web-surfing tools achieve the highest rates, which is consistent with them ingesting and emitting large amounts of text per call. The tools towards the end of the chart, such as "define\_math\_tool\_call" and "LLMTool\_run", process comparatively few tokens per second, suggesting short prompts and responses, or runtimes dominated by factors other than token generation.
The near 40x gap between the extremes suggests that per-tool throughput is driven mainly by how text-heavy each tool's LLM interactions are. The outlier, "Wikipedia\_ask\_LLM\_which\_article\_to\_explore", warrants further investigation to understand the source of its unusually high throughput.
</details>
(f) Tokens per second.
Figure 17: Overview over the execution time as well as the cost in dollar. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_match.png Details</summary>

### Visual Description
## Stacked Bar Chart: Tool Choice Correctness Analysis
### Overview
The image presents a stacked bar chart visualizing the correctness of tool choices, likely in response to a set of questions. The chart displays the distribution of "Wrong Tool Choice", "Partially Correct (Low Match)", "Partially Correct (Medium Match)", and "Correct Tool Choice" across a total of 165 questions analyzed. The chart consists of a single vertical stacked bar, with the y-axis representing the number of questions and the bar segments representing the categories of correctness.
### Components/Axes
* **Title:** "Tool Choice Correctness Analysis" (centered at the top)
* **Y-axis Label:** "Number of Questions" (left side)
* **X-axis:** Implicitly represents the categories of tool choice correctness.
* **Legend:** Located in the top-right corner, with the following entries:
* Red: "Wrong Tool Choice"
* Yellow: "Partially Correct (Low Match)"
* Orange: "Partially Correct (Medium Match)"
* Green: "Correct Tool Choice"
* **Total Questions Analyzed:** "Total Questions Analyzed: 165" (bottom center)
* **Percentage Labels:** Displayed within each segment of the stacked bar, indicating the percentage of questions falling into each category.
### Detailed Analysis
The chart is composed of four stacked segments, each representing a category of tool choice correctness. The segments are stacked vertically, starting from the bottom with "Correct Tool Choice" and moving upwards through "Partially Correct (Medium Match)", "Partially Correct (Low Match)", and finally "Wrong Tool Choice".
* **Correct Tool Choice (Green):** The green segment occupies the bottom portion of the chart. It represents 36.4% of the total questions, which corresponds to approximately 60 questions (0.364 * 165 ≈ 60).
* **Partially Correct (Medium Match) (Orange):** The orange segment sits above the green segment. It represents 35.8% of the total questions, which corresponds to approximately 59 questions (0.358 * 165 ≈ 59).
* **Partially Correct (Low Match) (Yellow):** The yellow segment is above the orange segment. It represents 10.9% of the total questions, which corresponds to approximately 18 questions (0.109 * 165 ≈ 18).
* **Wrong Tool Choice (Red):** The red segment is at the top of the chart. It represents 17.0% of the total questions, which corresponds to approximately 28 questions (0.170 * 165 ≈ 28).
The sum of the percentages is 36.4% + 35.8% + 10.9% + 17.0% = 100.1%, which is slightly off due to rounding.
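The per-segment counts above follow directly from the labeled percentages; a quick sketch reproducing them (category labels taken from the legend):

```python
total_questions = 165

# Percentages as labeled on the chart segments.
shares = {
    "Correct Tool Choice": 36.4,
    "Partially Correct (Medium Match)": 35.8,
    "Partially Correct (Low Match)": 10.9,
    "Wrong Tool Choice": 17.0,
}

# Convert each percentage into an approximate question count.
counts = {label: round(total_questions * pct / 100) for label, pct in shares.items()}
print(counts)  # -> 60, 59, 18, 28 respectively

# The labels sum to 100.1% rather than 100% because of per-segment rounding.
print(round(sum(shares.values()), 1))  # 100.1
```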
### Key Observations
* The largest proportion of questions (36.4%) resulted in a "Correct Tool Choice".
* "Partially Correct (Medium Match)" is very close to "Correct Tool Choice" at 35.8%.
* "Wrong Tool Choice" represents the smallest proportion of questions (17.0%).
* "Partially Correct (Low Match)" represents a relatively small proportion of questions (10.9%).
### Interpretation
The data suggests that, overall, tool choices are reasonably accurate, with a majority of questions resulting in either a correct or partially correct response. The close proximity of "Correct Tool Choice" and "Partially Correct (Medium Match)" indicates that while many choices are fully appropriate, a significant number are almost suitable, suggesting a potential for refinement or further training. The relatively low percentage of "Wrong Tool Choice" is encouraging, but still warrants attention to identify the root causes of these errors. The distinction between "Low Match" and "Medium Match" partial correctness suggests a graded scale of appropriateness, which could be valuable for targeted improvement efforts. The total number of questions analyzed (165) provides a reasonable sample size for drawing conclusions, but further analysis with a larger dataset could strengthen the findings. The chart provides a clear visual representation of the distribution of tool choice correctness, facilitating quick identification of areas for improvement.
</details>
Figure 18: Analysis of the tool selection. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_choice_analysis.png Details</summary>

### Visual Description
## Alluvial Diagram: Tool Correctness to Question Success Analysis
### Overview
This image presents an alluvial diagram visualizing the relationship between "Tool Choice" and "GAIA Question" outcomes (Successful or Failed). The diagram uses colored bands to represent different levels of tool correctness (ToolMatch.CORRECT, ToolMatch.PARTIAL_LOW, ToolMatch.PARTIAL_MEDIUM, ToolMatch.WRONG) and their corresponding flow to either a successful or failed question outcome. The width of each band represents the number of instances (N).
### Components/Axes
* **X-axis:** Represents the two categories: "Tool Choice" on the left and "GAIA Question" on the right.
* **Y-axis:** Implicitly represents the different levels of tool correctness.
* **Legend:** Located in the top-left corner, it defines the color coding for each tool correctness level:
* Yellow: ToolMatch.PARTIAL_LOW (N = 18)
* Green: ToolMatch.CORRECT (N = 60)
* Orange: ToolMatch.PARTIAL_MEDIUM (N = 59)
* Red: ToolMatch.WRONG (N = 28)
* **GAIA Question Outcomes:**
* Blue: Failed (N = 125)
* Gray: Successful (N = 40)
### Detailed Analysis
The diagram shows the flow from each tool correctness level to either a successful or failed GAIA question.
* **ToolMatch.PARTIAL_LOW (Yellow, N=18):** Approximately 15 instances flow to "Failed" and 3 instances flow to "Successful".
* **ToolMatch.CORRECT (Green, N=60):** The majority (approximately 55 instances) flow to "Successful", while a smaller portion (approximately 5 instances) flow to "Failed".
* **ToolMatch.PARTIAL_MEDIUM (Orange, N=59):** A significant portion (approximately 50 instances) flow to "Failed", and a smaller portion (approximately 9 instances) flow to "Successful".
* **ToolMatch.WRONG (Red, N=28):** Almost all instances (approximately 25 instances) flow to "Failed", with only a very small number (approximately 3 instances) flowing to "Successful".
### Key Observations
* The "Failed" outcome (blue) has a significantly higher total count (N=125) than the "Successful" outcome (N=40).
* "ToolMatch.CORRECT" has the highest number of instances (N=60) and predominantly leads to "Successful" outcomes.
* "ToolMatch.WRONG" almost exclusively leads to "Failed" outcomes.
* "ToolMatch.PARTIAL_LOW" and "ToolMatch.PARTIAL_MEDIUM" have a more balanced flow to both "Successful" and "Failed" outcomes, but with a stronger tendency towards "Failed".
### Interpretation
The data suggests a strong correlation between tool correctness and question success. When the tool choice is correct ("ToolMatch.CORRECT"), the question is much more likely to be answered successfully. Conversely, when the tool choice is wrong ("ToolMatch.WRONG"), the question is almost always answered incorrectly. Partial matches ("ToolMatch.PARTIAL_LOW" and "ToolMatch.PARTIAL_MEDIUM") result in a more mixed outcome, indicating that the level of partial correctness influences the likelihood of success.
The large number of "Failed" outcomes overall suggests that the tool selection process is not consistently accurate, or that even with a correct tool, there are other factors contributing to question failure. The diagram highlights the importance of accurate tool selection for achieving successful question answering. The flow of bands visually demonstrates the probabilistic relationship between tool choice and outcome, with wider bands indicating a higher probability. The diagram is a clear visualization of the impact of tool correctness on the overall success rate of the GAIA question answering system.
</details>
Figure 19: Analysis of the tool selection. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.
<details>
<summary>figures/all_tool_usage_count.png Details</summary>

### Visual Description
## Pie Chart: KGoT Tool Usage Distribution
### Overview
This image presents a pie chart illustrating the distribution of tool usage within the KGoT (Knowledge Graph of Thoughts) system. The chart details the percentage of usage for six unique tools across 165 GAIA questions, with a total tool usage count of 173.
### Components/Axes
* **Title:** "KGoT Tool Usage Distribution"
* **Subtitle:** "6 unique tools for 165 GAIA questions"
* **Total Tool Usage Count:** 173 (displayed in the center of the pie chart)
* **Categories (Tools):**
* ask_search_agent
* inspect_file_as_text
* llm_query
* image_inspector
* run_python_code
* extract_zip
* **Values:** Percentage of total tool usage for each tool.
### Detailed Analysis
The pie chart segments represent the proportion of times each tool was used. The largest segment, colored a deep blue, represents `ask_search_agent`. The segments are arranged clockwise, starting with `ask_search_agent` at the bottom.
* **ask_search_agent:** The largest segment, occupying approximately 61.3% of the pie chart.
* **inspect_file_as_text:** Occupies approximately 15.6% of the pie chart, colored a lighter blue.
* **llm_query:** Occupies approximately 11% of the pie chart, colored a teal.
* **image_inspector:** Occupies approximately 5.78% of the pie chart, colored a light green.
* **run_python_code:** Occupies approximately 5.2% of the pie chart, colored a pale yellow.
* **extract_zip:** Occupies approximately 1.16% of the pie chart, colored a light orange.
### Key Observations
* `ask_search_agent` is by far the most frequently used tool, accounting for over 60% of all tool usage.
* `extract_zip` is the least used tool, representing a very small fraction of the total usage.
* The usage of the other tools (`inspect_file_as_text`, `llm_query`, `image_inspector`, `run_python_code`) is relatively similar, though `inspect_file_as_text` is notably more used than the others.
### Interpretation
The data suggests that the primary method for addressing GAIA questions within the KGoT system is through the `ask_search_agent` tool. This indicates that search-based approaches are the most common strategy for answering these questions. The low usage of `extract_zip` suggests that dealing with zipped files is a rare requirement in the context of these GAIA questions. The moderate usage of `inspect_file_as_text` and `llm_query` indicates that file inspection and large language model queries are important, but less dominant, components of the workflow. The relatively similar usage of `image_inspector` and `run_python_code` suggests that image analysis and code execution are used at similar rates. The total tool usage count (173) being slightly higher than the number of GAIA questions (165) suggests that some questions may have involved the use of multiple tools.
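The closing observation, that 173 tool usages across 165 questions implies some questions used more than one tool, can be reproduced along with the per-slice counts; a quick sketch using the percentages read off the chart:

```python
total_usages, total_questions = 173, 165

# Tool shares as labeled on the pie chart slices.
shares = {
    "ask_search_agent": 61.3,
    "inspect_file_as_text": 15.6,
    "llm_query": 11.0,
    "image_inspector": 5.78,
    "run_python_code": 5.2,
    "extract_zip": 1.16,
}

# Approximate absolute call counts behind each slice.
counts = {tool: round(total_usages * pct / 100) for tool, pct in shares.items()}
print(counts)  # ask_search_agent accounts for roughly 106 of the 173 calls

# More tool calls than questions -> some questions used more than one tool.
print(round(total_usages / total_questions, 2))  # 1.05 tool calls per question
```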
</details>
Figure 20: Analysis of the tool usage. Graph storage: Neo4j. Retrieval type: query. Model: GPT-4o mini.