# Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
**Authors**: Mitsubishi Electric Corp., Japan
Abstract
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select "Monster Hunter: World" as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multiple modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search for relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research. The dataset and code are available at https://github.com/wbw520/MH-MMKG.
1 Introduction
Recent multimodal large language models (MLLMs) [54], particularly closed-source ones [1, 2], have demonstrated human-like multimodal capabilities, achieving outstanding performance on benchmarks related to commonsense [42], scientific facts [55], etc. Meanwhile, for domain-specific tasks or rarely seen data [14, 9], relying solely on their own perception and built-in knowledge is inadequate for predicting accurate answers and is hardly interpretable [49, 19].
To alleviate these issues, multimodal RAG (mRAG) [57, 58] has been introduced to improve MLLM performance through heuristic [20, 10, 6] or agent-based methods [44, 33]. However, as with RAG for LLMs [18], such knowledge retrieval approaches often suffer from knowledge redundancy and low relevance [40]. As a result, knowledge graph retrieval [28, 15] has gained increasing attention. Knowledge graphs (KGs) [24], which store knowledge in a structured and comprehensive manner, offer models more context-rich information. As a substantial multimodal extension, multimodal KGs (MMKGs) [13, 38] can provide a knowledge foundation for enhancing MLLMs.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Zinogre Attack Prediction System
### Overview
This diagram illustrates a system for predicting Zinogre's attacks in a game, likely Monster Hunter. It depicts a flow of information from the game battle screen to a model (MLLM) that retrieves knowledge and predicts possible attack continuations, ultimately informing knowledgeable players. The diagram is divided into sections representing input, processing, and output.
### Components/Axes
The diagram contains the following key components:
* **Input:** Game Battle Screen (images of Zinogre in action), Controller icon with a question mark.
* **Processing:** MLLM (Model), Knowledge Retrieval, Attack Flow Diagram.
* **Output:** Predicted Attack Sequence (Fist Combo, Tail Slam, Back Slam, 360 Spin), Knowledgeable Players (represented by player icons), MH-MMKG.
* **Labels:** "Zinogre", "Super Charged", "Headbutt", "Back Jump", "Counter Attack", "Fist Combo", "Tail Slam", "Back Slam", "360 Spin", "atk of", "cont.", "phase", "Based on the battle screen, what are Zinogre possible continues attacks?", "Zinogre is going to unleash the Counter Attack action. Based on this, it speculated that in its Super Charged state, it may follow up with a Fist Combo, or continues attacks including a Tail Slam and a Back Slam."
* **Icons:** Monster Hunter logo, robot icon representing MLLM, book icon representing knowledge.
### Detailed Analysis or Content Details
The diagram shows a flow of information starting from the game battle screen.
1. **Input:** The battle screen images of Zinogre are presented as input, along with a question: "Based on the battle screen, what are Zinogre possible continues attacks?". A controller icon suggests player input.
2. **MLLM Processing:** This input is fed into an MLLM (Model), represented by a robot icon. The MLLM processes the information and retrieves knowledge. A checkmark indicates a successful prediction.
3. **Knowledge Retrieval & Attack Flow:** The core of the system is a directed graph illustrating Zinogre's attack patterns.
* Zinogre starts in a "phase" of "Super Charged".
* From "Super Charged", Zinogre can transition to "Counter Attack" (green arrow labeled "atk of"), "Headbutt" (red arrow), or "Back Jump" (red arrow).
* "Counter Attack" leads to "Fist Combo" (green arrow labeled "cont.").
* "Tail Slam" leads to "360 Spin" (green arrow labeled "cont.").
* "Headbutt" leads to "Tail Slam" (green arrow labeled "cont.").
* "Back Jump" leads to "Back Slam" (green arrow labeled "cont.").
4. **Output:** The predicted attack sequences ("Fist Combo", "Tail Slam", "Back Slam", "360 Spin") are displayed alongside images of Zinogre performing those attacks. These predictions are then conveyed to "Knowledgeable Players" (represented by multiple player icons). The output also connects to "MH-MMKG".
### Key Observations
* The diagram emphasizes the prediction of attack continuations based on Zinogre's "Super Charged" state.
* The use of color-coded arrows (green for continuation, red for alternative paths) clearly indicates the flow of attack sequences.
* The system appears to leverage both visual input (battle screen images) and knowledge retrieval to make predictions.
* The "MH-MMKG" component is not fully explained but seems to be a knowledge base or system related to Monster Hunter.
### Interpretation
This diagram represents a system designed to assist players in anticipating Zinogre's attacks in Monster Hunter. The MLLM acts as a bridge between the game's visual information and a knowledge base of attack patterns. By analyzing the battle screen, the MLLM can predict likely attack continuations, providing players with a strategic advantage. The diagram highlights the importance of understanding Zinogre's "Super Charged" state as a key indicator of future actions. The system's effectiveness relies on the accuracy of the knowledge retrieval and the MLLM's ability to correctly interpret the visual cues from the battle screen. The inclusion of "MH-MMKG" suggests a broader knowledge management system supporting the attack prediction process. The diagram is a conceptual illustration of a complex system, likely used for research or development purposes. It doesn't provide numerical data, but rather a qualitative representation of the system's functionality.
</details>
Figure 1: Our MH-MMKG is curated by knowledgeable players. By leveraging a multi-agent retriever, the MLLM's responses can be augmented.
One major challenge in augmenting MLLMs with MMKGs is the unavailability of robust benchmarks. Existing work primarily relies on VQA datasets [29] or web resources [5, 17] for constructing MMKGs. It is highly plausible that such knowledge has already been learned by current MLLMs, making these benchmarks less useful in evaluating the effectiveness of MMKGs. Additionally, existing MMKGs primarily incorporate text and images, lacking more information-rich modalities such as video [39], which reduces their suitability for increasingly complex multimodal tasks.
Another challenge is how to retrieve knowledge accurately. The mainstream approach embeds entities in MMKGs to find relevant subgraphs [23, 21, 31, 30]. However, training an effective retriever is data-intensive and unrealistic for low-resource domains. In addition, incorporating graphs about rarely seen domains into powerful closed-source models remains impractical.
This work aims to explore the ability of MLLMs to tackle domain-specific tasks by leveraging well-structured external knowledge sources. To achieve this, we build a testbed around a well-known game title [9], as illustrated in Figure 1, for two reasons. First, the visual modality exhibits a large gap from real-world scenes: the game's visual world is computer-graphics-generated, populated mostly with imaginary cultures, creatures, etc., to which MLLMs' perception modules are not well exposed. Second, the knowledge of the game's world can differ from the real world, which makes the MLLMs' built-in knowledge less useful. We use "Monster Hunter: World" for our testbed, as the game series offers abundant knowledge about its fantasy world. Since it is one of the most popular game titles, current MLLMs have already learned some knowledge about it; we will experimentally show the impact of this.
To this end, we construct MH-MMKG, which integrates text, images, videos, and intricate entity relations curated by knowledgeable players. Additionally, we design 238 carefully crafted question-answer pairs as a benchmark that covers various sub-tasks, including fine-grained visual cognition, conditional reasoning, etc. Importantly, the knowledge required to answer these questions is encompassed in MH-MMKG: an MLLM can answer most questions with perfect knowledge retrieval and comprehension. MH-MMKG, as well as the benchmark, offers a new dimension of challenges for the community: it requires flexible perception to comprehend the visual modality in an (almost) unseen domain, as well as reasoning that does not rely on the built-in knowledge of the MLLM.
Multimodal knowledge retrieval over a graph is the foundation for tackling this new set of challenges. We thus develop a multi-agent method that harnesses the self-searching capabilities [26, 45, 16] of MLLMs, allowing them to autonomously retrieve relevant knowledge without training. On both leading closed- and open-source MLLMs, we experimentally show the effectiveness of our method, qualifying it as a strong baseline.
Our contribution lies in developing a high-quality benchmark and exploring MLLMs' ability to find knowledge in an MMKG for solving rarely seen domain-specific tasks. These tasks require not only advanced multimodal perception but also a deep understanding of complex dependencies, conditions, and rules within MH-MMKG. Our benchmark, together with the baseline, provides a strong foundation for advancing MLLMs toward real-world challenges.
2 Related Works
2.1 Multimodal Retrieval Augmented Generation
Recent literature shows a growing surge of interest in MLLMs [54]. Despite their advancements, even sophisticated models like GPT-4o face difficulty in handling domain-specific tasks [14, 9] or reasoning [49, 19]. To mitigate these issues, mRAG [57, 58] seeks to enhance AI systems by providing more reliable, comprehensive, accurate, and up-to-date knowledge from external sources [42, 11].
Heuristic mRAG often relies on predefined retrieval strategies that ground multiple modalities into a single primary modality [20, 10, 6, 7], which limits their capacity to deliver precise and contextually rich knowledge. Recent studies [44, 51, 33] propose more adaptable pipelines for knowledge retrieval, utilizing multi-agent cooperation [46] to harness the model's intrinsic capacity for knowledge exploration. Despite differences in strategy, all these methods depend on retrieving unstructured external knowledge sources, leading to knowledge redundancy and a lack of precision [40]. In contrast, KGs offer a structured format for knowledge storage [24]; MMKGs [60] in particular can more efficiently support MLLMs in enhancing their responses.
2.2 Multimodal Knowledge Graph
Compared to KGs, MMKGs incorporate diverse multimodal data, making them more suitable for complex scenarios that require multimodal collaboration. The evolution from conventional KGs to MMKGs has also been extensively explored [13, 38]. Researchers often use densely annotated VQA datasets [29] or web-based resources [5, 17] to automatically construct MMKGs [35, 4]. Ongoing efforts have demonstrated their effectiveness in tasks such as VQA [34], image classification [41], cross-modal retrieval [56], etc. Additionally, some studies have focused on embedding MMKGs into feature vectors to enhance the reasoning capabilities of MLLMs, yielding impressive results [30, 27].
Notwithstanding these considerable achievements, current works primarily focus on integrating text and image modalities [38] via web data, which may already be embedded in the knowledge of MLLMs. The exploration of additional modalities, such as audio and video, as well as practical applications of recent MLLMs to complex real-world tasks, remains limited. Our meticulously crafted MH-MMKG incorporates multiple modalities alongside complex relations as a knowledge base for a domain-specific task.
2.3 Retrieve Knowledge on Graph
By offering reliable and structured information, retrieval over KGs helps LLMs preserve factual accuracy [28, 15]. Common approaches include retrieving relevant subgraphs [25, 22] or integrating the graph into the model's learning process [37], techniques that have also been extended to MMKGs to improve MLLMs' task completion capabilities [23, 30]. However, training an effective retriever is highly data-intensive, especially for modalities like video. Moreover, adding new KGs into powerful closed-source MLLMs is infeasible. Recent work aims to evoke the model's self-searching ability to autonomously navigate toward the necessary knowledge [45, 48]. These approaches typically involve route planning [26, 45] or self-refinement mechanisms [16, 8] to improve retrieval accuracy. We extend these ideas to MMKGs and propose a multi-agent self-searching method for knowledge retrieval.
3 Datasets
"Monster Hunter" is a popular game series, and we choose "Monster Hunter: World" as our testbed to explore the potential of MLLMs in bridging visual and knowledge gaps using MMKGs.
3.1 MH-MMKG
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Monster Hunter Relationships & Attacks
### Overview
The image is a diagram illustrating relationships between monsters (Rathalos, Pink Rathian, Rathian) and their attacks (Triple Rush, Bite, Triple Fireballs) within the Monster Hunter universe. It also shows elemental weaknesses and resistances, and material connections to weapons. The diagram uses arrows to indicate relationships like "mated with", "subspecies", "attack of", "weakens with", "resistant with", and "material for".
### Components/Axes
The diagram centers around the monster "Rathian". Other key components include:
* **Monsters:** Rathalos, Pink Rathian, Rathian
* **Attacks:** Triple Rush, Bite, Triple Fireballs
* **Elements:** Ice Element, Fire Element
* **Weapon:** Fire Element Weapon
* **Context Boxes:** Provide additional information about attacks.
* **Captions:** Describe the attacks in more detail.
* **Arrows:** Indicate relationships between components.
### Detailed Analysis or Content Details
* **Rathalos & Rathian:** An arrow labeled "mated with" points from Rathalos to Rathian.
* **Pink Rathian & Rathian:** An arrow labeled "subspecies" points from Pink Rathian to Rathian.
* **Ice Element & Pink Rathian:** An arrow labeled "weakens with" points from Ice Element to Pink Rathian. The "..." indicates there may be other weaknesses.
* **Fire Element & Rathian:** An arrow labeled "resistant with" points from Fire Element to Rathian. The "..." indicates there may be other resistances.
* **Rathian & Fire Element Weapon:** An arrow labeled "material for" points from Rathian to Fire Element Weapon.
* **Rathian & Triple Rush:** An arrow labeled "attack of" points from Rathian to Triple Rush. A caption associated with Triple Rush states: "Rathian rushes forward, up three times..."
* **Rathian & Bite:** An arrow labeled "attack of" points from Rathian to Bite. A caption associated with Bite states: "Rathian use a lunge bite..." A context box associated with Bite states: "if hunter hit by this attack..."
* **Rathian & Triple Fireballs:** An arrow labeled "attack of when angry" points from Rathian to Triple Fireballs. A caption associated with Triple Fireballs states: "Rathian throws fireballs from its mouth in a rapid sequence..."
* **Triple Rush & Bite:** An arrow labeled "continues" points from Triple Rush to Bite.
### Key Observations
* Rathalos and Rathian are presented as a breeding pair.
* Pink Rathian is a subspecies of Rathian.
* Rathian is resistant to Fire and weak to Ice.
* Rathian's attacks include Triple Rush, Bite, and Triple Fireballs, with Triple Rush potentially leading into a Bite attack.
* The diagram emphasizes the relationship between monster characteristics (elemental weaknesses/resistances) and the materials they provide for crafting weapons.
### Interpretation
The diagram illustrates the ecological and gameplay relationships within the Monster Hunter world. It demonstrates how monsters are interconnected through breeding, subspecies variations, and elemental affinities. The inclusion of attack details and material connections highlights the strategic elements of the game, where understanding monster weaknesses and utilizing their materials is crucial for progression. The "..." notations suggest a more complex web of relationships and vulnerabilities that are not fully represented in this simplified diagram. The diagram serves as a quick reference guide for players, providing essential information about these specific monsters and their interactions within the game's ecosystem. The context boxes and captions add a layer of narrative and gameplay insight, suggesting the importance of understanding attack patterns and timing for successful hunts.
</details>
Figure 2: The subgraph for the monster "Rathian" in MH-MMKG.
Our MH-MMKG is an attribute-based MMKG [60], as shown in Figure 2, where the names of elements that appear in the game, such as monsters and attack actions, are treated as entities. A pair of entities can be linked with a particular relation, forming an edge between them. An entity may come with additional information, i.e., videos related to the entity and textual context supplying further details, both of which are treated as attributes. Three knowledgeable players were recruited to build the knowledge graph: each player built subgraphs for individual monsters. Figure 2 illustrates the subgraph for "Rathian," showing intricate relationships such as attack conditions, combos, and inter-subgraph connections. We collected 22 subgraphs, which were merged to form MH-MMKG.
Let $\mathcal{G}=(\mathcal{E},\mathcal{V},\mathcal{R})$ denote our whole KG, where $\mathcal{E}$ is the set of entities, $\mathcal{V}$ is the set of edges, and $\mathcal{R}$ is the set of relations. $\mathcal{E}$ consists of two types of entities: 1) monsters in $\mathcal{E}_{\text{o}}$ and 2) other entities in $\mathcal{E}_{\text{a}}$. An edge $v=(e,r,e^{\prime})\in\mathcal{V}$ represents that $e\in\mathcal{E}$ has relation $r\in\mathcal{R}$ with $e^{\prime}\in\mathcal{E}$. The attribute associated with an entity $e$ is denoted as $A(e)=(c,u)$, which comprises a video clip $c$ and/or textual context $u$. A video clip describes the entity $e$, recorded at 60 fps in 4K. As current MLLMs are not able to handle videos directly, the dataset also offers human-selected keyframes (no more than 10 frames) as well as human-written captions about $c$. The knowledgeable players who curated the dataset also selected the keyframes and wrote the captions. The textual context $u$ provides knowledge about the entity that is not included in $c$. We denote the set of all attributes as $\mathcal{A}=\{A(e)\mid e\in\mathcal{E}\}$.
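The structure above can be sketched as a small Python data model. This is our own illustration under stated assumptions (the `Attribute` and `MMKG` classes are hypothetical, not the paper's code), using a toy fragment of the Figure 2 subgraph:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the attribute-based MMKG: G = (E, V, R) with A(e).
@dataclass
class Attribute:
    keyframes: list = field(default_factory=list)  # human-selected keyframes of clip c
    caption: str = ""                              # human-written caption about c
    context: str = ""                              # textual context u

@dataclass
class MMKG:
    entities: set = field(default_factory=set)     # E = E_o (monsters) and E_a (others)
    edges: list = field(default_factory=list)      # V: triples (e, r, e')
    attributes: dict = field(default_factory=dict) # A(e) for each entity e

    def add_edge(self, e, r, e2):
        self.entities.update({e, e2})
        self.edges.append((e, r, e2))

    def neighbors(self, e):
        """One-hop (relation, entity) pairs reachable from e."""
        return [(r, e2) for (h, r, e2) in self.edges if h == e]

# Toy fragment of the "Rathian" subgraph from Figure 2
g = MMKG()
g.add_edge("Rathian", "attack of", "Triple Rush")
g.add_edge("Triple Rush", "continues", "Bite")
```

Storing edges as directed triples keeps the graph easy to traverse for the path-based retrieval that the benchmark requires.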
Table 1: The six sub-tasks in our MH benchmark.

| Sub-task | Description | # Queries |
| --- | --- | --- |
| I: Individual Information | Retrieve the textual information of a monster, e.g., nickname, habitat, and skill mechanics. | 24 |
| II: Attack Recognition | Recognize an upcoming, ongoing, or finished attack action of a monster. | 109 |
| III: Combo Premonition | Predict upcoming attack sequences based on the monster's current action or previous action track. | 28 |
| IV: Condition Awareness | Detect the status of monsters or the surrounding environment, e.g., anger, terrain, or phase changes, to anticipate future battle. | 29 |
| V: Proc Effect Insight | Analyze effects such as those of the environment and attacks on monster status and movement patterns. | 35 |
| VI: Cross Monster Analysis | Compare attack patterns or behaviors across different monsters to optimize hunting strategies. | 13 |
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Monster Hunter Attack Action Flow
### Overview
The image depicts a diagram illustrating the process of understanding and predicting attack actions of a monster, specifically the "Stygian Zinogre" from the Monster Hunter series. The diagram shows a flow of information from perceiving a monster's action, retrieving knowledge, and predicting the next steps in the attack sequence. It appears to be a conceptual model for an AI or multi-agent system designed to analyze monster behavior.
### Components/Axes
The diagram is divided into four main sections, arranged horizontally from left to right:
1. **Perceiver:** Handles initial perception of the monster's action.
2. **Knowledge Retrieval & Expansion:** Processes the perceived action to retrieve and expand relevant knowledge.
3. **Attack Phase Prediction:** Identifies the current attack phase and predicts subsequent actions.
4. **Multi-agents Retriever:** Utilizes multiple agents to validate and refine the prediction.
Key labels include: "Caption", "Topic Selection", "Validation", "Expansion", "Textualize", "Sufficient", "Phase", "Attk of", "Follows".
The monster is labeled "Zinogre" and "Stygian Zinogre".
Attack actions include: "Headbutt", "Devour", "Heavy Pawslam", "Double Slam", "Charging Phase", "Charged Phase", "Super Charged".
### Detailed Analysis or Content Details
**1. Perceiver (Leftmost Section):**
* An image of the Stygian Zinogre is present, with a dashed outline.
* A speech bubble contains the text: "Zinogre raises it right claw, move it to the left part of the body and put it firmly against the ground on left..."
* A controller icon with a question mark and the text: "Tell me what will happen next within this attack action?"
* A summarizer icon with the text: "Zinogre will jump and slams the ground, and then repeat the attack action again."
**2. Knowledge Retrieval & Expansion (Center-Left Section):**
* "Topic Selection" is labeled above a selection process.
* "Zinogre" is shown with a selection indicator.
* "Validation" and "Expansion" are labeled as processes.
* "Retrieved Knowledge" is represented by a block of vertical lines.
* "Textualize" is labeled, leading to a location marker icon.
* "Sufficient" is labeled, connecting to the next section.
**3. Attack Phase Prediction (Center-Right Section):**
* "Stygian Zinogre" is labeled.
* "Charging Phase" leads to "Charged Phase" which leads to "Super Charged".
* "Attk of" (Attack of) is labeled, connecting "Charged Phase" and "Super Charged" to "Double Slam".
* "Phase" is labeled, connecting "Charging Phase" to "Charged Phase".
**4. Multi-agents Retriever (Rightmost Section):**
* "Headbutt", "Devour", and "Heavy Pawslam" are labeled as potential attack actions.
* "Attk of" (Attack of) is labeled, connecting "Headbutt" and "Devour" to the next stage.
* "Sufficient" is labeled, connecting the attack actions to a flow diagram with location marker icons and arrows labeled "Follows".
### Key Observations
* The diagram illustrates a sequential process, starting with perception and culminating in attack prediction.
* The "Sufficient" label suggests a threshold or confidence level is reached before proceeding to the next stage.
* The multi-agent retriever section indicates that multiple potential attack actions are considered and evaluated.
* The diagram focuses on the Stygian Zinogre, suggesting it's a specific case study or example.
* The flow is largely linear, with feedback loops implied by the "Validation" and "Expansion" processes.
### Interpretation
This diagram represents a system designed to understand and predict the attack patterns of a monster in a game context. The system takes visual input (the monster's action), retrieves relevant knowledge about the monster and its attacks, and then predicts the next steps in the attack sequence. The multi-agent retriever component suggests a probabilistic approach, where multiple possible attacks are considered and ranked based on their likelihood. The "Sufficient" label implies a confidence threshold that must be met before a prediction is considered reliable.
The diagram highlights the complexity of monster behavior and the need for a sophisticated system to accurately predict attacks. It suggests a combination of visual perception, knowledge retrieval, and probabilistic reasoning is required to achieve this goal. The diagram is likely a conceptual model for an AI system or a component within a larger game AI framework. The use of terms like "Validation" and "Expansion" suggests a learning component, where the system can refine its knowledge and improve its prediction accuracy over time. The diagram is a high-level overview and does not provide details on the specific algorithms or techniques used in each stage.
</details>
Figure 3: Our method first converts the query media into a textual caption. Next, a multi-agent self-search mechanism retrieves relevant knowledge associated with the query. Finally, the retrieved knowledge is utilized to enhance the model's response. The fast-forward icon means the search continues, and the SD-card icon represents the aggregated knowledge for the current entity.
3.2 MH Benchmark
We also designed 238 carefully crafted queries involving knowledge within MH-MMKG. They cover six challenging sub-tasks, as shown in Table 1 (some samples are available in Figure 5). As shown in Figure 3, each input query $Q=(q,d,z)$ consists of 1) a question $q$, 2) a video or images $d$, which serves as a visual reference for $q$, and 3) auxiliary information $z$, which contains the relevant monster's name $e\in\mathcal{E}_{o}$ as well as additional information that cannot be inferred purely from $d$. The auxiliary information $z$ provides additional context to an MLLM for finding the correct subgraph. By default, we assume that $z$ is available, as it facilitates knowledge retrieval and, in real scenarios, players generally know it; we analyze its impact in the supplementary material. Note that the visual data in $Q$ is different from that in MH-MMKG.
To answer $q$ without built-in knowledge about the game's world, a model needs to traverse multiple entities and edges of $\mathcal{G}$. For example, to answer the question "Which attack follows?" given a visual context in which the monster Rathian performs the attack action Triple Rush, the required knowledge is:
$$
\displaystyle\textit{Rathian}\xrightarrow{\text{attack of}}\textit{Triple Rush}\xrightarrow{\text{continues}}\textit{Bite}, \tag{1}
$$
as shown in Figure 2. Therefore, finding the correct paths over $\mathcal{G}$ is essential for this task. Each query is annotated with a subgraph $\mathcal{I}$ of $\mathcal{G}$ as the ground-truth knowledge for answering $q$, textual descriptions $s$ of $d$ written by the knowledgeable game players, and the ground-truth answer $y$. $\mathcal{I}$ comes with a root entity $e_{0}$, and the paths from $e_{0}$ to each leaf are deemed pieces of knowledge. The details of MH-MMKG and the MH benchmark are available in the supplementary material.
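As a concrete illustration, the root-to-leaf paths of an annotated subgraph $\mathcal{I}$ can be enumerated with a short recursion. This is our own sketch over (head, relation, tail) triples, not the paper's implementation, and it assumes the subgraph is acyclic:

```python
# Enumerate all root-to-leaf paths in a subgraph stored as
# (head, relation, tail) triples; each path alternates entities and relations.
def paths_from_root(triples, root):
    def walk(path, node):
        nexts = [(r, t) for (h, r, t) in triples if h == node]
        if not nexts:               # leaf reached: this path is one piece of knowledge
            return [path]
        results = []
        for r, t in nexts:
            results += walk(path + [r, t], t)
        return results
    return walk([root], root)

# The Eq. (1) example: Rathian --attack of--> Triple Rush --continues--> Bite
subgraph = [("Rathian", "attack of", "Triple Rush"),
            ("Triple Rush", "continues", "Bite")]
paths = paths_from_root(subgraph, "Rathian")
# paths == [["Rathian", "attack of", "Triple Rush", "continues", "Bite"]]
```

The single enumerated path is exactly the knowledge chain of Eq. (1).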
3.3 Task Definition
We define three task variants over the MH benchmark with different levels of supplementary information.
Knowledgeable (Know.) variant simulates a model with sufficient perception for the game's visual domain and sufficient (built-in) knowledge about the world to answer the given question. This idealized setting assesses how well the model can reason the answer $\hat{y}$ from perfect external knowledge and perception. Together with $Q$ and $\mathcal{G}$, we provide the model with 1) the annotated subgraph $\mathcal{I}$, 2) the textual description $s$ of $Q$'s visual reference $d$, and 3) the set of all human-written captions for clips $c$ in $\mathcal{G}$, so that it can access the ones associated with entities in $\mathcal{I}$ without visual perception, i.e.:
$$
\hat{y}=M(Q,s,\mathcal{I},\mathcal{G},\mathcal{A}). \tag{2}
$$
Perceptive variant also assumes a model's sufficient perception but no built-in knowledge to answer the question. This setting evaluates the model's knowledge retrieval ability (i.e., finding $\mathcal{I}$ ) and knowledge comprehension ability. The model is supplied with human-written captions for all $c$ so that it can find textual descriptions associated with entities, and is required to output both the final response $\hat{y}$ and the retrieved $\hat{\mathcal{I}}$ :
$$
\hat{y},\hat{\mathcal{I}}=M(Q,s,\mathcal{G},\mathcal{A}). \tag{3}
$$
Unaided is the most challenging variant, where the model relies entirely on its own visual perception (for both $d$ and $c$ ) to interpret the visual world and retrieve external knowledge. As with the perceptive variant, the outputs are both $\hat{y}$ and $\hat{\mathcal{I}}$ . This variant is formulated as:
$$
\hat{y},\hat{\mathcal{I}}=M(Q,\mathcal{G}). \tag{4}
$$
4 Method
Figure 3 illustrates the overall pipeline of our method. It is designed as a baseline for the MH Benchmark; we describe it for the unaided variant, but it can be easily adapted to the others. Given an input query $Q$ , the visual reference $d\in Q$ is first converted into a textual caption by a perceiver. Then, a multi-agent retriever, consisting of topic selection, validation, and expansion agents, retrieves relevant knowledge for $Q$ . Finally, a summarizer generates $\hat{y}$ using the retrieved knowledge.
The perceiver $P$ is designed to transform $d$ in $Q$ into a textual description. While retaining the original visual data could provide richer information, on-the-fly perception in the visual modality can incur high computational costs; therefore, we convert $d$ into text as a more efficient representation. The transformed text is denoted as $\hat{s}=P(Q)$ . We detail the multi-agent retriever and summarizer in the following sections.
4.1 Multi-Agent Retriever
We design a fully automated search algorithm with three agents to find the subgraph of $\mathcal{G}$ needed to answer $q$ . First, the topic selection agent $L$ analyzes the input question $q$ to identify the topic entity $e_{0}=L(q,z,\mathcal{E}_{o})$ , which serves as the root for knowledge retrieval. Then, the expansion and validation agents are alternately activated to grow the subgraph with breadth-first search, starting from $e_{0}$ , to include the necessary knowledge. The expansion agent finds all plausible neighboring entities for each open entity. The validation agent, in turn, validates whether the current path from the root to each open entity is sufficient to answer $q$ .
We denote the subgraph of $\mathcal{G}$ after the $t$ -th round by $\mathcal{K}_{t}=(\mathcal{E}^{\prime}_{t},\mathcal{V}^{\prime}_{t})$ , where $\mathcal{E}^{\prime}_{t}\subseteq\mathcal{E}$ , $\mathcal{V}^{\prime}_{t}\subseteq\mathcal{V}$ , and $\mathcal{K}_{0}=(\{e_{0}\},\emptyset)$ . We also denote the set of open entities after the $t$ -th round by $\mathcal{O}_{t}\subseteq\mathcal{E}^{\prime}_{t}$ , which is a (sub)set of the entities newly added at the $t$ -th round and requires further expansion.
For the $(t+1)$ -th round, the expansion agent $W$ retrieves the set $\mathcal{N}(e)$ of all plausible neighboring entities for each open entity $e\in\mathcal{O}_{t}$ as:
$$
\displaystyle\mathcal{N}(e)=W(e,\mathcal{K}_{t};Q,\mathcal{G},\mathcal{A}), \tag{5}
$$
where $W$ is an MLLM with a designated prompt (detailed in the supplementary material) to judge if the knowledge aggregated over the path on $\mathcal{G}$ from $e_{0}$ to each $e^{\prime}\in\{e^{\prime}|(e,e^{\prime})\in\mathcal{V}\}$ is useful to answer $q$ . $\mathcal{N}(e)$ is the set of entities judged to be useful. To build $\mathcal{K}_{t+1}=(\mathcal{E}^{\prime}_{t+1},\mathcal{V}^{\prime}_{t+1})$ , we first aggregate all neighboring entities and add them to $\mathcal{E}^{\prime}_{t}$ as:
$$
\displaystyle\mathcal{E}^{\prime}_{t+1}=\mathcal{E}^{\prime}_{t}\cup\{e^{\prime}|e\in\mathcal{O}_{t},e^{\prime}\in\mathcal{N}(e)\}, \tag{6}
$$
where duplicated entities are removed to make entities in $\mathcal{E}^{\prime}_{t+1}$ unique. $\mathcal{V}^{\prime}_{t+1}$ includes additional links to newly added entities, given by:
$$
\displaystyle\mathcal{V}^{\prime}_{t+1}=\mathcal{V}^{\prime}_{t}\cup\{\nu(e,e^{\prime})|e\in\mathcal{O}_{t},e^{\prime}\in\mathcal{N}(e)\}, \tag{7}
$$
where $\nu(e,e^{\prime})$ gives the $v\in\mathcal{V}$ identified by $(e,e^{\prime})$ .
The validation agent $U$ checks if each path from $e_{0}$ to $e\in\mathcal{O}_{t+1}$ provides sufficient knowledge to answer $q$ :
$$
\displaystyle o(e)=U(e,\mathcal{K}_{t+1};Q,\mathcal{G},\mathcal{A}), \tag{8}
$$
where $U$ again is an MLLM with a carefully designed prompt. $o(e)=\textit{Yes}$ if the knowledge is good enough and no more expansion is needed; otherwise, $o(e)=\textit{No}$ and the expansion continues. In the unaided-online setting, $U$ also outputs a caption for $c$ of the entity $e$ .
The open set $\mathcal{O}_{t+1}$ is basically the set of entities newly added in the $(t+1)$ -th round. If a newly added entity $e$ is already in $\mathcal{E}^{\prime}_{t}$ and also in $\mathcal{O}_{t}$ , the subgraph forms a loop. In this case, $e$ does not need further expansion, as it is already in $\mathcal{E}^{\prime}_{t}$ . Meanwhile, $e$ may be included in the neighbor sets of multiple entities $e\in\mathcal{O}_{t}$ . In this case, the subgraph also forms a loop, but $e$ has not yet been explored, so it should be in $\mathcal{O}_{t+1}$ . Taking these cases into account, $\mathcal{O}_{t+1}$ is given by:
$$
\displaystyle\mathcal{O}_{t+1}=\mathcal{E}^{\prime}_{t+1}\backslash\mathcal{E}^{\prime}_{t}, \tag{9}
$$
where the operator â $\backslash$ â represents set subtraction.
This multi-agent retrieval runs until no open entity is found (i.e., $\mathcal{O}_{t}=\emptyset$ ). The retrieved subgraph is given by:
$$
\hat{\mathcal{I}}=\mathcal{K}_{t}. \tag{10}
$$
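One way to realize the loop of Eqs. (5)-(10) is sketched below, with the three agents abstracted as plain callables (all names are illustrative; in the actual method each agent is an MLLM driven by a prompt, and prompt details are omitted):

```python
def retrieve_subgraph(q, z, select_topic, expand, validate):
    """Breadth-first multi-agent retrieval sketch.

    `select_topic(q, z)` returns the topic entity e0; `expand(e, path)`
    returns the neighbors of e judged useful for q (Eq. 5); `validate(e,
    path)` returns True when the path from e0 to e suffices to answer q.
    """
    e0 = select_topic(q, z)                    # topic selection agent
    entities, links = {e0}, set()              # K_t = (E'_t, V'_t)
    open_set, paths = {e0}, {e0: (e0,)}        # O_t and root-to-e paths
    while open_set:                            # stop when O_t is empty
        new_entities = set()
        for e in open_set:
            if validate(e, paths[e]):          # validation agent: o(e) = Yes
                continue
            for n in expand(e, paths[e]):      # expansion agent
                links.add((e, n))              # Eq. (7)
                if n not in entities and n not in new_entities:
                    new_entities.add(n)        # Eq. (6); duplicates skipped
                    paths[n] = paths[e] + (n,)
        entities |= new_entities
        open_set = new_entities                # Eq. (9): O_{t+1} = E'_{t+1} \ E'_t
    return entities, links                     # I_hat = K_t (Eq. 10)
```

Already-visited entities are never re-opened, which handles the loop cases discussed above, while an entity reached from multiple parents in the same round is expanded exactly once.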
4.2 Reasoning via Knowledge Augmentation
For answer reasoning, we aggregate the knowledge on $\hat{\mathcal{I}}$ and represent it in text, which is fed into an LLM. The knowledge augments the reasoning process: the LLM does not need to rely on built-in knowledge about the game's world; (almost) pure reasoning ability is all that is required. This is formally denoted by:
$$
\hat{y}=\text{MLLM}(Q,\aleph(\hat{\mathcal{I}},\mathcal{G},\mathcal{A},\alpha)), \tag{11}
$$
where $\hat{y}$ is the predicted answer for $Q$ , and $\aleph$ transforms each path from $e_{0}$ to each leaf of $\hat{\mathcal{I}}$ into text given the entire knowledge (i.e., $\mathcal{G}$ and $\mathcal{A}$ ). The parameter $\alpha$ limits the number of paths in $\hat{\mathcal{I}}$ used for reasoning (default: 5).
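A minimal sketch of the aggregation step $\aleph$, assuming paths are stored as alternating entity/relation tuples and captions as an entity-to-text mapping (both representations are our own choices for illustration):

```python
def aleph(paths, captions, alpha=5):
    """Turn up to `alpha` root-to-leaf paths into text for the prompt.

    `paths` is a list of (entity, relation, entity, ...) tuples; `captions`
    maps entities to their textual descriptions (standing in for A).
    """
    lines = []
    for path in paths[:alpha]:                  # alpha caps the path count
        step = " -> ".join(path)
        # attach captions of the entities (even positions) on the path
        desc = "; ".join(captions[e] for e in path[::2] if e in captions)
        lines.append(f"{step} ({desc})" if desc else step)
    return "\n".join(lines)
```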
5 Results
Table 2: The experimental settings of query and knowledge retrieval. "Vision" indicates that $d$ is used, while "H-Cap." uses $s$ . For MMKG, "Path" means a model can access the annotated subgraphs $\mathcal{I}$ , and "H-Cap." means a model uses human-written captions (for $c$ ) in $\mathcal{A}$ . "Vis.-Off." and "Vis.-On." represent how a model's captions are generated (offline and online).
| Methods | Query: Vision | Query: H-Cap. | MMKG: Path | MMKG: H-Cap. | MMKG: Vis.-Off. | MMKG: Vis.-On. |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | ✓ | | | | | |
| Vanilla + | | ✓ | | | | |
| Know. | | ✓ | ✓ | ✓ | | |
| Perceptive | | ✓ | | ✓ | | |
| Unaided-Offline | ✓ | | | | ✓ | |
| Unaided-Online | ✓ | | | | | ✓ |
Table 3: Experimental results for both leading closed-source and open-source MLLMs. The Vanilla, Vanilla +, and Knowledgeable (Know.) experiments are evaluated solely on Acc., while all other experiments are assessed on both Acc. and knowledge consistency. Results that improve in the Online setting over the Offline setting are highlighted in light green.
| Models | Vanilla Acc. | Vanilla + Acc. | Know. Acc. | Perc. Acc. | Perc. Pre. | Perc. Rec. | Un.-Off. Acc. | Un.-Off. Pre. | Un.-Off. Rec. | Un.-On. Acc. | Un.-On. Pre. | Un.-On. Rec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [1] | .3122 | .3924 | .8565 | .7383 | .5061 | .7046 | .4050 | .2595 | .4416 | .5105 | .2756 | .5625 |
| GPT-4o mini [1] | .3218 | .4135 | .8481 | .6877 | .2963 | .5450 | .4514 | .2059 | .5028 | .3544 | .1626 | .3009 |
| Claude 3.7 Sonnet [2] | .2827 | .3375 | .8987 | .7004 | .5817 | .6322 | .3628 | .2775 | .3270 | .4388 | .2911 | .4029 |
| Claude 3.5 Sonnet [2] | .2869 | .3755 | .8776 | .7215 | .5922 | .6800 | .3966 | .3215 | .4008 | .3966 | .2330 | .3270 |
| Claude 3.5 Haiku [2] | .2356 | .3206 | .8823 | .6455 | .3739 | .5007 | .3544 | .2002 | .3164 | .3670 | .1735 | .3361 |
| Gemini 2.0 Flash [47] | .1983 | .2995 | .8438 | .6919 | .3507 | .6146 | .3839 | .1703 | .4092 | .3713 | .1515 | .3663 |
| Gemini 1.5 Pro [47] | .2194 | .2700 | .8438 | .6962 | .4761 | .6033 | .3164 | .1615 | .2194 | .4050 | .2122 | .4585 |
| Step-1o [43] | .2436 | .2815 | .8235 | .5747 | .4372 | .5095 | .3025 | .1831 | .2483 | .3403 | .2204 | .2987 |
| InternVL2.5-78B-MPO [12] | .1603 | .2616 | .8649 | .5991 | .4198 | .5428 | .2700 | .1729 | .2250 | .3080 | .1556 | .2378 |
| Qwen2.5-VL-72B [3] | .1476 | .2616 | .8734 | .6244 | .4602 | .4908 | .3206 | .2139 | .2383 | .3164 | .1615 | .1814 |
| Ovis2-16B [36] | .1645 | .2573 | .8902 | .7046 | .5407 | .5949 | .2869 | .1963 | .2383 | .3459 | .1853 | .2878 |
| MiniCPM-o-2.6 [53] | .1139 | .2405 | .8312 | .4683 | .3183 | .4001 | .2194 | .1311 | .2376 | .1687 | .0189 | .0210 |
| DeepSeek-VL2-Small [52] | .1139 | .2362 | .6455 | .3586 | .1419 | .1708 | .1814 | .0400 | .0759 | .1181 | .0042 | .0042 |
| Human (Knowledgeable) | .5252 | – | – | – | – | – | – | – | – | .9033 | .9207 | .8535 |
| Human (Random) | .0336 | – | – | – | – | – | – | – | – | .6092 | .7113 | .6457 |
5.1 Experimental Settings
We evaluated both leading closed-source models (accessed via API: gpt-4o-2024-11-20, gpt-4o-mini-2024-07-18 [1], claude-3-7-sonnet-20250219, claude-3-5-sonnet-20240620, claude-3-5-haiku-20241022 [2], gemini-1.5-pro-002, gemini-2.0-flash-001 [47], and step-1o-vision-32k [43]) and open-source models (InternVL2.5-78B-MPO [12], Qwen2.5-VL-72B [3], Ovis2-16B [36], and DeepSeek-VL2 [52]) with our baseline over the MH Benchmark. A single model serves as all agents and functionalities (i.e., the perceiver, the summarizer, as well as the topic selection, validation, and expansion agents in the multi-agent retriever) via different prompts. We use images solely as visual references because current MLLMs do not support videos as input. Due to input size limitations in most models, all images are resized to 1K for our experiments.
Evaluation Metrics. We use accuracy (Acc.) to measure the correctness of predicted answers $\hat{y}$ against the ground-truth answer $y$ . Since the questions in our benchmark follow an open-ended question-answering format, we employ GPT-4o as a judge with few-shot samples [59]. Knowledge consistency measures the consistency between $\hat{\mathcal{I}}$ and $\mathcal{I}$ using precision (Pre.) and recall (Rec.), reflecting how effectively relevant knowledge is retrieved. A human evaluation of GPT-4o as a judge and the metric definitions are provided in the supplementary material.
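Assuming the standard set-based formulation over knowledge paths (the paper's exact definition is in its supplementary material), knowledge consistency can be computed as:

```python
def knowledge_consistency(retrieved, ground_truth):
    """Precision/recall between retrieved (I_hat) and annotated (I) paths.

    Both arguments are collections of hashable path identifiers; this is
    a sketch of the common set formulation, not the paper's exact code.
    """
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    hits = len(retrieved & ground_truth)       # correctly retrieved paths
    pre = hits / len(retrieved) if retrieved else 0.0
    rec = hits / len(ground_truth) if ground_truth else 0.0
    return pre, rec
```

Under this formulation, over-retrieval (many extra paths) lowers precision while leaving recall intact, which matches the pattern reported in Table 3.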
5.2 Performance of MLLMs
In addition to the three tasks defined in Section 3.3, we evaluate the models' performance without external knowledge (vanilla). Vanilla + isolates the effect of the model's visual ability. Furthermore, the Unaided task includes two variants: 1) the Offline variant relies on pre-extracted captions (for all $c$ in the MMKG) from an MLLM without access to the query $Q$ or any information obtained during knowledge retrieval; 2) the Online variant generates captions for $c$ during knowledge retrieval (i.e., in $U$ ). Table 2 summarizes the experimental setups. We also include human performance for the vanilla and unaided-online settings. We recruited two voluntary participants: one is knowledgeable, and the other has not played the game series. They first answered the questions only with $Q$ and were then allowed to look into $\mathcal{G}$ .
As shown in Table 3, in the vanilla setting, all methods hardly predicted correct answers. GPT-4o achieves the best performance with an accuracy of 0.31, whereas the knowledgeable human attains 0.53. The closed-source models generally outperform open-source ones, implying that the closed-source models have richer built-in knowledge about the game. With the vanilla + setting, which provides human captions $s$ instead of images $d$ , most models' performance improved, which means that the MH Benchmark requires comprehension of visual context. In the knowledgeable experiment, most models achieved accuracies above 0.82. We thus argue that MH-MMKG provides sufficient knowledge to answer the questions in the MH Benchmark.
Figure 4: Performance comparison of GPT-4o, Claude 3.7 Sonnet, and Gemini 1.5 Pro across 6 sub-tasks in MH Benchmark.
Figure 5: Examples of the 6 sub-tasks in the MH Benchmark, each generated by GPT-4o for both the Vanilla Answer and the Augmented Answer using an unaided-online retrieval.
The perceptive, unaided-offline, and unaided-online settings evaluate a model's ability to retrieve relevant knowledge. In the perceptive setting, which uses human captions for both queries and knowledge, all models demonstrated a strong capacity to identify relevant knowledge, significantly improving performance compared to vanilla +. Additionally, the models exhibit high knowledge consistency. In the unaided-offline setting, all models performed worse than in the perceptive setting, suggesting that visual perception plays a crucial role in knowledge retrieval. Nevertheless, they are still better than in the vanilla setting, demonstrating the benefits of knowledge retrieval. The unaided-online setting further challenges the models' visual perception to generate captions with richer context (i.e., $Q$ and the knowledge on the path in $\hat{\mathcal{I}}$ ) compared to the unaided-offline setting. Our results show that only some models surpassed the offline variant (highlighted in green), yet the improvement is evident: GPT-4o improves by 0.1, and Claude 3.7 Sonnet by 0.07. This suggests that knowing what to do enhances a model's visual perception, strengthening its planning ability and ultimately improving its reasoning. Additionally, we find that recall is consistently higher than precision, suggesting that models tend to over-retrieve knowledge. In contrast, humans showed higher precision (though both their precision and recall are high compared to MLLMs).
We also analyzed the performance differences in the unaided-online setting across the six sub-tasks for GPT-4o, Claude 3.7 Sonnet, and Gemini 1.5 Pro, as illustrated in Figure 4. GPT-4o achieves the best performance in all sub-tasks except I and VI. It is particularly strong in sub-tasks IV and V, which rely more on fine-grained visual perception. Additionally, sub-task I is the simplest, involving only a single relationship $v$ , on which the models generally perform well. For sub-task VI, although the accuracy is high, models fail to find the correct path (the model uses its inherent knowledge to respond). These results suggest that visual perception remains a key challenge for the MH Benchmark.
Figure 5 presents six examples (one per sub-task) in the vanilla and unaided-online settings, where all predictions are by GPT-4o. In the retrieved-path panes, the solid green paths are in both $\mathcal{I}$ and $\hat{\mathcal{I}}$ (i.e., correctly retrieved paths), the dotted green paths are in $\mathcal{I}$ but not in $\hat{\mathcal{I}}$ , and the black paths are in $\hat{\mathcal{I}}$ but not in $\mathcal{I}$ . The key relationship or essential information for deriving the correct answer is highlighted in orange. We report the precision and recall metrics for each example. GPT-4o retrieves many knowledge paths, capturing all relevant knowledge for prediction, though precision is low. Interestingly, for challenging cases like sub-task II, GPT-4o successfully recognized detailed visual concepts (e.g., the monster flying backward and the monster sticking to the wall) and generated plausible captions, which require fine-grained visual perception.
5.3 Analysis of Factors Affecting Performance
Impact of Captioning Ability. Table 4 compares captioning performance, where the similarity metric (Sim.) measures the similarity between generated and human captions using GPT-4o (see the supplementary material for details). The table also summarizes the accuracy and knowledge consistency scores. The results indicate that the unaided-online setting consistently gives higher similarity scores than the unaided-offline setting, suggesting that awareness of the query $Q$ and the retrieval process enhances captioning performance. The similarity scores also correlate positively with the reasoning accuracy scores. We further evaluated the unaided-offline variant on GPT-4o with its offline captions replaced by ones from the video models InternVideo2.5 [50] and VideoChat-Flash [32]. The results showed lower performance than the original GPT-4o across all metrics, indicating that current video models have limited visual comprehension capability for the MH Benchmark.
Table 4: Reasoning and knowledge consistency performance and captioning performance. Improved performance scores due to online captioning are highlighted in green.
| Model | Vis.-Off. | Vis.-On. | Acc. | Pre. | Rec. | Sim. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [1] | ✓ | | .4050 | .2595 | .4416 | .2806 |
| GPT-4o [1] | | ✓ | .5105 | .2756 | .5625 | .2948 |
| Claude 3.7 Sonnet [2] | ✓ | | .3628 | .2775 | .3270 | .2776 |
| Claude 3.7 Sonnet [2] | | ✓ | .4388 | .2911 | .4029 | .3208 |
| Gemini 1.5 Pro [47] | ✓ | | .3164 | .1615 | .2194 | .1608 |
| Gemini 1.5 Pro [47] | | ✓ | .4050 | .2122 | .4585 | .1746 |
| InternVideo2.5 [50] | ✓ | | .3697 | .1960 | .2959 | .0525 |
| VideoChat-Flash [32] | ✓ | | .3445 | .2135 | .2863 | .0644 |
Keyframe Selection. Due to the limited number of input tokens, current MLLMs cannot handle full videos, so keyframe sampling is necessary. MH-MMKG provides human-selected keyframes to facilitate evaluation on this point. To show the difference between human-selected keyframes and the typically adopted equal-interval sampling, Table 5 summarizes the results for GPT-4o, where sampling is done at two frames per second with the number of frames capped at ten. The keyframe selection strategy does impact performance: human-selected keyframes seem to provide more informative visual cues, though the difference is not substantial.
Table 5: Impact of keyframe selection.
| Keyframes | Off. Acc. | Off. Pre. | Off. Rec. | On. Acc. | On. Pre. | On. Rec. |
| --- | --- | --- | --- | --- | --- | --- |
| Human | .4050 | .2595 | .4416 | .5105 | .2756 | .5625 |
| Sampling | .3725 | .2199 | .3893 | .4840 | .2388 | .5254 |
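The equal-interval baseline in Table 5 can be sketched as follows; how the 2 fps rate and the ten-frame cap interact is our assumption, and the function name and signature are illustrative:

```python
def sample_frames(num_frames, fps, rate=2.0, max_frames=10):
    """Equal-interval keyframe sampling: `rate` frames per second of
    video, capped at `max_frames` frames, returned as frame indices."""
    duration = num_frames / fps
    n = min(max_frames, max(1, int(duration * rate)))
    step = num_frames / n                      # spread picks evenly
    return [min(num_frames - 1, int(i * step)) for i in range(n)]
```

For a 10-second clip at 30 fps, the 2 fps rate would ask for 20 frames, so the cap kicks in and 10 evenly spaced frames are returned.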
(a) Using K paths as knowledge.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison
### Overview
This bar chart compares the accuracy of three large language models (GPT-4o, Claude 3.7, and Gemini Pro) using two different search strategies: Breadth-First Search (BFS) and Depth-First Search (DFS). Accuracy is measured on the y-axis, and the models are displayed on the x-axis. Within each model's bar grouping, there are two bars representing BFS and DFS, and two marker types representing "Pre." and "Re.".
### Components/Axes
* **X-axis:** "Models" with categories: GPT-4o, Claude 3.7, Gemini Pro.
* **Y-axis:** "Accuracy" ranging from approximately 0.2 to 0.6.
* **Legend:**
* Green bar: BFS (Breadth-First Search)
* Red bar: DFS (Depth-First Search)
* Yellow star: Pre. (Precision)
* White circle: Re. (Recall)
### Detailed Analysis
The chart consists of three groups of bars, one for each model. Within each group, there's a green bar for BFS and a red bar for DFS. Superimposed on each bar group are yellow stars ("Pre.") and white circles ("Re.").
**GPT-4o:**
* BFS: The green bar reaches approximately 0.52 accuracy. A white circle ("Re.") is positioned at approximately 0.48 accuracy, and a yellow star ("Pre.") is at approximately 0.32 accuracy.
* DFS: The red bar reaches approximately 0.47 accuracy. A white circle ("Re.") is positioned at approximately 0.42 accuracy, and a yellow star ("Pre.") is at approximately 0.30 accuracy.
**Claude 3.7:**
* BFS: The green bar reaches approximately 0.44 accuracy. A white circle ("Re.") is positioned at approximately 0.43 accuracy, and a yellow star ("Pre.") is at approximately 0.32 accuracy.
* DFS: The red bar reaches approximately 0.38 accuracy. A white circle ("Re.") is positioned at approximately 0.40 accuracy, and a yellow star ("Pre.") is at approximately 0.32 accuracy.
**Gemini Pro:**
* BFS: The green bar reaches approximately 0.34 accuracy. A white circle ("Re.") is positioned at approximately 0.40 accuracy, and a yellow star ("Pre.") is at approximately 0.26 accuracy.
* DFS: The red bar reaches approximately 0.32 accuracy. A white circle ("Re.") is positioned at approximately 0.36 accuracy, and a yellow star ("Pre.") is at approximately 0.24 accuracy.
### Key Observations
* GPT-4o consistently demonstrates the highest accuracy for both BFS and DFS strategies.
* BFS generally outperforms DFS across all three models, although the difference is more pronounced for GPT-4o.
* The "Re." (Recall) values are consistently higher than the "Pre." (Precision) values for each model and search strategy.
* Gemini Pro exhibits the lowest accuracy for both search strategies.
### Interpretation
The data suggests that GPT-4o is the most accurate model among the three tested, regardless of the search strategy employed. The consistent outperformance of BFS indicates that a broader search approach is more effective for these models in the context of the task being evaluated. The higher recall values compared to precision values suggest that the models are better at identifying relevant items (high recall) but may also include some irrelevant items (lower precision). The relatively low accuracy of Gemini Pro suggests it may require further optimization or is less suited for this particular task. The separation of "Pre." and "Re." markers provides insight into the trade-offs between precision and recall for each model and search strategy. The consistent placement of the "Pre." markers lower than the "Re." markers suggests a general tendency towards higher recall at the expense of precision.
</details>
(b) BFS vs DFS
Figure 6: Ablation experiments for proposed multi-agents search.
The Number of Paths Used in Reasoning. In our experiments, the number of paths in $\hat{\mathcal{I}}$ used for reasoning is limited. All evaluations so far used the first 5 retrieved paths; since retrieval returns shorter paths first, this choice favors shorter paths, and the number of paths can affect performance. Figure 6(a) shows the relationship between the number of paths and the reasoning accuracy: GPT-4o gives the optimal performance when 5 paths are used. We also show in the figure the number of queries $Q$ for which at least a certain number of paths are retrieved. GPT-4o tends to retrieve more paths as knowledge (its recall is higher), while Claude 3.7 is more conservative in retrieval (its precision is higher).
BFS versus DFS. The retrieval strategy can also impact performance. BFS finds shorter paths first, while depth-first search (DFS) can find longer paths at an early stage of retrieval. We evaluated BFS (the proposed baseline) and DFS when three paths are used. As shown in Figure 6(b), BFS consistently performs better. This suggests that our queries can be answered within a few hops in $\mathcal{G}$, which can be seen as a limitation of our MH Benchmark: it requires relatively few steps of reasoning.
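To make the two strategies concrete, here is a minimal sketch of enumerating the first K paths from a topic entity under either strategy. The toy graph, entity names, and the `first_k_paths` helper are our own illustration under the paper's relation naming, not the actual retriever implementation:

```python
from collections import deque

def first_k_paths(graph, start, k, strategy="bfs"):
    """Return up to k entity/relation paths starting from `start`.

    `graph` maps an entity to a list of (relation, neighbor) pairs.
    BFS pops the frontier from the front (shortest paths first);
    DFS pops from the back (deeper paths can appear early).
    """
    frontier = deque([[start]])
    paths = []
    while frontier and len(paths) < k:
        path = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if len(path) > 1:              # record every non-trivial path
            paths.append(path)
        for rel, nxt in graph.get(path[-1], []):
            if nxt not in path:        # avoid revisiting entities
                frontier.append(path + [rel, nxt])
    return paths

# Toy subgraph with illustrative entities and relations:
g = {
    "Zinogre": [("is mostly weaken with", "Ice"),
                ("has attack action of", "Charged Slash"),
                ("has attack action of", "Tail Slam")],
    "Tail Slam": [("continues with attack action of", "Body Slam")],
}
# BFS yields the three one-hop paths first; DFS surfaces the
# two-hop combo path before exhausting the one-hop paths.
bfs = first_k_paths(g, "Zinogre", 3, "bfs")
dfs = first_k_paths(g, "Zinogre", 3, "dfs")
```

Swapping `popleft` for `pop` turns the shortest-first frontier into a deepest-first one, which is why DFS can return a longer combo path within the first K results while BFS keeps paths ordered by hop count.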
6 Conclusion
This work explores the ability of MLLMs to handle domain-specific tasks by retrieving knowledge from an MMKG. We introduce MH-MMKG and the MH Benchmark as a testbed. Additionally, we propose a baseline with a multi-agent knowledge retriever that allows MLLMs to autonomously access relevant knowledge without requiring additional training. Experimental results highlight the importance of both the visual perception of MLLMs and finding relevant knowledge for reasoning. Our work paves the way for more adaptable and context-aware MLLMs in complex real-world scenarios. Future research may focus on expanding the benchmark size, incorporating a wider range of knowledge types, exploring the potential of video modalities, and developing more advanced knowledge retrieval methods.
Acknowledgement
This work is supported by World Premier International Research Center Initiative (WPI), MEXT, Japan. This work is also supported by JST ACT-X Grant No. JPMJAX24C8, JSPS KAKENHI No. 24K20795, CREST Grant No. JPMJCR20D3, and JST FOREST Grant No. JPMJFR216O.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- Baumgartner et al. [2020] Matthias Baumgartner, Luca Rossetto, and Abraham Bernstein. Towards using semantic-web technologies for multi-modal knowledge graph construction. In ACM Multimedia, pages 4645–4649, 2020.
- Bollacker et al. [2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.
- Bonomo and Bianco [2025] Mirco Bonomo and Simone Bianco. Visual rag: Expanding mllm visual knowledge without fine-tuning. arXiv preprint arXiv:2501.10834, 2025.
- Caffagni et al. [2024] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In CVPR, pages 1818–1826, 2024.
- Chen et al. [2025] Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, and Hui Xiong. Plan-on-graph: Self-correcting adaptive planning of large language model on knowledge graphs. In AAAI, 2025.
- Chen et al. [2024a] Peng Chen, Pi Bu, Jun Song, Yuan Gao, and Bo Zheng. Can vlms play action role-playing games? take black myth wukong as a study case. arXiv preprint arXiv:2409.12889, 2024a.
- Chen et al. [2022] Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William Cohen. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In EMNLP, pages 5558–5570, 2022.
- Chen et al. [2023] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Chen et al. [2024b] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
- Chen et al. [2024c] Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, et al. Knowledge graphs meet multi-modal learning: A comprehensive survey. arXiv preprint arXiv:2402.05391, 2024c.
- Cui et al. [2024] Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In WACV, pages 958–979, 2024.
- Edge et al. [2024] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- Fang et al. [2024] Siyuan Fang, Kaijing Ma, Tianyu Zheng, Xinrun Du, Ningxuan Lu, Ge Zhang, and Qingkun Tang. Karpa: A training-free method of adapting knowledge graph as references for large language model's reasoning path aggregation. arXiv preprint arXiv:2412.20995, 2024.
- Ferrada et al. [2017] Sebastián Ferrada, Benjamin Bustos, and Aidan Hogan. Imgpedia: a linked dataset with content-based analysis of wikimedia images. In ISWC, pages 84–93, 2017.
- Gao et al. [2023] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
- Guan et al. [2024] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, pages 14375–14385, 2024.
- Gui et al. [2022] Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander G Hauptmann, Yonatan Bisk, and Jianfeng Gao. Kat: A knowledge augmented transformer for vision-and-language. In NAACL, pages 956–968, 2022.
- Guo et al. [2021] Hao Guo, Jiuyang Tang, Weixin Zeng, Xiang Zhao, and Li Liu. Multi-modal entity alignment in hyperbolic space. Neurocomputing, 461:598–607, 2021.
- He et al. [2024] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. In NeurIPS, 2024.
- Ishiwatari et al. [2020] Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In EMNLP, pages 7360–7370, 2020.
- Ji et al. [2021] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and S Yu Philip. A survey on knowledge graphs: Representation, acquisition, and applications. TNNLS, 33(2):494–514, 2021.
- Ji et al. [2024] Yixin Ji, Kaixin Wu, Juntao Li, Wei Chen, Mingjie Zhong, Xu Jia, and Min Zhang. Retrieval and reasoning on kgs: Integrate knowledge graphs into large language models for complex question answering. In Findings of EMNLP, pages 7598–7610, 2024.
- Jiang et al. [2023] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. In EMNLP, pages 9237–9251, 2023.
- Jiang et al. [2024] Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Huang Zihe, Jian Guo, and Yuanzhuo Wang. Mm-chatalign: A novel multimodal reasoning framework based on large language models for entity alignment. In Findings of EMNLP, pages 2637–2654, 2024.
- Jin et al. [2024] Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. Large language models on graphs: A comprehensive survey. TKDE, 2024.
- Krishna et al. [2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123:32–73, 2017.
- Lee et al. [2024] Junlin Lee, Yequan Wang, Jing Li, and Min Zhang. Multimodal reasoning with multimodal knowledge graph. In ACL, pages 10767–10782, 2024.
- Li et al. [2023] Qian Li, Cheng Ji, Shu Guo, Zhaoji Liang, Lihong Wang, and Jianxin Li. Multi-modal knowledge graph transformer framework for multi-modal entity alignment. In Findings of EMNLP, pages 987–999, 2023.
- Li et al. [2024a] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024a.
- Li et al. [2024b] Yangning Li, Yinghui Li, Xingyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S Yu, Fei Huang, et al. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent. arXiv preprint arXiv:2411.02937, 2024b.
- Lin and Byrne [2022] Weizhe Lin and Bill Byrne. Retrieval augmented visual question answering with outside knowledge. In EMNLP, pages 11238–11254, 2022.
- Liu et al. [2019] Ye Liu, Hui Li, Alberto Garcia-Duran, Mathias Niepert, Daniel Onoro-Rubio, and David S Rosenblum. Mmkg: multi-modal knowledge graphs. In ESWC, pages 459–474. Springer, 2019.
- Lu et al. [2024] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv:2405.20797, 2024.
- Luo et al. [2024] Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, and Shirui Pan. Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models. arXiv preprint arXiv:2410.13080, 2024.
- Lymperaiou and Stamou [2024] Maria Lymperaiou and Giorgos Stamou. A survey on knowledge-enhanced multimodal learning. Artificial Intelligence Review, 57(10):284, 2024.
- Nguyen et al. [2024] Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, and Luu Anh Tuan. Video-language understanding: A survey from model architecture, model training, and data perspectives. ACL, 2024.
- Peng et al. [2024] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921, 2024.
- Ravi et al. [2023] Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, and Vered Shwartz. Vlc-bert: Visual question answering with contextualized commonsense knowledge. In WACV, pages 1155–1165, 2023.
- Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, pages 146–162. Springer, 2022.
- Stepfun [2024] Stepfun. Step-1o. In https://platform.stepfun.com/, 2024.
- Su et al. [2024] Cheng Su, Jinbo Wen, Jiawen Kang, Yonghua Wang, Yuanjia Su, Hudan Pan, Zishao Zhong, and M Shamim Hossain. Hybrid rag-empowered multi-modal llm for secure data management in internet of medical things: A diffusion-based contract approach. IEEE Internet of Things Journal, 2024.
- Sun et al. [2025] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In ICLR, 2025.
- Talebirad and Nadiri [2023] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Wang et al. [2024a] Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. Direct: Diagnostic reasoning for clinical notes via large language models. In NeurIPS, pages 74999–75011, 2024a.
- Wang et al. [2024b] Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024b.
- Wang et al. [2025] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.
- Wang et al. [2024c] Zheng Wang, Shu Teo, Jieer Ouyang, Yongjun Xu, and Wei Shi. M-RAG: Reinforcing large language model performance through retrieval-augmented generation with multiple partitions. In ACL, pages 1966–1978, 2024c.
- Wu et al. [2024] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
- Yao et al. [2024] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- Yin et al. [2023] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, pages 9556–9567, 2024.
- Zeng et al. [2023] Yawen Zeng, Qin Jin, Tengfei Bao, and Wenfeng Li. Multi-modal knowledge hypergraph for diverse image retrieval. In AAAI, pages 3376–3383, 2023.
- Zhao et al. [2024] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024.
- Zhao et al. [2023] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. Retrieving multimodal information for augmented generation: A survey. In Findings of EMNLP, pages 4736–4756, 2023.
- Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS, 36:46595–46623, 2023.
- Zhu et al. [2022] Xiangru Zhu, Zhixu Li, Xiaodan Wang, Xueyao Jiang, Penglei Sun, Xuwu Wang, Yanghua Xiao, and Nicholas Jing Yuan. Multi-modal knowledge graph construction and application: A survey. TKDE, 36(2):715–735, 2022.
Supplementary Material
Contents
- 1 Introduction
- 2 Related Works
  - 2.1 Multimodal Retrieval Augmented Generation
  - 2.2 Multimodal Knowledge Graph
  - 2.3 Retrieve Knowledge on Graph
- 3 Datasets
  - 3.1 MH-MMKG
  - 3.2 MH Benchmark
  - 3.3 Task Definition
- 4 Method
  - 4.1 Multi-agents Retriever
  - 4.2 Reasoning via Knowledge Augmentation
- 5 Results
  - 5.1 Experimental Settings
  - 5.2 Performance of MLLMs
  - 5.3 Analysis of Factors Affecting Performance
- 6 Conclusion
- 7 Detail of MH Benchmark Construction
  - 7.1 MH-MMKG
  - 7.2 MH Benchmark
- 8 Experiment Details
  - 8.1 Prompt Template for Our Method
  - 8.2 Knowledge Consistency Calculation
  - 8.3 Human Evaluation of GPT-4o as a Judge
  - 8.4 Additional Experiments for MH Benchmark
  - 8.5 More Result Samples
7 Detail of MH Benchmark Construction
In this section, we detail the construction of our MH-MMKG and the MH Benchmark.
7.1 MH-MMKG
A total of 22 monsters are incorporated into the graph, each represented as a subgraph connected to others through various relationships, such as species relations and elemental weaknesses. Each subgraph contains rich information crucial for successful conquests, particularly regarding attack strategies, combos, attack phases, and launch conditions. The monsters, all from Monster Hunter: World, are: Anjanath, Azure Rathalos, Barroth, Bazelgeuse, Brachydios, Diablos, Frostfang Barioth, Glavenus, Kushala Daora, Legiana, Nergigante, Rathalos, Rathian, Teostra, Tigrex, Uragaan, Zinogre, Pink Rathian, Yian Garuga, Stygian Zinogre, and Radobaan.
To ensure quality, we hired three experienced Monster Hunter: World players, each with over 200 hours of gameplay experience. They were tasked with gathering relevant monster information from sources such as wikis, YouTube, and Bilibili to construct the graph. Additionally, since each monster has unique characteristics within the game, the structure of each subgraph is tailored accordingly. The entities are classified into 7 types, as shown in Table 6. Most of them are attack actions, making MH-MMKG focus primarily on battles with monsters; we plan to explore more game elements in the future. Some entities have text, images, or videos attached as attributes. Note that all videos and images in MH-MMKG are captured in the Arena field, whereas all visual media for queries in the MH Benchmark are captured in the Wild field. We also show the length statistics of the video clips in Figure 7; most videos are between 1 and 5 seconds long.
Table 6: Types of entity in MH-MMKG.
| Entity Type | Description | Count |
| --- | --- | --- |
| Topic Entity | Names of monsters that serve as root entities for knowledge retrieval. Each entity is accompanied by an image of the monster as its attribute. | 22 |
| Attack Action | Possible attack movements of a monster, each accompanied by text, images (key frames of the video), or a video as its attribute. Each is also accompanied by a human-written caption for the video. | 265 |
| Attack Phase | In different phases, a monster has varying attack patterns, damage, combos, and other attributes. Only some monsters have unique phase settings. Textual context is attached as an attribute. | 20 |
| Element | The element to which a monster has weakened resistance, indicating its vulnerability to a specific type of attack. | 9 |
| Weapon | Types of damage for weapons crafted from monster materials. | 10 |
| Props | Various types of game props for interacting with monsters. | 6 |
| Attack Effects | The effects of monster attacks or skills during battle, including ice patches generated on the ground, scratches, and explosions. Textual context is attached as an attribute. | 9 |
There are 158 kinds of edges, 16 of which are base edges: "has attack action of", "continues with attack action of", "has attack variant of", "has attack phase of", "change attack phase to", "is mostly weaken with", "is weaken with", "is resistant with", "provide materials for", "can be stopped by", "has attack variants of", "generates", "cause", "turns to", "mated pair with", and "has subspecies of". Some base edges (mostly the first two) are further combined with specific constraint mechanisms to form the remaining 142 variants. Sample constraints include "is angry", "hunter step into", "is close to", "stick on the wall", "is knocked by", etc. (we do not list all of them, as there are too many).
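As an illustration of how such constrained edges could be stored, here is a minimal sketch. The `Edge` dataclass and the `applicable` helper are our own illustration under this scheme, not the paper's actual data format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Edge:
    head: str
    relation: str                     # one of the 16 base edges
    tail: str
    constraint: Optional[str] = None  # None for an unconstrained base edge

def applicable(edges, state):
    """Keep edges whose constraint (if any) holds in the current game state."""
    return [e for e in edges if e.constraint is None or e.constraint in state]

# Toy triples: a combo edge that only fires when the monster is angry.
edges = [
    Edge("Zinogre", "has attack action of", "Tail Slam"),
    Edge("Tail Slam", "continues with attack action of", "Body Slam",
         constraint="is angry"),
]
```

Separating the base relation from its constraint keeps the 16 base edges as a small closed vocabulary while still letting a reasoner check which of the 142 conditional variants apply in a given battle state.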
<details>
<summary>x8.png Details</summary>

### Visual Description
## Chart: Density Plot of Duration
### Overview
The image presents a density plot illustrating the distribution of "Duration" measured in seconds. The plot shows a unimodal distribution, peaking around 3 seconds, and tapering off towards zero at both ends.
### Components/Axes
* **X-axis:** Labeled "Duration (seconds)". Scale ranges from 0.0 to 17.5 seconds, with markings at 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, and 17.5.
* **Y-axis:** Labeled "Density". Scale ranges from 0.00 to 0.35, with markings at 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, and 0.35.
* **Plot:** A filled area representing the density distribution, colored in a light blue.
* **Grid:** A light gray grid is present in the background to aid in reading values.
### Detailed Analysis
The density plot shows a single peak, indicating the most frequent duration value.
* **Peak:** The maximum density is approximately 0.33, occurring at a duration of roughly 3.0 seconds.
* **Left Tail:** The density decreases rapidly from 0 seconds to the peak at 3 seconds. At 0 seconds, the density is approximately 0.01.
* **Right Tail:** The density decreases more gradually from the peak at 3 seconds to 17.5 seconds. At 7.5 seconds, the density is approximately 0.03. At 10 seconds, the density is approximately 0.01.
* **Area Under Curve:** The total area under the curve represents the total probability (which is 1).
### Key Observations
* The distribution is heavily skewed to the right, meaning there are more durations clustered around lower values, with a few longer durations.
* The majority of durations fall between 0 and 7.5 seconds.
* Durations exceeding 10 seconds are relatively rare.
### Interpretation
This density plot suggests that the "Duration" variable represents a process or event where most instances are relatively short, with a small number of instances lasting significantly longer. This could represent, for example, the duration of customer service calls, the time taken to complete a task, or the length of a user's session on a website. The peak at 3 seconds indicates that 3 seconds is the most common duration. The right skew suggests that while most durations are short, there's a non-negligible chance of encountering longer durations. Further analysis might involve calculating summary statistics like the mean, median, and standard deviation to quantify the distribution more precisely.
</details>
Figure 7: Video clip length statistic.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: Kushala Daora Monster Hunter Phase/Attack Flow
### Overview
This diagram depicts the phases and attacks of the monster Kushala Daora, likely from the Monster Hunter game series. It shows transitions between phases (Aerial and Ground) and the attacks associated with each phase. The diagram uses arrows to indicate the flow of transitions and relationships between phases and attacks.
### Components/Axes
The diagram consists of the following components:
* **Monster Image:** An image of Kushala Daora is located in the bottom-left corner.
* **Phases:** Two main phases are identified: "Aerial Phase" and "Ground Phase", positioned centrally.
* **Attacks:** A series of attacks are listed, connected to the phases via arrows. These include: "Aerial Kick", "Aerial Super Breath", "Aerial Triple Breath", "Aerial Crash", "Leap Attack", "Charge Attack", "Ground Breath", "Ground Charged Breath", and "Fly".
* **Arrows:** Arrows indicate transitions between phases and the association of attacks with phases. There are three arrow styles:
* Black arrows: Represent attacks originating from a phase.
* Blue arrows: Represent phase changes.
* Red dotted arrows: Represent phase changes.
* **Legend:** A legend in the top-right corner explains the meaning of the arrow types:
* "has phase of" (associated with black arrows)
* "change phase to" (associated with blue and red dotted arrows)
* "has attack of" (associated with black arrows)
### Detailed Analysis / Content Details
Here's a breakdown of the connections and flow:
* **Kushala Daora** transitions to **Aerial Phase** via a blue arrow.
* **Kushala Daora** transitions to **Ground Phase** via a red dotted arrow.
* **Aerial Phase** has the following attacks:
* "Aerial Kick"
* "Aerial Super Breath"
* "Aerial Triple Breath"
* "Aerial Crash"
* **Ground Phase** has the following attacks:
* "Leap Attack"
* "Charge Attack"
* "Ground Breath"
* "Ground Charged Breath"
* **Fly** transitions to **Ground Phase** via a blue arrow.
* **Ground Phase** transitions to **Aerial Phase** via a red dotted arrow.
### Key Observations
* The diagram highlights the cyclical nature of the monster's behavior, transitioning between aerial and ground phases.
* Each phase has a distinct set of attacks associated with it.
* The "Fly" action seems to directly lead into the "Ground Phase".
* The diagram doesn't provide any quantitative data (e.g., attack frequency, phase duration). It's purely a qualitative representation of the monster's behavior.
### Interpretation
The diagram serves as a behavioral map for Kushala Daora, likely intended for players of the Monster Hunter game. It illustrates the monster's core combat loop: alternating between aerial and ground phases, each with a specific set of attacks. Understanding this flow is crucial for players to predict the monster's actions and develop effective strategies. The distinction between the blue and red dotted arrows for phase transitions might indicate different conditions or triggers for those changes. The diagram suggests that the monster is not simply alternating phases randomly, but that certain actions ("Fly") can directly influence the phase transition. The lack of quantitative data suggests that the diagram is a simplified representation of a more complex system, focusing on the essential behavioral patterns.
</details>
Figure 8: Subgraph structure for Kushala Daora.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Brachydios Attack Flow
### Overview
The image is a diagram illustrating the attack patterns and resulting effects of the monster "Brachydios". It depicts a flow of attacks originating from Brachydios, leading to various explosive slime effects. The diagram uses arrows to indicate relationships between attacks and their consequences, with different arrow colors representing different types of relationships.
### Components/Axes
The diagram consists of the following components:
* **Brachydios:** A drawing of the monster, positioned on the left side of the diagram.
* **Attacks:** Represented by icons with the label "Punch Explosion" and text labels like "Slime Reapply", "Jump Attack", and "Tail Swipe".
* **Effects:** Represented by icons with the label "Explosive Slime" and "Slime Explosion".
* **Arrows:** Colored arrows indicating relationships:
* Black arrow: "has attack of"
* Cyan arrow: "has attack variant"
* Orange arrow: "generates"
* Purple arrow: "causes"
* **Text Labels:** "green slime", "orange slime", and "..." (indicating more attacks).
### Detailed Analysis or Content Details
The diagram shows the following attack flow:
1. **Brachydios** "has attack of" **Slime Reapply**.
2. **Brachydios** "has attack of" **Punch Explosion**.
* **Punch Explosion** "generates" **Explosive Slime**.
* **Punch Explosion** has a variant of "green slime".
* **Punch Explosion** has a variant of "orange slime".
3. **Brachydios** "has attack of" **Jump Attack**.
* **Jump Attack** "causes" **Slime Explosion**.
4. **Brachydios** "has attack of" **Tail Swipe**.
5. **Explosive Slime** "causes" **Slime Explosion**.
The "..." indicates that Brachydios has additional attacks not explicitly shown in the diagram.
### Key Observations
* The diagram focuses on attacks that involve slime and explosions.
* "Punch Explosion" is a central attack, with variants related to different slime colors.
* The diagram illustrates a chain reaction where attacks can generate effects that then cause further effects.
* The diagram is a simplified representation of Brachydios's attack patterns, likely for game design or strategy purposes.
### Interpretation
The diagram demonstrates the attack mechanics of Brachydios, emphasizing its ability to inflict explosive slime damage. The different arrow colors highlight the relationships between attacks, variants, and resulting effects. The "generates" and "causes" relationships suggest a cascading effect where one attack can trigger a series of explosions. The inclusion of slime colors ("green slime", "orange slime") suggests that these variants may have different properties or effects. The diagram is likely intended to help players understand Brachydios's attack patterns and develop strategies to counter them. The "..." suggests that the diagram is not exhaustive and that Brachydios may have other attacks not depicted. The diagram is a visual aid for understanding a complex system of interactions within a game or similar context.
</details>
Figure 9: Subgraph structure for Brachydios.
We present some subgraphs to illustrate structural diversity, focusing only on attack actions and their related entities, as the other components resemble Figure 2 in the main paper. The main paper showcases the graph structure of Zinogre, known for its extensive combo attacks. Here, we provide two additional examples: Kushala Daora and Brachydios. Kushala Daora exhibits distinct attack patterns in its Aerial Phase (attacking from the air) and Ground Phase (attacking on the ground), as shown in Figure 8. Certain attacks can transition between these phases, making this information crucial for an MLLM to accurately answer related questions. Brachydios, on the other hand, has attack variations that depend on the color of the slime on its fists or head, as illustrated in Figure 9. The color change alters both the attack variant and its effect, adding another layer of complexity to its combat behavior. MLLMs have to comprehend such complex information to correctly answer the questions in the MH Benchmark.
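The Brachydios subgraph in Figure 9 can be viewed as a small set of typed, optionally conditioned triples. The sketch below is a minimal illustration under our own assumptions: the tuple layout and the `neighbors` helper are hypothetical and not the released MH-MMKG format, though the entity and relation names are taken from the figure.

```python
# Minimal sketch of a monster subgraph as typed triples.
# Relation names follow Figure 9; the (head, relation, tail, condition)
# tuple format is an illustrative assumption, not the released schema.
TRIPLES = [
    ("Brachydios", "has attack of", "Slime Reapply", None),
    ("Brachydios", "has attack of", "Punch Explosion", None),
    ("Brachydios", "has attack of", "Jump Attack", None),
    ("Brachydios", "has attack of", "Tail Swipe", None),
    ("Punch Explosion", "has attack variant", "green slime", None),
    ("Punch Explosion", "has attack variant", "orange slime", None),
    ("Punch Explosion", "generates", "Explosive Slime", None),
    ("Jump Attack", "causes", "Slime Explosion", None),
    ("Explosive Slime", "causes", "Slime Explosion", None),
]

def neighbors(entity, triples=TRIPLES):
    """Return the (relation, tail, condition) edges leaving `entity`."""
    return [(r, t, c) for h, r, t, c in triples if h == entity]
```

A retriever expanding from `Punch Explosion` would then see its two slime variants and the `Explosive Slime` effect as candidate neighbors.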
7.2 MH Benchmark
To differentiate from MH-MMKG, all visual media for queries are captured from the Wild field. Additionally, we present statistics on the average number of entities and depth of knowledge associated with each query in the MH Benchmark, as shown in Table 7. Sub-task I is relatively simple, as it relies solely on the topic node. In contrast, sub-tasks III and VI involve a greater number of steps and deeper analysis, as they pertain to combo recognition and cross-monster comparison, both of which require more complex reasoning.
Table 7: Average number of entities and depth of knowledge for each query in MH Benchmark.
| | I | II | III | IV | V | VI |
| --- | --- | --- | --- | --- | --- | --- |
| Number avg | 1 | 2.339 | 3.535 | 2.4137 | 3.028 | 4.076 |
| Depth avg | 1 | 2.278 | 3.250 | 2.4137 | 2.900 | 3.038 |
Table 8: Prompt for perceiver agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster {monster name}. |
| The given "Question" regarding the battle screen is: {question} |
| Generate a "Description" of the battle scene as your "Response", detailing the monster's limb and body movements, mouth actions, surroundings, and other relevant details. |
| Note that you should not give any assumptions for the "Description". |
| Note that you should directly output your "Response" and do not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 9: Prompt for topic entity selection agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster: {monster name}. |
| The given "Question" regarding the battle screen is: {question} |
| All possible monster names "Options" are structured in a list format as follows: {topic entity} |
| Note that your "Response" is to directly output the name of the monster you are looking for. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. Your "Response": |
Table 10: Prompt for expansion agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| The text description of the battle screen is: {caption}. |
| Based on the battle screen, here is the "Question" you need to answer: {question}. |
| To answer the above question, you are now searching a knowledge graph to find the route towards relevant knowledge. The following contents are the knowledge you found so far (up to current entity {entity}): |
| ***** |
| {memory} |
| ***** |
| You need to select the relevant "Neighbor Entity" that may provide knowledge to answer the question. The relation and condition from current entity "entity" to all "Neighbor Entity" are: |
| ***** |
| {neighbor entity} |
| ***** |
| Your "Response" is directly output the name of all relevant "Neighbor Entity" and separate them directly by ";". |
| If there is no relevant "Neighbor Entity", directly output "None". |
| Note that if the "Neighbor Entity" is an attack action, always choose it (if it is not highly irrelevant). |
| Note that if the "Neighbor Entity" is a phase, you can only choose one. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
8 Experiment Details
In this section, we show the detailed settings of our baseline method, including the prompts for each agent, additional experiments, and more samples.
8.1 Prompt Template for Our Method
We first present the prompt templates for all agents in the retrieval pipeline. Table 8 shows the prompt for the perceiver agent, which translates input images into text based on the given question. Table 9 provides the prompt for the topic selection agent, responsible for selecting the starting entity for knowledge retrieval from the graph. Table 10 contains the prompt for the expansion agent, which plans the next neighboring entity to search. Table 11 presents the prompt for the validation agent, designed to assess whether the knowledge gathered from the starting entity to the current entity is sufficient. Finally, Table 12 includes the prompt for the summarizer agent, which synthesizes the retrieved knowledge for final answer generation. Among these placeholders, {monster name}, displayed in blue text, represents additional information provided as part of the question. {entity} is the name of the current entity during search. {question} refers to the input query, while {topic entities} denotes the names of all topic entities. {entity info} is the visually independent additional information for an entity. {caption} is the description generated by the perceiver agent.
The {neighbor entity} lists the neighbor options for the current entity. It is presented in a text format consisting of entity-edge triplets and their corresponding constraints or conditions (if any). Here is a neighbor sample for the monster "Frostfang Barioth" attack action entity "Straight Ice Breath":
- "Straight Ice Breath" continues with attack action of "Super Fang Slam" (Condition: When hunter hitted by the breath…)
- "Straight Ice Breath" continues with attack action of "Tail Spin" (Condition: When Frostfang Barioth already released two…)
In our prompt, we instruct the model to select relevant neighboring entities while placing greater emphasis on attack action entities, as most tasks are designed around them. For phase entities, we allow the model to select only one, in accordance with the game mechanics.
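These selection rules can also be enforced as a post-processing step over the agent's ";"-separated response. The sketch below is a minimal illustration under our own assumptions: the `entity_types` lookup is hypothetical and not part of the released pipeline.

```python
def filter_selection(response: str, entity_types: dict) -> list:
    """Post-process the expansion agent's ';'-separated entity list.

    Keeps all selected entities but at most one phase entity, mirroring
    the constraint stated in the prompt (Table 10). `entity_types` maps
    entity name -> "attack" | "phase" | other; it is a hypothetical
    lookup, not part of the released pipeline.
    """
    if response.strip() == "None":
        return []
    selected, phase_taken = [], False
    for name in (n.strip() for n in response.split(";")):
        if not name:
            continue
        if entity_types.get(name) == "phase":
            if phase_taken:
                continue  # only one phase entity is allowed
            phase_taken = True
        selected.append(name)
    return selected
```

For instance, a response selecting both "Charged Phase" and "Normal Phase" would be pruned to keep only the first phase, while all attack-action entities pass through unchanged.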
The {memory} records the search path from the starting entity to the current entity, including entity names and all relevant information at each step. Below is an example illustrating this conversion. The knowledge path
$$
\displaystyle\textit{Zinogre}\xrightarrow{\text{phase of}}\textit{Charged Phase}\xrightarrow{\text{attack of}}\textit{Double Slam} \tag{12}
$$
will be converted into:
- "Zinogre": Additional Information: Zinogre has the appearance of a wild wolf and lives in the mountains full of dense trees …
- "Zinogre" has attack phase of "Charged Phase".
- "Charged Phase": Additional Information: Zinogre is charged, the body will be surrounded by electric …
- "Charged Phase" has attack action of "Double Slam".
- "Double Slam": Action Description: Zinogre lowers his head and rubs the ground with…
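This path-to-text conversion can be sketched as a small serializer over alternating entities and relations. The function below is an illustrative assumption that mirrors the line format of the example above; the attribute labels (e.g., "Additional Information") follow the example, but the function itself is not the released implementation.

```python
def serialize_path(path, attributes):
    """Convert a knowledge path into the line-based {memory} text.

    `path` alternates entities and relations, e.g.
    ["Zinogre", "has attack phase of", "Charged Phase"];
    `attributes` maps entity -> list of (label, text) pairs such as
    ("Additional Information", "..."). The line format mirrors the
    example above; the function is an illustrative sketch only.
    """
    lines = []
    for i in range(0, len(path), 2):
        entity = path[i]
        # Emit the entity's attribute lines first, then the outgoing edge.
        for label, text in attributes.get(entity, []):
            lines.append(f'"{entity}": {label}: {text}')
        if i + 2 < len(path):
            relation, nxt = path[i + 1], path[i + 2]
            lines.append(f'"{entity}" {relation} "{nxt}".')
    return "\n".join(lines)
```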
Table 11: Prompt for validation agent. Content in [] is used solely for unaided-online experiments.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| The text description of the battle screen is: {caption}. |
| Based on the battle screen, here is the "Question" you need to answer: {question}. |
| To answer the above question, you are now searching a knowledge graph to find the route towards relevant knowledge. |
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| To answer the above question, you are now searching a knowledge graph to find the route towards relevant knowledge. The following contents are the knowledge you found so far (up to current entity {entity}): |
| ***** |
| {memory} |
| ***** |
| And here is some information of current entity: {entity info}. |
| [You will also receive consecutive video frames showing the battle screen with the monster {monster name} as visual information for current entity {entity}. |
| Make a "Description" (do not be affected by the previous text description of the battle screen for the "Question") for the battle screen as a part of your "Response". "Description" should include monster's limb and body movements, mouth, surrounding and other details. |
| Note that you should not give any assumptions for the "Description".] |
| You have to decide whether visual and text information of this entity together with previous found knowledge is sufficient for answering this "Question". |
| For sufficient analysis, your "Answer" is "Yes" or "No". |
| [Directly output your "Response" as the combination of "Answer" and "Description", separating them directly by ";". ] |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 12: Prompt for summarizer agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster {monster name}. Based on the battle screen, here is the "Question" you need to answer: {question}. |
| Here is the "Knowledge" you retrieved from a knowledge graph for this "Question": |
| ***** |
| {knowledge} |
| ***** |
| Your "Response" is to provide the answer for this "Question" based on the retrieved Knowledge. |
| Note that you should not give any analysis. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Note that Additional Information is an attribute of an entity (if it exists). Action Description is given as a human-made caption in the Knowledgeable experiments, pre-extracted from the visual attribute (if it exists) in unaided-offline, and dynamically generated from the visual attribute in unaided-online (if it exists). In particular, the [] highlights the content for unaided-online, which requires the model to comprehend visual references during validation and output the corresponding description as the temporary visual attribute for the current entity.
As shown in Table 12, the summarizer agent treats all retrieved paths as {knowledge}, using the same strategy as for {memory}. Each path is converted into a text description and attached to the query as input.
Table 13 shows the prompt template for the unaided-offline experiments. It is used to pre-extract the visual reference (images or videos for MLLMs or video models, respectively, in our experiments) into a text description. This conversion is independent of the query and the search memory.
Note that the prompts for the agent pipeline were developed using InternVL2.5-78B [12], with the expectation that even open-source models, given their instruction-following capabilities, can understand these prompts and generate responses in the required format. This ensures a fair comparison for all closed-source models in the main paper. We further conducted a preliminary prompt robustness analysis for GPT-4o and Claude 3.7 (unaided-online). Our observations show that Claude generally exhibited robust performance across prompt variations, particularly for agents with straightforward instructions such as the Perceiver, Topic Selection, and Summarizer. However, GPT-4o exhibited sensitivity to lexical choice. For instance, the Validation agent uses the term "sufficient" to decide whether the retrieved knowledge is enough and retrieval should be stopped. When we replaced it with "necessary," GPT-4o tended to be more cautious during retrieval. This minor change led to drops of .0546 in Acc. and .0871 in Rec., though with a .0194 improvement in Pre. These findings suggest that prompt robustness is both model-specific and agent-specific.
Table 13: Prompt for offline caption pre-extraction.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames showing the battle screen as visual information for {entity}. |
| Make a "Description" for the battle screen as your "Response". "Description" should include monster's limb and body movements, mouth, surrounding and other details. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 14: Prompt for accuracy calculation using GPT-4o as a judge.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| Here is a "Question" need to be answered: {question}. |
| There are also two answers for this "Question": |
| Answers 1: {answer gt}. |
| Answers 2: {answer pred}. |
| Your "Response" is to decide whether the content of these two answers are similar. |
| If similar directly output "Yes". |
| If not similar directly output "No". |
| Note that you may ignore the format difference. |
| Ignore the difference of monster name before word, e.g., Zinogre Leap Attack and Leap Attack are with same meaning. |
| Here are some samples for decide similarity: |
| Sample 1: |
| "Question": Tell me what is the specific name of attack action that Zinogre is performing? |
| "Answer 1": Static Charge |
| "Answer 2": Thunder Charge B |
| "Response": No |
| Sample 2: |
| "Question": Start with counterattack, Zinogre released the attack action shown in the input battle screen. Tell me what is the next attack action? |
| "Answer 1": Zinogre Back Slam |
| "Answer 2": Back Slam |
| "Response": Yes |
| Sample 3: |
| "Question": What attack action Brachydios is unleashing? |
| "Answer 1": Brachydios is unleashing the Brachydios Ground Slime Explosion attack |
| "Answer 2": Ground Slime Explosion |
| "Response": Yes |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 15: Prompt for similarity calculation between generated and human-made caption using GPT-4o as a judge.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| Here are two text descriptions of a monster attack action. |
| Your "Response" is to decide whether the content of these two text descriptions are similar. |
| You should focus on the details of movement and some key information that can help you to discriminate the action. |
| If similar directly output "Yes". |
| If not similar directly output "No". |
| The First description is {truth}. |
| The Second description is {generated}. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
8.2 Knowledge Consistency Calculation
As defined in the main paper, the model's final output is a retrieved subgraph, denoted as $\hat{\mathcal{I}}$. We consider each path from the root entity to a leaf entity as a unique knowledge instance and represent the set of such paths as $\hat{\mathcal{L}}$. The knowledge consistency is computed between $\hat{\mathcal{L}}$ and the ground-truth knowledge paths $\mathcal{L}$ using a one-to-one matching approach.
The recall and precision of retrieved knowledge paths are defined as follows:
$$
\text{Recall}=\frac{|\hat{\mathcal{L}}\cap\mathcal{L}|}{|\mathcal{L}|} \tag{13}
$$
$$
\text{Precision}=\frac{|\hat{\mathcal{L}}\cap\mathcal{L}|}{|\hat{\mathcal{L}}|} \tag{14}
$$
where $\hat{\mathcal{L}}\cap\mathcal{L}$ represents the set of correctly retrieved knowledge paths. Recall measures the proportion of ground-truth knowledge paths successfully retrieved by the model, while precision measures the proportion of retrieved paths that are correct.
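Since each path can be represented as a hashable sequence of entity names, Equations (13) and (14) reduce to plain set operations. A minimal sketch, assuming paths are already canonicalized as tuples:

```python
def path_consistency(pred_paths, gt_paths):
    """Compute recall and precision between retrieved and ground-truth
    knowledge paths (Eqs. 13-14).

    Paths must be hashable, e.g. tuples of entity names from root to
    leaf; canonicalization is assumed to be done beforehand.
    """
    pred, gt = set(pred_paths), set(gt_paths)
    correct = pred & gt  # correctly retrieved knowledge paths
    recall = len(correct) / len(gt) if gt else 0.0
    precision = len(correct) / len(pred) if pred else 0.0
    return recall, precision
```

This matches the qualitative samples later in this section: retrieving three paths of which one is correct against a single ground-truth path yields recall 1 and precision 0.33.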
8.3 Human Evaluation of GPT-4o as a Judge
Tables 14 and 15 present the templates for using GPT-4o as a judge [59] to assess result accuracy (Acc.) and caption similarity (Sim.). For accuracy evaluation, we prompt GPT-4o to compare the similarity between the ground-truth answer {answer gt} and the generated answer {answer pred}. Additionally, we provide three few-shot examples as references for the model.
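Operationally, the judging step amounts to filling the Table 14 template and counting "Yes" replies. The sketch below is a simplified illustration: the abbreviated `JUDGE_TEMPLATE` omits the few-shot samples, and the `ask_judge` callable stands in for an actual GPT-4o API call; both are our assumptions rather than the released implementation.

```python
# Abbreviated judge template following Table 14; the full prompt also
# includes the three few-shot samples. Parsing "Yes"/"No" into a boolean
# is our assumption about how the judgment is consumed.
JUDGE_TEMPLATE = (
    "Here is a 'Question' need to be answered: {question}.\n"
    "Answers 1: {answer_gt}.\n"
    "Answers 2: {answer_pred}.\n"
    "If similar directly output 'Yes'. If not similar directly output 'No'."
)

def judge_accuracy(samples, ask_judge):
    """Fraction of samples the judge marks as similar (Acc.).

    `samples` is a list of (question, answer_gt, answer_pred) tuples;
    `ask_judge` is any callable that sends a prompt to the judge model
    and returns its raw text reply.
    """
    hits = 0
    for question, answer_gt, answer_pred in samples:
        prompt = JUDGE_TEMPLATE.format(
            question=question, answer_gt=answer_gt, answer_pred=answer_pred)
        hits += ask_judge(prompt).strip().lower().startswith("yes")
    return hits / len(samples) if samples else 0.0
```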
For caption similarity assessment, GPT-4o directly compares the human-written caption {truth} with the model-generated caption {generated}. To further evaluate GPT-4o's judging performance, we conducted a human experiment. As shown in Table 16, two knowledgeable players independently evaluated 200 randomly selected samples from GPT-4o's judgments across all experiments for each model. A judgment was considered correct if both evaluators agreed. Our findings indicate that while there are some variations across models, GPT-4o demonstrates high overall judgment accuracy (0.926). Although the caption similarity score is lower, it remains sufficiently high for such a subjective task. Overall, the results show that using GPT-4o as a judge is highly feasible.
Table 16: Human evaluation of GPT-4o judgment accuracy. Each model's generated answers and captions are evaluated on 200 randomly selected samples by two knowledgeable players.
| Model | Answer | Caption |
| --- | --- | --- |
| GPT-4o [1] | 0.925 | 0.865 |
| Claude 3.7 Sonnet [2] | 0.900 | 0.840 |
| Ovis2-16B [36] | 0.955 | 0.810 |
| Average | 0.926 | 0.838 |
8.4 Additional Experiments for MH Benchmark
In Table 17, we present the impact of incorporating the monster's name (Name) and additional information (Extra) as part of the input question $q$. The metric Top. represents the accuracy of the model in selecting the correct topic entity as the retrieval root. We observe that removing the monster's name leads to a significant performance drop due to incorrect root entity selection (low Top.).
Additional information refers to contextual hints, such as a monster being angry, which players can infer from the game's text. These details are generally too subtle for MLLMs to capture from images. Removing only the additional information also results in an obvious performance drop, indicating that such visually independent cues are essential for the model to generate the correct answer. Interestingly, adding the additional information alone improves Top. compared with the setting without either the name or the additional information.
Table 17: Impact of including the monster name (Name) and extra information (Extra) in the question, under the unaided-online setting. ✓ means the information is included.
| Name | Extra | Acc. | Pre. | Rec. | Top. |
| --- | --- | --- | --- | --- | --- |
| | | .2731 | .1251 | .2413 | .5210 |
| | ✓ | .3781 | .2080 | .4434 | .7365 |
| ✓ | | .4075 | .2120 | .4636 | 1 |
| ✓ | ✓ | .5105 | .2756 | .5625 | 1 |
Table 18 reports the average retrieval time (in seconds), the number of agent calls (rounds), and the per-call response time, each in the format mean ± std. Experiments were conducted using GPT-4o and Gemini 2.0 Flash via API, and InternVL2.5 on a local server with two A6000 GPUs. The results reveal efficiency as a limitation of the current agent pipeline. More results will be included.
Table 18: Computational cost per sample in average.
| Model | Retrieval time (s) | Agent calls | Per-call time (s) |
| --- | --- | --- | --- |
| GPT-4o | 92.46 ± 68.93 | 7.20 ± 5.01 | 10.92 ± 9.70 |
| Gemini 2.0 Flash | 17.15 ± 10.92 | 11.04 ± 9.42 | 1.12 ± 0.91 |
| InternVL | 57.06 ± 41.78 | 9.32 ± 3.58 | 7.33 ± 6.95 |
We also perform ablation studies to assess the impact of using cross-model agents for two key roles, the Summarizer (knowledge utilization) and the Validation agent (knowledge retrieval), keeping the other agents fixed. Table 19 shows Acc. results across GPT-4o, Claude 3.7, and InternVL2.5, with diagonal values representing the original single-model pipelines. Summarizer replacement yields little change between GPT-4o and Claude, indicating that performance gains stem more from the quality of the retrieved knowledge than from summarization strength (in InternVL's summarizer column, where the base pipeline supplies better knowledge, the improvement is more evident than along InternVL's row, shown in green). In contrast, using a weaker model (InternVL) for Validation causes a sharp performance drop (in red), underscoring the importance of this role. Yet upgrading only the Validation agent in the InternVL pipeline brings limited benefit, suggesting that the other retrieval-stage agents also matter substantially.
Table 19: Ablation for the cross-model agent pipeline (Acc.). Rows denote the base pipeline model; columns denote the model substituted for the Summarizer or the Validation agent.
| Base model | Summarizer: GPT-4o | Summarizer: Claude | Summarizer: InternVL | Validation: GPT-4o | Validation: Claude | Validation: InternVL |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | .5105 | .4994 | .4864 | .5105 | .4716 | .4128 |
| Claude | .4510 | .4338 | .4086 | .5052 | .4338 | .3676 |
| InternVL | .3876 | .3624 | .3080 | .3413 | .3225 | .3080 |
8.5 More Result Samples
This section presents some randomly selected examples of answers generated by various models.
Figure 10 shows a sample of "Glavenus" continuous attack action recognition. Both GPT-4o and Claude 3.7 output the wrong answer, although GPT-4o finds the path toward the true knowledge, showing that the models lack the ability to comprehend the retrieved knowledge.
Figure 11 shows a sample of "Bazelgeuse" attack action recognition. Although their responses differ somewhat, both GPT-4o and Gemini 1.5 Pro generate the correct answer. GPT-4o finds more paths for its knowledge augmentation.
Figure 12 shows a sample of "Barroth" attack action recognition. Both GPT-4o and Claude 3.7 generate the correct answer; however, GPT-4o's answer is clearer, showing better instruction-following ability.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Diagram: Combo Premonition - AI Interaction & Attack Sequence Analysis
### Overview
This diagram illustrates a process involving two Large Language Models (LLMs), GPT-4o and Claude 3.7 Sonnet, analyzing a video sequence of a monster named Glavenus performing a series of attacks. The diagram depicts the input, processing, and output of each LLM, along with associated metrics (Recall and Precision) and augmented answers. The overall goal appears to be to understand and describe Glavenus's attack patterns.
### Components/Axes
The diagram is structured around a central flow, with input from a video sequence, processing by the LLMs, and output in the form of retrieved paths and augmented answers. Key components include:
* **Header:** "III: Combo Premonition"
* **LLMs:** GPT-4o (left side) and Claude 3.7 Sonnet (right side). Represented by their respective logos.
* **Input:** A series of four screenshots from a video game, depicting Glavenus in various attack poses.
* **Prompt:** A text prompt provided to both LLMs: "screen, Glavenus seem to have continues action within this attack action. Describe the continues action."
* **Retrieved Paths:** Diagrams showing the sequence of attacks identified by each LLM. These diagrams use icons representing different attack moves.
* **Augmented Answer:** Textual descriptions generated by each LLM, detailing the attack sequence.
* **Metrics:** Recall and Precision scores associated with each LLM's analysis.
* **Feedback Indicators:** Red "X" marks indicating potential issues with the augmented answers.
### Detailed Analysis or Content Details
**GPT-4o Side:**
* **Input Screenshots:** Four images of Glavenus.
* **Prompt:** "screen, Glavenus seem to have continues action within this attack action. Describe the continues action."
* **Retrieved Paths:** A diagram showing Glavenus at the center, with three arrows pointing to different attacks:
* Green arrow: Glavenus -> Heated Tailspin
* Yellow arrow: Glavenus -> Slam Slice Tail Scrape
* Blue arrow: Glavenus -> Sword Swing
* **Recall:** 1
* **Precision:** 0.33
* **Augmented Answer:** "Glavenus performs the 'Heated Tailspin' attack. Its tail, glowing red-hot, is swung in a wide arc, scattering sparks and fiery particles across the area..."
**Claude 3.7 Sonnet Side:**
* **Input Screenshots:** Same four images of Glavenus.
* **Prompt:** Same as GPT-4o.
* **Retrieved Paths:** A simpler diagram showing a linear sequence:
* Glavenus -> Heated Tailspin
* **Recall:** 0
* **Precision:** 0
* **Augmented Answer:** "After Glavenus initiates its Heated Tailspin attack, the monster continues through a complete dynamic rotation sequence. The attack begins with Glavenus..."
### Key Observations
* **Discrepancy in Attack Identification:** GPT-4o identifies three distinct attacks (Heated Tailspin, Slam Slice Tail Scrape, and Sword Swing), while Claude 3.7 Sonnet only identifies the Heated Tailspin.
* **Performance Metrics:** GPT-4o has a Recall of 1 and Precision of 0.33, indicating it identified at least one relevant attack but with limited accuracy. Claude 3.7 Sonnet has a Recall of 0 and Precision of 0, suggesting it failed to identify the full range of attacks.
* **Augmented Answer Quality:** Both LLMs provide descriptions of the Heated Tailspin attack. However, the red "X" marks suggest issues with the completeness or accuracy of the generated answers.
* **Diagrammatic Representation:** The diagrams representing the "Retrieved Paths" are visually distinct, with GPT-4o's diagram showing a branching structure and Claude 3.7 Sonnet's diagram showing a linear sequence.
### Interpretation
The diagram demonstrates an attempt to leverage LLMs for analyzing complex action sequences in a video game context. The differing results between GPT-4o and Claude 3.7 Sonnet highlight the challenges of accurately identifying and describing dynamic events. GPT-4o appears to be more capable of recognizing multiple attack patterns, but its lower precision suggests it may also be prone to false positives. Claude 3.7 Sonnet, while simpler in its analysis, may be more conservative in its predictions.
The low Recall and Precision scores, coupled with the red "X" marks, indicate that the current approach is not yet fully reliable. Further refinement of the prompts, training data, or LLM architectures may be necessary to improve the accuracy and completeness of the analysis. The diagram also suggests that visual context (the screenshots) is crucial for the LLMs to perform effectively. The prompt focuses on "continues action" which may be a key factor in the LLM's ability to identify attack sequences. The branching structure in GPT-4o's diagram suggests it is attempting to model the conditional nature of Glavenus's attacks, i.e., different attacks may follow depending on the situation.
</details>
Figure 10: A sample of "Glavenus" continuous attack recognition.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Diagram: Attack Recognition
### Overview
The image presents a comparative diagram illustrating the attack recognition capabilities of two Large Language Models (LLMs): GPT-4o and Gemini 1.5 Pro. The diagram focuses on a scenario involving the character "Bazelgeuse" and attempts to predict its next action after a specific attack phase. The diagram is divided into two main sections, one for each LLM, each showing the retrieved paths and augmented answers.
### Components/Axes
The diagram consists of the following components:
* **Header:** "II: Attack Recognition" with an image of Bazelgeuse in mid-attack.
* **Central Question:** "What will Bazelgeuse doing after flying in this attack action?"
* **LLM Representations:** Two circular icons representing GPT-4o (left) and Gemini 1.5 Pro (right).
* **Retrieved Paths:** Two separate sections, one for each LLM, showing a directed graph of potential attack paths.
* **Nodes:** Representing attack phases (e.g., "Normal Phase", "Charge Attack", "Air Crush").
* **Edges:** Representing transitions between attack phases.
* **Augmented Answer:** Textual responses generated by each LLM.
* **Evaluation Metrics:** "Recall" and "Precision" values associated with each LLM's path.
* **Checkmarks:** Green checkmarks indicating successful predictions.
### Detailed Analysis or Content Details
**GPT-4o Section (Left):**
* **Input:** Bazelgeuse (icon) -> Normal Phase -> Charge Attack -> Shoulder Charge.
* **Intermediate Path:** Bazelgeuse -> Normal Phase -> Carpet Bombing.
* **Recall:** 1
* **Precision:** 0.25
* **Augmented Answer:** "Bazelgeuse will likely perform the 'Carpet Bombing' attack, dropping scales from the air as it glides over the battlefield."
**Gemini 1.5 Pro Section (Right):**
* **Input:** Bazelgeuse (icon) -> Normal Phase -> Air Crush.
* **Intermediate Path:** Bazelgeuse -> Normal Phase -> Carpet Bombing.
* **Recall:** 1
* **Precision:** 0.5
* **Augmented Answer:** "It will continue to drop more scales as part of its 'Carpet Bombing' attack."
**Common Elements:**
* Both LLMs identify "Carpet Bombing" as a likely next attack.
* Both LLMs have a "Recall" of 1.
* The central question is positioned between the two LLM sections.
* The image of Bazelgeuse is positioned at the top center of the diagram.
### Key Observations
* GPT-4o has a lower precision (0.25) compared to Gemini 1.5 Pro (0.5), suggesting that while it correctly identifies the "Carpet Bombing" attack, it has more false positives in its retrieved paths.
* Both models correctly predict the "Carpet Bombing" attack, as indicated by the checkmarks.
* The diagram visually represents the reasoning process of each LLM, showing the different paths considered before arriving at the final answer.
* The intermediate path of "Normal Phase" -> "Carpet Bombing" is common to both models.
### Interpretation
The diagram demonstrates a comparative analysis of two LLMs' ability to recognize and predict attacks in a game context. Both models successfully identify the "Carpet Bombing" attack, but Gemini 1.5 Pro exhibits higher precision, indicating a more refined understanding of the attack patterns. The "Recall" of 1 for both models suggests they are effective at identifying the correct attack when it occurs. The diagram highlights the importance of both recall and precision in evaluating the performance of LLMs in tasks requiring accurate prediction and reasoning. The visual representation of the retrieved paths provides insight into the models' decision-making processes, revealing the different attack sequences considered before arriving at the final answer. The difference in precision suggests that Gemini 1.5 Pro is better at filtering out irrelevant attack paths, leading to a more accurate prediction. The checkmarks serve as a clear indicator of successful predictions, reinforcing the effectiveness of both models in this specific scenario. The diagram suggests that LLMs can be valuable tools for understanding and predicting complex game mechanics.
</details>
Figure 11: A sample of "Bazelgeuse" attack action recognition.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Attack Recognition Comparison - GPT-4o vs. Claude 3.7 Sonnet
### Overview
This diagram compares the performance of two Large Language Models (LLMs), GPT-4o and Claude 3.7 Sonnet, in recognizing attack actions performed by a creature named "Barroth". The diagram illustrates the input, processing, and output of each model, along with metrics for recall and precision. The diagram is divided into sections for each model, showing the retrieved paths of potential attacks and the final augmented answer.
### Components/Axes
The diagram consists of the following components:
* **Header:** "II: Attack Recognition" with an image of Barroth performing an attack.
* **Input:** A question posed to both models: "Tell me what is the specific name of attack action that Barroth is performing?" accompanied by an image of Barroth.
* **Model Representations:** Icons representing GPT-4o and Claude 3.7 Sonnet.
* **Retrieved Paths (GPT-4o):** A diagram showing Barroth at the center, with arrows pointing to potential attack actions: Shoulder Charge, Head Slam, Mud Attack, and Rush.
* **Retrieved Paths (Claude 3.7 Sonnet):** Similar to GPT-4o, showing Barroth with arrows to Shoulder Charge, Head Slam, and Rush.
* **Augmented Answer (GPT-4o):** "Head Slam."
* **Augmented Answer (Claude 3.7 Sonnet):** A text block stating: "Based on the image, the Barroth appears to be performing the 'Barroth Head Slam' attack. The monster is in a stationary position with its..."
* **Performance Metrics:** Recall and Precision values for each model.
* **Validation Indicators:** Green checkmarks indicating correct identification of the attack.
### Detailed Analysis / Content Details
**GPT-4o Section:**
* **Input:** The question "Tell me what is the specific name of attack action that Barroth is performing?" is presented with an image of Barroth.
* **Retrieved Paths:** Barroth is at the center.
* Green arrow from Barroth to Shoulder Charge.
* Green arrow from Barroth to Head Slam.
* Green arrow from Barroth to Mud Attack.
* Green arrow from Barroth to Rush.
* **Augmented Answer:** "Head Slam."
* **Performance Metrics:** Recall: 1, Precision: 0.25.
* **Validation:** A green checkmark is present.
**Claude 3.7 Sonnet Section:**
* **Input:** Same question and image as GPT-4o.
* **Retrieved Paths:** Barroth is at the center.
* Green arrow from Barroth to Shoulder Charge.
* Green arrow from Barroth to Head Slam.
* Green arrow from Barroth to Rush.
* **Augmented Answer:** "Based on the image, the Barroth appears to be performing the 'Barroth Head Slam' attack. The monster is in a stationary position with its..."
* **Performance Metrics:** Recall: 1, Precision: 0.33.
* **Validation:** A green checkmark is present.
### Key Observations
* Both models correctly identify "Head Slam" as the attack.
* Claude 3.7 Sonnet provides a more detailed explanation of its reasoning, including the monster's stationary position.
* Claude 3.7 Sonnet has a higher precision score (0.33) compared to GPT-4o (0.25), suggesting it is more selective in its attack predictions.
* Both models have a recall of 1, indicating they successfully identified the correct attack when it was present in the retrieved paths.
* GPT-4o retrieves more potential attack paths (four) than Claude 3.7 Sonnet (three).
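The recall and precision values reported above are consistent with treating each retrieved path as a prediction: recall is the fraction of relevant attacks that appear among the retrieved paths, and precision is the fraction of retrieved paths that are relevant. A minimal sketch of this computation (the function name and set-based formulation are illustrative, not from the paper):

```python
def path_metrics(retrieved, relevant):
    """Set-based recall and precision over retrieved knowledge-graph paths."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # correctly retrieved paths
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# GPT-4o: four retrieved paths, one relevant ("Head Slam")
print(path_metrics(
    ["Shoulder Charge", "Head Slam", "Mud Attack", "Rush"], ["Head Slam"]))
# -> (1.0, 0.25)

# Claude 3.7 Sonnet: three retrieved paths, one relevant
r, p = path_metrics(["Shoulder Charge", "Head Slam", "Rush"], ["Head Slam"])
print(r, round(p, 2))
# -> 1.0 0.33
```

With one relevant attack, recall is 1 whenever the correct path is retrieved at all, so precision is what separates the two models here.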
### Interpretation
The diagram compares two LLMs' ability to recognize a specific attack action from an image. Both models identify the attack correctly (indicated by the green checkmarks), and a recall of 1 shows that the correct attack appears among each model's retrieved paths. They differ in precision: Claude 3.7 Sonnet retrieves fewer candidate paths and is thus less prone to false positives, while GPT-4o explores a wider range of potential attacks; this gap may stem from differences in the models' architectures, training data, or prompting strategies. Claude 3.7 Sonnet's more detailed explanation also offers insight into its reasoning process, which could be valuable for understanding its decision-making. Overall, the example highlights the potential of LLMs for visual recognition tasks, specifically identifying actions within complex scenes.
</details>
Figure 12: A sample for “Barroth” attack action recognition.