# Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
**Authors**: Mitsubishi Electric Corp., Japan
Abstract
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select “Monster Hunter: World” as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multiple modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models’ ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research. The dataset and code are available at https://github.com/wbw520/MH-MMKG.
1 Introduction
Recent multimodal large language models (MLLMs) [54], particularly closed-source ones [1, 2], have demonstrated human-like multimodal capabilities, achieving outstanding performance on benchmarks related to commonsense [42], scientific facts [55], etc. Meanwhile, for domain-specific tasks or rarely seen data [14, 9], relying solely on their own perception and built-in knowledge is inadequate for predicting accurate answers and is hardly interpretable [49, 19].
To alleviate these issues, multimodal RAG (mRAG) [57, 58] has been introduced to improve MLLM performance through heuristic [20, 10, 6] or agent-based methods [44, 33]. However, as with RAG for LLMs [18], such knowledge retrieval approaches often suffer from knowledge redundancy and low relevance [40]. As a result, knowledge graph retrieval [28, 15] has gained increasing attention. Knowledge graphs (KGs) [24], which store knowledge in a structured and comprehensive manner, offer models more context-rich information. As a substantial multimodal extension, multimodal KGs (MMKGs) [13, 38] can offer a knowledge foundation for enhancing MLLMs.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Flowchart: Zinogre Attack Prediction in Monster Hunter
### Overview
The flowchart illustrates a decision-making process for predicting possible attacks by the monster Zinogre in its Super Charged phase, using a machine learning model (MLLM) and knowledge retrieval from gameplay data. It includes battle screen visuals, action predictions, and connections to "Knowledgeable Players" via the MH-MMKG system.
### Components/Axes
1. **Top Section (Battle Screen Context)**:
- **Visuals**: Three sequential battle screens showing Zinogre in combat.
- **Text**:
- Speech bubble: *"Based on the battle screen, what are Zinogre possible continues attacks?"*
- Green checkmark: *"Zinogre is going to unleash the Counter Attack action."*
2. **Middle Section (Knowledge Retrieval)**:
- **Nodes**:
- **Input**: *"Zinogre Super Charged phase"* (with a checkmark â).
- **Action Nodes**:
- ✓ *Counter Attack* → ✓ *Fist Combo* (with continuation arrow).
- ✓ *Tail Slam* → ✓ *Back Slam* (with continuation arrow).
- ✗ *Headbutt* (with continuation arrow).
- ✗ *Back Jump* (with continuation arrow).
- ✗ *360 Spin* (with continuation arrow).
- **Flow**: Arrows connect actions to outcomes (e.g., "Counter Attack" leads to "Fist Combo" or "Tail Slam").
3. **Bottom Section (Knowledgeable Players)**:
- **Visuals**: Icons of four characters labeled *"Knowledgeable Players"*.
- **Text**: *"MH-MMKG"* (likely a model/system name) connected to the flowchart via a dotted line.
### Detailed Analysis
- **Action Validity**:
- Valid actions (✓): Counter Attack, Fist Combo, Tail Slam, Back Slam.
- Invalid actions (✗): Headbutt, Back Jump, 360 Spin.
- **Flow Logic**:
- The MLLM predicts "Counter Attack" as the immediate action.
- Based on this, the model speculates on subsequent actions in the Super Charged phase, prioritizing combos (Fist Combo) or slams (Tail Slam/Back Slam).
- Invalid actions (e.g., Headbutt) are flagged with crosses, suggesting they are contextually inappropriate.
### Key Observations
- **Contextual Accuracy**: The model avoids invalid actions (e.g., Headbutt) in the Super Charged phase, aligning with gameplay mechanics.
- **Sequential Prediction**: The flowchart emphasizes continuation attacks (e.g., Fist Combo → Tail Slam), reflecting Zinogre's aggressive playstyle.
- **Human Expertise Integration**: The MH-MMKG system links to "Knowledgeable Players," implying human expertise informs the model's predictions.
### Interpretation
The diagram demonstrates how an AI model (MLLM) leverages knowledge retrieval to simulate Zinogre's behavior in a Super Charged state. By cross-referencing gameplay data (e.g., valid/invalid actions), the model generates contextually accurate predictions. The exclusion of moves like Headbutt highlights the model's ability to filter implausible actions based on phase-specific mechanics. The integration of "Knowledgeable Players" via MH-MMKG suggests a hybrid approach, combining machine learning with human expertise to refine predictions. This system could be used for training players or analyzing monster behavior patterns in competitive gameplay.
</details>
Figure 1: Our MH-MMKG is curated by knowledgeable players. By leveraging a multi-agent retriever, the MLLM’s responses can be augmented.
One major challenge in augmenting MLLMs with MMKGs is the unavailability of robust benchmarks. Existing work primarily relies on VQA datasets [29] or web resources [5, 17] for constructing MMKGs. It is highly plausible that such knowledge has already been learned by current MLLMs, making these resources less useful in evaluating the effectiveness of MMKGs. Additionally, existing MMKGs primarily incorporate text and images, lacking more information-rich modalities, such as video [39], which reduces their suitability for increasingly complex multimodal tasks.
Another challenge is how to retrieve knowledge accurately. The mainstream approach embeds entities in MMKGs to find relevant subgraphs [23, 21, 31, 30]. However, training an effective retriever is data-intensive and unrealistic for low-resource domains. In addition, incorporating graphs about rarely seen domains into powerful closed-source models remains impractical.
This work aims to extend the ability of MLLMs to tackle domain-specific tasks by leveraging well-structured external knowledge sources. To achieve this, we build a testbed around a well-known game title [9], as illustrated in Figure 1, for two reasons. First, the visual modality exhibits a large gap from real-world scenes, as its visual world is computer graphics-generated, mostly with imaginary cultures, creatures, etc., to which MLLMs’ perception modules are not well exposed. Second, the knowledge of the game’s world can differ from that of the real world, which makes the MLLMs’ built-in knowledge less useful. In this work, we use “Monster Hunter: World” for our testbed, as the game series offers abundant knowledge about its fantasy world. Since it is one of the most popular game titles, current MLLMs have already learned some knowledge about it; we experimentally show its impact.
To this end, we construct MH-MMKG, which integrates text, images, videos, and intricate entity relations curated by knowledgeable players. Additionally, we design 238 carefully crafted question-answer pairs as a benchmark that covers various sub-tasks, including fine-grained visual cognition, conditional reasoning, etc. Importantly, the knowledge required to answer these questions is encompassed in MH-MMKG: an MLLM can answer most questions given perfect knowledge retrieval and comprehension. MH-MMKG, as well as the benchmark, offers a new dimension of challenges for the community: MH-MMKG requires flexible perception to comprehend the visual modality in an (almost) unseen domain as well as reasoning without relying on the built-in knowledge of the MLLM.
Multi-modal knowledge retrieval through graph is the foundation for tackling this new set of challenges. We thus develop a multi-agent method, which harnesses the self-searching capabilities [26, 45, 16] of MLLMs, allowing them to autonomously retrieve relevant knowledge without training. On both close- and open-source leading MLLMs, we experimentally show the effectiveness of our method, qualifying it as a strong baseline.
Our contribution lies in developing a high-quality benchmark and exploring MLLMsâ ability to find knowledge from MMKG for solving rarely seen domain-specific tasks. These tasks require not only advanced multimodal perception but also a deep understanding of complex dependencies, conditions, and rules within MH-MMKG. Our benchmark, together with the baseline, provides a strong foundation for advancing MLLMs toward real-world challenges.
2 Related Works
2.1 Multimodal Retrieval Augmented Generation
Recent literature shows a growing surge of interest in MLLMs [54]. Despite their advancements, even sophisticated models like GPT-4o face difficulty in handling domain-specific tasks [14, 9] or reasoning [49, 19]. To mitigate these issues, mRAG [57, 58] seeks to enhance AI systems by providing more reliable, comprehensive, accurate, and up-to-date knowledge from external sources [42, 11].
Heuristic mRAG often relies on predefined retrieval strategies that prioritize grounding across multiple modalities into a single primary modality [20, 10, 6, 7], which limits their capacity to deliver precise and contextually rich knowledge. Recent studies [44, 51, 33] propose more adaptable pipelines for knowledge retrieval, utilizing multi-agent cooperation [46] to harness the model’s intrinsic capacity for knowledge exploration. Despite differences in strategy, all these methods depend on retrieving unstructured external knowledge sources, leading to knowledge redundancy and a lack of precision [40]. In contrast, KGs offer a structured format for knowledge storage [24], particularly MMKGs [60], which can more efficiently support MLLMs in enhancing their responses.
2.2 Multimodal Knowledge Graph
Compared to KGs, MMKGs incorporate diverse multimodal data, making them more suitable for complex scenarios that require multimodal collaboration. The evolution from conventional KGs to MMKGs has also been extensively explored [13, 38]. Researchers often use densely annotated VQA datasets [29] or web-based resources [5, 17] to automatically construct MMKGs [35, 4]. Ongoing efforts have demonstrated their effectiveness in tasks such as VQA [34], image classification [41], cross-modal retrieval [56], etc. Additionally, some studies have focused on embedding MMKGs into feature vectors to enhance the reasoning capabilities of MLLMs, yielding impressive results [30, 27].
Notwithstanding these considerable achievements, current works primarily focus on integrating text and image modalities [38] via web data, which may already be embedded in the knowledge of MLLMs. The exploration of additional modalities, such as audio and video, as well as practical applications of recent MLLMs for complex real-world tasks, remains limited. Our meticulously crafted MH-MMKG incorporates multiple modalities alongside complex relations as a knowledge base for a domain-specific task.
2.3 Retrieve Knowledge on Graph
By offering reliable and structured information, retrieval over KGs assists LLMs in preserving better factual accuracy [28, 15]. Common approaches include retrieving relevant subgraphs [25, 22] or integrating the graph into the model’s learning process [37], techniques that have also been extended to MMKGs to improve MLLMs’ task completion capabilities [23, 30]. However, training an effective retriever is highly data-intensive, especially for modalities like video. Moreover, adding new KGs into powerful closed-source MLLMs is infeasible. Recent work aims to evoke the model’s self-searching ability to autonomously navigate toward the necessary knowledge [45, 48]. These approaches typically involve route planning [26, 45] or self-refinement mechanisms [16, 8] to improve retrieval accuracy. We extend these ideas to MMKGs and propose a multi-agent self-searching method for knowledge retrieval.
3 Datasets
“Monster Hunter” is a popular game series, and we choose “Monster Hunter: World” as our testbed to explore the potential of MLLMs in bridging visual and knowledge gaps using MMKGs.
3.1 MH-MMKG
<details>
<summary>x2.png Details</summary>

### Visual Description
## Flowchart: Rathian Game Mechanics and Relationships
### Overview
The image is a flowchart detailing the relationships, behaviors, and attack mechanics of "Rathian," a fictional creature (likely from a game or media franchise). It includes nodes for subspecies, elemental weaknesses/resistances, attack patterns, and contextual descriptions. Arrows indicate causal or associative relationships, while text boxes provide narrative context and attack captions.
### Components/Axes
- **Central Node**: "Rathian" (core subject).
- **Subspecies**:
- "Rathalos" (connected via "mated with").
- "Pink Rathian" (connected via "subspecies").
- **Elemental Relationships**:
- "Ice Element" (connected via "weakened with").
- "Fire Element" (connected via "resistant with").
- "Fire Element Weapon" (connected via "material for").
- **Attack Patterns**:
- "Triple Rush" (caption: "Rathian rushes forward, up three times...").
- "Bite" (caption: "Rathian use a lunge bite...").
- "Triple Fireballs" (caption: "Rathian throws fireballs from its mouth in a rapid sequence...").
- **Contextual Descriptions**:
- "Context: Rathian perches on the ground with powerful legs. Her flame attacks..."
- "Context: if hunter hit by this attack,..."
### Detailed Analysis
- **Rathian** is centrally positioned, with arrows radiating to:
- **Rathalos** (top-left): Labeled "mated with."
- **Pink Rathian** (left): Labeled "subspecies."
- **Ice Element** (bottom-left): Labeled "weakened with."
- **Fire Element** (bottom-center): Labeled "resistant with."
- **Fire Element Weapon** (bottom-right): Labeled "material for."
- **Attack Nodes** (right side):
- **Triple Rush**: Linked via "attack of" and "continues."
- **Bite**: Linked via "attack of" and "when angry."
- **Triple Fireballs**: Linked via "attack of."
- **Text Boxes**:
- Captions describe specific attack animations (e.g., "Rushes forward, up three times").
- Contextual snippets provide narrative triggers (e.g., "if hunter hit by this attack...").
### Key Observations
1. **Hierarchical Relationships**: Rathian's subspecies (Rathalos, Pink Rathian) and elemental interactions (Ice/Fire) are foundational to its behavior.
2. **Attack Sequencing**: The "Triple Rush" and "Triple Fireballs" suggest a progression in combat tactics, with "Bite" acting as an aggressive trigger.
3. **Elemental Weaknesses**: Rathian is vulnerable to Ice but resistant to Fire, influencing weapon/material choices.
4. **Narrative Integration**: Contextual text boxes tie mechanics to in-game storytelling (e.g., "flame attacks" hinting at fire-based abilities).
### Interpretation
This diagram serves as a strategic guide for players, illustrating how Rathian's biology, weaknesses, and attacks interconnect. The emphasis on Fire Element resistance and Ice vulnerability suggests players should prioritize Ice-based weapons. The attack sequences (e.g., "Triple Rush" leading to "Bite") imply a pattern of movement and aggression, requiring players to anticipate and counter these moves. The inclusion of Rathalos and Pink Rathian hints at a broader ecosystem, where mating and subspecies dynamics may influence gameplay (e.g., breeding mechanics or territorial behavior). The flowchart's structure prioritizes actionable data over abstract theory, aligning with typical game design documentation focused on player strategy.
</details>
Figure 2: The subgraph for monster “Rathian” in MH-MMKG.
Our MH-MMKG is an attribute-based MMKG [60] as shown in Figure 2, where names of elements that appear in the game, such as monsters and attack actions, are treated as entities. A pair of entities can be linked with a particular relation, forming an edge between them. An entity may come with some additional information, i.e., videos related to the entity and textual context to supply more details on the entity, both of which are treated as attributes. Three knowledgeable players were recruited to build the knowledge graph: each player built subgraphs for single monsters. Figure 2 illustrates the subgraph for “Rathian,” showing intricate relationships such as attack conditions, combos, and inter-subgraph connections. We collected 22 subgraphs, which were merged together to make MH-MMKG.
Let $\mathcal{G}=(\mathcal{E},\mathcal{V},\mathcal{R})$ denote our whole KG, where $\mathcal{E}$ is the set of entities, $\mathcal{V}$ is the set of edges, and $\mathcal{R}$ is the set of relations. $\mathcal{E}$ consists of two types of entities: 1) monsters in $\mathcal{E}_{\text{o}}$ and 2) other entities in $\mathcal{E}_{\text{a}}$. An edge $v=(e,r,e^{\prime})\in\mathcal{V}$ represents that $e\in\mathcal{E}$ has relation $r\in\mathcal{R}$ with $e^{\prime}\in\mathcal{E}$. The attribute associated with an entity $e$ is denoted as $A(e)=(c,u)$, which comprises a video clip $c$ and/or textual context $u$. A video clip describes the entity $e$, recorded at 60 fps in 4K resolution. As current MLLMs are not able to handle videos directly, the dataset also offers human-selected keyframes (no more than 10 frames) as well as human-written captions about $c$. The knowledgeable players who curated the dataset also selected the keyframes and wrote the captions. The textual context $u$ provides knowledge about the entity that is not included in $c$. We denote the set of all attributes as $\mathcal{A}=\{A(e)|e\in\mathcal{E}\}$.
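The definitions above can be sketched as a small data model. This is an illustrative assumption only: the class and field names (`KnowledgeGraph`, `Attribute`, etc.) are ours and not part of the released dataset.

```python
# Minimal sketch of the MH-MMKG data model: G = (E, V, R) plus attributes
# A(e) = (c, u). Names are illustrative, not from the official release.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Attribute:
    clip: Optional[str] = None                        # video clip c (e.g., a path)
    context: Optional[str] = None                     # textual context u
    keyframes: list = field(default_factory=list)     # human-selected, <= 10 frames
    caption: Optional[str] = None                     # human-written caption of c

@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)        # E
    relations: set = field(default_factory=set)       # R
    edges: set = field(default_factory=set)           # V: triples (e, r, e')
    attributes: dict = field(default_factory=dict)    # A: entity -> Attribute

    def add_edge(self, e: str, r: str, e2: str) -> None:
        self.entities.update({e, e2})
        self.relations.add(r)
        self.edges.add((e, r, e2))

    def neighbors(self, e: str):
        """All (relation, entity) pairs reachable from e in one hop."""
        return [(r, e2) for (src, r, e2) in self.edges if src == e]

g = KnowledgeGraph()
g.add_edge("Rathian", "attack of", "Triple Rush")
g.add_edge("Triple Rush", "continues", "Bite")
```

A monster entity would additionally carry an `Attribute` with its clip, keyframes, and caption; the query-time retriever only ever needs `neighbors` and the attribute lookup.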
Table 1: The six sub-tasks in our MH benchmark.
| Sub-task | Description | # Queries |
| --- | --- | --- |
| I: Individual Information | Retrieve the textual information of a monster, e.g., nickname, habitat, and skill mechanics. | 24 |
| II: Attack Recognition | Recognize a monster's upcoming, ongoing, or completed attack action. | 109 |
| III: Combo Premonition | Predict upcoming attack sequences based on the monster's current action or previous action track. | 28 |
| IV: Condition Awareness | Detect the status of monsters or the surrounding environment, e.g., anger, terrain, or phase changes, to anticipate future battle. | 29 |
| V: Proc Effect Insight | Analyze how effects such as the environment and attacks influence a monster's status and movement patterns. | 35 |
| VI: Cross Monster Analysis | Compare attack patterns or behaviors across different monsters to optimize hunting strategies. | 13 |
<details>
<summary>x3.png Details</summary>

### Visual Description
## Flowchart: AI Decision-Making Process for Game Character Attack Actions
### Overview
The diagram illustrates a multi-stage AI decision-making system for determining attack actions in a game, using "Zinogre" as a case study. It shows how the AI processes input (caption/perceiver output), validates knowledge, expands attack phases, and resolves actions through multi-agent retrieval. The flowchart includes validation checks, phase expansions, and action resolution with success/failure indicators.
### Components/Axes
1. **Input Processing**
- **Caption**: "Zinogre raises its right claw, move it to the left part of the body and put it firmly against the ground on left..."
- **Perceiver**: Icon with speech bubble, connected to input text
- **Question Node**: "Tell me what will happen next within this attack action?"
2. **Knowledge Retrieval & Validation**
- **Retrieved Knowledge**: Text block with "Zinogre will jump and slams the ground..."
- **Topic Selection**: Node with Zinogre images and "Stygian Zinogre" label
- **Validation**: Icon with checkmark (✓) and cross (✗) indicators
- **Expansion**: Nodes for "Charging Phase", "Charged Phase", "Super Charged"
3. **Action Resolution**
- **Multi-agents Retriever**: Central node with robot icons
- **Action Options**:
- â "Double Slam" (sufficient)
- â "Headbutt", "Devour", "Heavy Paw Slam"
- **Phase Progression**: Arrows showing correct (green) vs incorrect (red) paths
4. **Output**
- **Summarizer**: Icon with checkmark, outputs final action description
### Detailed Analysis
- **Textual Elements**:
- All labels in English (no other languages detected)
- Key terms: "Zinogre", "Stygian Zinogre", "Charging Phase", "Attack of", "Devour", "Heavy Paw Slam", "Double Slam"
- Validation indicators: Green checkmarks (✓) for correct paths, red crosses (✗) for incorrect
- **Flow Connections**:
- Input â Topic Selection â Validation â Expansion â Multi-agents Retriever â Action Resolution
- Correct paths (green arrows) lead to "Double Slam" resolution
- Incorrect paths (red arrows) lead to alternative actions
- **Visual Elements**:
- Robot icons representing AI agents
- Game creature illustrations (Zinogre, Stygian Zinogre)
- Action symbols (swords, claws, paw prints)
- Phase progression indicators (charging icons)
### Key Observations
1. **Validation Criticality**: 70% of paths show validation failures (red crosses), emphasizing strict knowledge verification
2. **Phase Complexity**: Three distinct charging phases shown, with "Super Charged" as the final expansion
3. **Action Resolution**: Only "Double Slam" marked as sufficient (green check), others rejected
4. **Multi-Agent System**: Three robot icons suggest collaborative decision-making process
### Interpretation
The diagram demonstrates a hierarchical AI system where:
1. **Input Processing** converts textual descriptions into perceptual data
2. **Knowledge Validation** acts as a gatekeeper, rejecting 70% of potential actions through cross-referencing
3. **Phase Expansion** builds complexity through iterative charging states
4. **Multi-Agent Retrieval** combines specialized agents to resolve actions, with "Double Slam" emerging as the validated outcome
The system's design prioritizes accuracy over speed, with multiple validation checkpoints. The final action resolution shows domain-specific knowledge integration, where only actions matching both perceptual input and validated knowledge are accepted. The "Double Slam" resolution suggests this is the most contextually appropriate action given the Zinogre's described posture and attack pattern.
</details>
Figure 3: Our method first converts the query media into a textual caption. Next, a multi-agent self-search mechanism retrieves relevant knowledge associated with the query. Finally, the retrieved knowledge is utilized to enhance the model's response.
<details>
<summary>images/fast-forward-button.png Details</summary>

### Visual Description
## Icon/Symbol: Triple Right-Chevron Sequence
### Overview
The image displays three identical right-pointing chevrons (">") arranged horizontally in a staggered, overlapping sequence. All elements share uniform visual properties with no additional contextual elements.
### Components/Axes
- **Primary Elements**: Three chevrons with identical geometry
- **Color**: Solid yellow (#FFD700) with no gradients or patterns
- **Arrangement**: Horizontally aligned with 20% horizontal overlap between adjacent chevrons
- **Spacing**: Equal 10px gaps between chevron bases
- **Background**: Pure white (#FFFFFF) with no texture or additional elements
### Detailed Analysis
1. **Chevron Geometry**:
- Base width: 100px
- Height: 80px
- Angle: 45° from horizontal
- Stroke width: 0px (solid fill)
- Corner radius: 0px (sharp angles)
2. **Positioning**:
- Leftmost chevron: Origin at (0, 40)
- Middle chevron: Origin at (70, 34)
- Rightmost chevron: Origin at (140, 28)
- All chevrons maintain vertical alignment at their base
3. **Color Analysis**:
- Yellow hue: RGB(255, 215, 0)
- No transparency or opacity variations
- No shadow or outline effects
### Key Observations
- Perfect geometric repetition with no variation in size or orientation
- Consistent spacing creates rhythmic visual progression
- Overlapping creates illusion of depth despite 2D nature
- No interactive elements or hover states present
### Interpretation
This icon likely represents:
1. **Progression**: The rightward direction suggests forward movement or advancement
2. **Multi-stage Process**: Triple repetition implies a three-step sequence or tiered system
3. **Attention Signal**: Yellow color often denotes caution or importance in UI design
4. **Digital Interface Element**: Clean design suggests use as a button, indicator, or navigation element
The absence of text or additional context leaves the exact purpose ambiguous, but the design strongly implies directional flow or sequential action within a digital interface. The chevron repetition could indicate pagination, multi-step forms, or a "next" action in a cyclical process.
</details>
means the continuation of the search, and
<details>
<summary>images/sd-card.png Details</summary>

### Visual Description
## Icon: Filing Cabinet Symbol
### Overview
The image depicts a minimalist black icon representing a filing cabinet. It features a rectangular body with three vertically aligned rectangular slots at the top, resembling document storage compartments. No text, labels, or numerical data are present.
### Components/Axes
- **Primary Shape**: Solid black rectangle with rounded corners.
- **Slots**: Three vertical white rectangles at the top, spaced evenly.
- **Absent Elements**: No legends, axes, annotations, or textual labels.
### Detailed Analysis
- **Color**: Entirely black fill with no gradients or patterns.
- **Slots**: Uniformly sized and positioned, suggesting standardized storage units.
- **Negative Space**: White slots contrast sharply with the black body, emphasizing functionality.
- **No Data/Text**: No embedded values, categories, or legends to extract.
### Key Observations
1. The design prioritizes simplicity and recognizability over detail.
2. The three slots may symbolize modular storage, categorization, or hierarchical organization.
3. Lack of text or numerical data implies the icon is intended for universal understanding (e.g., UI symbol).
### Interpretation
This icon likely represents document management, archival systems, or organizational tools. The absence of text aligns with standard iconography practices, where visual cues (e.g., slots for papers) convey meaning without language. The black color could signify neutrality, authority, or digital minimalism. No trends or anomalies exist due to the static, non-data nature of the image.
</details>
represents the knowledge aggregated at the current entity.
3.2 MH Benchmark
We also designed 238 carefully crafted queries involving knowledge within MH-MMKG. They cover six challenging sub-tasks, as shown in Table 1 (some samples are available in Figure 5). As shown in Figure 3, each input query $Q=(q,d,z)$ consists of 1) a question $q$, 2) a video or images $d$, which serves as a visual reference for $q$, and 3) auxiliary information $z$, which contains the relevant monster's name $e\in\mathcal{E}_{\text{o}}$ as well as additional information that cannot be inferred purely from $d$. The auxiliary information $z$ provides additional context to an MLLM for finding the correct subgraph. By default, we assume that $z$ is available, as it facilitates knowledge retrieval, and in real scenarios players generally know it (we analyze its impact in the supplementary material). Note that the visual data in $Q$ is different from MH-MMKG's.
To answer $q$ without built-in knowledge about the game's world, a model needs to go through multiple entities and edges of $\mathcal{G}$. For example, to answer the question “Which attack follows?” given a visual context in which the monster Rathian performs the attack action Triple Rush, the required knowledge is:
$$
\displaystyle\textit{Rathian}\xrightarrow{\text{attack of}}\textit{Triple Rush}\xrightarrow{\text{continues}}\textit{Bite}, \tag{1}
$$
as shown in Figure 2. Therefore, finding the correct paths over $\mathcal{G}$ is essential for this task. Each query is annotated with a subgraph $\mathcal{I}$ of $\mathcal{G}$ as ground-truth knowledge for answering $q$, textual descriptions $s$ of $d$ written by the knowledgeable game players, and the ground-truth answer $y$. $\mathcal{I}$ comes with a root entity $e_{0}$, and each path from $e_{0}$ to a leaf is regarded as a piece of knowledge. The details of MH-MMKG and the MH benchmark are available in the supplementary material.
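Since each root-to-leaf path in $\mathcal{I}$ counts as one piece of knowledge, enumerating them is straightforward. A minimal sketch, assuming the annotated subgraph is a tree rooted at $e_{0}$ and stored as (head, relation, tail) triples; the function name and triple format are our own, not the benchmark's API:

```python
# Enumerate root-to-leaf paths of a subgraph I; each returned path alternates
# entities and relations, e.g., [e0, r1, e1, r2, e2]. Illustrative sketch only.
def root_to_leaf_paths(edges, root):
    children = {}
    for head, rel, tail in edges:
        children.setdefault(head, []).append((rel, tail))
    paths, stack = [], [(root, [root])]
    while stack:
        node, path = stack.pop()
        if node not in children:        # leaf: this path is one knowledge unit
            paths.append(path)
            continue
        for rel, tail in children[node]:
            stack.append((tail, path + [rel, tail]))
    return paths

# The Eq. (1) example: Rathian --attack of--> Triple Rush --continues--> Bite
subgraph = [("Rathian", "attack of", "Triple Rush"),
            ("Triple Rush", "continues", "Bite")]
paths = root_to_leaf_paths(subgraph, "Rathian")
```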
3.3 Task Definition
We define three task variants over MH benchmark with different levels of supplementary information.
Knowledgeable (Know.) variant simulates a model with sufficient perception for the game's visual domain and sufficient (built-in) knowledge about the world to answer the given question. This idealized setting assesses how well the model reasons to the answer $\hat{y}$ given perfect external knowledge and perception. Together with $Q$ and $\mathcal{G}$, we provide the model with 1) the annotated subgraph $\mathcal{I}$, 2) the textual description $s$ of $Q$'s visual reference $d$, and 3) the set of all human-written captions for clips $c$ in $\mathcal{G}$, so that it can access the ones associated with entities in $\mathcal{I}$ without visual perception, i.e.:
$$
\hat{y}=M(Q,s,\mathcal{I},\mathcal{G},\mathcal{A}). \tag{2}
$$
Perceptive variant also assumes a model's sufficient perception but without knowledge to answer the question. This setting evaluates the model's knowledge retrieval ability (i.e., to find $\mathcal{I}$) and knowledge comprehension ability. The model is supplied with human-written captions for all clips $c$ so that it can find textual descriptions associated with entities, and is required to output both the final response $\hat{y}$ and the retrieved $\hat{\mathcal{I}}$:
$$
\hat{y},\hat{\mathcal{I}}=M(Q,s,\mathcal{G},\mathcal{A}). \tag{3}
$$
Unaided is the most challenging variant, where the model relies entirely on its own visual perception (for both $d$ and $c$) to interpret the visual world and retrieve external knowledge. As with the perceptive variant, the outputs are both $\hat{y}$ and $\hat{\mathcal{I}}$. This variant is formulated as:
$$
\hat{y},\hat{\mathcal{I}}=M(Q,\mathcal{G}). \tag{4}
$$
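The three variants differ only in which inputs reach the model $M$ (Eqs. 2-4). A hedged sketch of that dispatch, where `M` is a placeholder for any MLLM call and the keyword names are our own illustration:

```python
# Dispatch the three MH benchmark variants to a model M (illustrative only).
def run_variant(M, variant, Q, G, A=None, s=None, I=None):
    if variant == "knowledgeable":   # Eq. (2): perfect knowledge and perception
        return M(Q, s=s, subgraph=I, graph=G, attributes=A)
    if variant == "perceptive":      # Eq. (3): perception given; model finds I
        return M(Q, s=s, graph=G, attributes=A)
    if variant == "unaided":         # Eq. (4): raw perception plus retrieval
        return M(Q, graph=G)
    raise ValueError(f"unknown variant: {variant}")

# A dummy M that just reports which inputs it received:
M = lambda Q, **kw: sorted(kw)
```

In the perceptive and unaided settings the model is expected to return both $\hat{y}$ and $\hat{\mathcal{I}}$, which the dummy `M` above does not model.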
4 Method
Figure 3 illustrates the overall pipeline of our method to address our task. It is designed as a baseline method for the MH benchmark, and we describe it for the unaided variant, though it can be easily adapted to the others. Given an input query $Q$, the visual reference $d\in Q$ is first converted into a textual caption by a perceiver. Then, a multi-agent retriever, consisting of topic selection, validation, and expansion agents, retrieves relevant knowledge for $Q$. Finally, a summarizer generates $\hat{y}$ using the retrieved knowledge.
The perceiver $P$ is designed to transform $d$ in $Q$ into a textual description. While retaining the original visual data could provide richer information, on-the-fly perception in the visual modality can incur high computational costs; we therefore convert $d$ into text as a more efficient representation. The transformed text is denoted as $\hat{s}=P(Q)$. We detail the multi-agent retriever and summarizer in the following sections.
4.1 Multi-agent Retriever
We design a fully automated search algorithm with three agents to find the subgraph of $\mathcal{G}$ needed to answer $q$. First, the topic selection agent $L$ analyzes the input question $q$ to identify the topic entity $e_{0}=L(q,z,\mathcal{E}_{\text{o}})$, which serves as the root for knowledge retrieval. Then, the expansion and validation agents are alternately activated to grow the subgraph with breadth-first search, starting from $e_{0}$, to include the necessary knowledge. The expansion agent finds all plausible neighboring entities for each open entity. The validation agent, in turn, validates whether the current path from the root to each open entity is sufficient to answer $q$.
We denote the subgraph of $\mathcal{G}$ after the $t$-th round by $\mathcal{K}_{t}=(\mathcal{E}^{\prime}_{t},\mathcal{V}^{\prime}_{t})$, where $\mathcal{E}^{\prime}_{t}\subseteq\mathcal{E}$, $\mathcal{V}^{\prime}_{t}\subseteq\mathcal{V}$, and $\mathcal{K}_{0}=(\{e_{0}\},\emptyset)$. We also denote the set of open entities after the $t$-th round by $\mathcal{O}_{t}\subseteq\mathcal{E}^{\prime}_{t}$, which is a (sub)set of the entities newly added at the $t$-th round and remains to be expanded further.
For the $(t+1)$-th round, the expansion agent $W$ retrieves the set $\mathcal{N}(e)$ of all plausible neighboring entities for each open entity $e\in\mathcal{O}_{t}$ as:
$$
\displaystyle\mathcal{N}(e)=W(e,\mathcal{K}_{t};Q,\mathcal{G},\mathcal{A}), \tag{5}
$$
where $W$ is an MLLM with a designated prompt (detailed in the supplementary material) to judge if the knowledge aggregated over the path on $\mathcal{G}$ from $e_{0}$ to each $e^{\prime}\in\{e^{\prime}|(e,e^{\prime})\in\mathcal{V}\}$ is useful to answer $q$ . $\mathcal{N}(e)$ is a set of entities that are judged to be useful. To build $\mathcal{K}_{t+1}=(\mathcal{E}^{\prime}_{t+1},\mathcal{V}^{\prime}_{t+1})$ , we first aggregate all neighboring entities and add them to $\mathcal{E}^{\prime}_{t}$ as:
$$
\displaystyle\mathcal{E}^{\prime}_{t+1}=\mathcal{E}^{\prime}_{t}\cup\{e^{\prime}|e\in\mathcal{O}_{t},e^{\prime}\in\mathcal{N}(e)\}, \tag{6}
$$
where duplicated entities are removed to make entities in $\mathcal{E}^{\prime}_{t+1}$ unique. $\mathcal{V}^{\prime}_{t+1}$ includes additional links to newly added entities, given by:
$$
\displaystyle\mathcal{V}^{\prime}_{t+1}=\mathcal{V}^{\prime}_{t}\cup\{\nu(e,e^{\prime})|e\in\mathcal{O}_{t},e^{\prime}\in\mathcal{N}(e)\}, \tag{7}
$$
where $\nu(e,e^{\prime})$ gives $v\in\mathcal{V}$ identified by $(e,e^{\prime})$ .
The validation agent $U$ checks whether each path from $e_{0}$ to $e\in\mathcal{O}_{t+1}$ provides sufficient knowledge to answer $q$ :
$$
\displaystyle o(e)=U(e,\mathcal{K}_{t+1};Q,\mathcal{G},\mathcal{A}), \tag{8}
$$
where $U$ again is an MLLM with a carefully designed prompt. $o(e)=\textit{Yes}$ if the knowledge is sufficient and no more expansion is needed; otherwise, $o(e)=\textit{No}$ and the expansion continues. In the unaided-online setting, $U$ also outputs a caption (for $c$) of the entity $e$.
The open set $\mathcal{O}_{t+1}$ is basically the set of entities newly added in the $(t+1)$-th round. If a newly added entity $e$ is already in $\mathcal{E}^{\prime}_{t}$ and also in $\mathcal{O}_{t}$, the subgraph forms a loop. In this case, $e$ does not need further expansion, as it is already in $\mathcal{E}^{\prime}_{t}$. Meanwhile, $e$ may be included in the neighbor sets of multiple entities in $\mathcal{O}_{t}$. In this case, the subgraph also forms a loop, but $e$ has not yet been explored, so it should be in $\mathcal{O}_{t+1}$. Taking these cases into account, $\mathcal{O}_{t+1}$ is given by:
$$
\displaystyle\mathcal{O}_{t+1}=\mathcal{E}^{\prime}_{t+1}\backslash\mathcal{E}^{\prime}_{t}, \tag{9}
$$
where the operator "$\backslash$" represents set subtraction.
This multi-agent retrieval runs until no open entity is found (i.e., $\mathcal{O}_{t}=\emptyset$ ). The retrieved subgraph is given by:
$$
\hat{\mathcal{I}}=\mathcal{K}_{t}. \tag{10}
$$
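The retrieval loop above can be sketched as follows. This is an illustrative simplification: the two agent callables stand in for prompted MLLM invocations ($W$ and $U$), and $\mathcal{G}$ is reduced to a plain adjacency map; validation is applied to each open entity before its expansion, which approximates the alternating schedule described above.

```python
def retrieve_subgraph(e0, graph, expansion_agent, validation_agent):
    """Sketch of the multi-agent retrieval loop (Sec. 4.1).

    graph            : maps each entity to its neighboring entities in G
    expansion_agent  : (e, path) -> neighbors judged useful, i.e., N(e)
    validation_agent : (e, path) -> True when the path from e0 to e already
                       suffices to answer q, i.e., o(e) = Yes
    Both agents are stand-ins for prompted MLLM calls.
    """
    entities = {e0}          # E'_t
    edges = set()            # V'_t
    open_set = {e0}          # O_t
    paths = {e0: [e0]}       # path from the root e0 to each entity

    while open_set:          # run until no open entity remains
        new_entities = set()
        for e in open_set:
            if validation_agent(e, paths[e]):
                continue     # o(e) = Yes: this branch needs no expansion
            for e2 in expansion_agent(e, paths[e]):
                if e2 not in graph.get(e, []):
                    continue  # keep only genuine neighbors in G
                edges.add((e, e2))
                if e2 not in entities:   # Eq. (9): O_{t+1} = E'_{t+1} \ E'_t
                    new_entities.add(e2)
                    paths[e2] = paths[e] + [e2]
        entities |= new_entities
        open_set = new_entities
    return entities, edges, paths
```

Termination is guaranteed because each round only adds entities not yet in $\mathcal{E}^{\prime}_{t}$, so the open set eventually empties even when the subgraph contains loops.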
4.2 Reasoning via Knowledge Augmentation
For answer reasoning, we aggregate the knowledge on $\hat{\mathcal{I}}$ and represent it in text, which is fed into an LLM. The knowledge augments the reasoning process: the LLM does not need to rely on built-in knowledge about the game's world; (almost) pure reasoning ability suffices. This is formally denoted by:
$$
\hat{y}=\text{MLLM}(Q,\aleph(\hat{\mathcal{I}},\mathcal{G},\mathcal{A},\alpha)), \tag{11}
$$
where $\hat{y}$ is the predicted answer for $Q$ , and $\aleph$ transforms each path from $e_{0}$ to each leaf of $\hat{\mathcal{I}}$ into text given the entire knowledge (i.e., $\mathcal{G}$ and $\mathcal{A}$ ). The parameter $\alpha$ limits the number of paths in $\hat{\mathcal{I}}$ used for reasoning (default: 5).
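A minimal sketch of how $\aleph$ might serialize paths into the summarizer's prompt; `relation_of` is a hypothetical lookup of the edge label between adjacent entities (in practice drawn from $\mathcal{G}$ and $\mathcal{A}$), and the truncation to $\alpha$ paths mirrors the default above.

```python
def paths_to_text(paths, relation_of, alpha=5):
    """Serialize up to `alpha` root-to-leaf paths into plain text
    (a simplified stand-in for the transform in Eq. (11)).

    paths       : list of entity sequences [e0, ..., leaf]
    relation_of : (a, b) -> name of the edge between adjacent entities
    """
    lines = []
    for path in paths[:alpha]:                  # alpha caps the number of paths
        parts = [str(path[0])]
        for a, b in zip(path, path[1:]):
            parts.append(f"--{relation_of(a, b)}--> {b}")
        lines.append(" ".join(parts))
    return "\n".join(lines)
```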
5 Results
Table 2: The experimental settings of query and knowledge retrieval. "Vision" indicates that $d$ is used, while "H-Cap." uses $s$. For MMKG, "Path" means a model can access the annotated subgraphs $\mathcal{I}$, and "H-Cap." means a model uses human-written captions (for $c$) in $\mathcal{A}$. "Vis.-Off." and "Vis.-On." indicate how a model's captions are generated (offline and online, respectively).
| Methods | Query: Vision | Query: H-Cap. | MMKG: Path | MMKG: H-Cap. | MMKG: Vis.-Off. | MMKG: Vis.-On. |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | ✓ | | | | | |
| Vanilla + | | ✓ | | | | |
| Know. | | ✓ | ✓ | ✓ | | |
| Perceptive | | ✓ | | ✓ | | |
| Unaided-Offline | ✓ | | | | ✓ | |
| Unaided-Online | ✓ | | | | | ✓ |
Table 3: Experimental results for both leading closed-source and open-source MLLMs. The Vanilla, Vanilla +, and Knowledgeable (Know.) experiments are evaluated solely on Acc., while all other experiments are assessed on both Acc. and knowledge consistency. Results that improve in the Online setting over the Offline one are highlighted in light green.
| Models | Vanilla Acc. | Vanilla + Acc. | Know. Acc. | Perceptive Acc. | Pre. | Rec. | Unaided-Off. Acc. | Pre. | Rec. | Unaided-On. Acc. | Pre. | Rec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [1] | .3122 | .3924 | .8565 | .7383 | .5061 | .7046 | .4050 | .2595 | .4416 | .5105 | .2756 | .5625 |
| GPT-4o mini [1] | .3218 | .4135 | .8481 | .6877 | .2963 | .5450 | .4514 | .2059 | .5028 | .3544 | .1626 | .3009 |
| Claude 3.7 Sonnet [2] | .2827 | .3375 | .8987 | .7004 | .5817 | .6322 | .3628 | .2775 | .3270 | .4388 | .2911 | .4029 |
| Claude 3.5 Sonnet [2] | .2869 | .3755 | .8776 | .7215 | .5922 | .6800 | .3966 | .3215 | .4008 | .3966 | .2330 | .3270 |
| Claude 3.5 Haiku [2] | .2356 | .3206 | .8823 | .6455 | .3739 | .5007 | .3544 | .2002 | .3164 | .3670 | .1735 | .3361 |
| Gemini 2.0 Flash [47] | .1983 | .2995 | .8438 | .6919 | .3507 | .6146 | .3839 | .1703 | .4092 | .3713 | .1515 | .3663 |
| Gemini 1.5 Pro [47] | .2194 | .2700 | .8438 | .6962 | .4761 | .6033 | .3164 | .1615 | .2194 | .4050 | .2122 | .4585 |
| Step-1o [43] | .2436 | .2815 | .8235 | .5747 | .4372 | .5095 | .3025 | .1831 | .2483 | .3403 | .2204 | .2987 |
| InternVL2.5-78B-MPO [12] | .1603 | .2616 | .8649 | .5991 | .4198 | .5428 | .2700 | .1729 | .2250 | .3080 | .1556 | .2378 |
| Qwen2.5-VL-72B [3] | .1476 | .2616 | .8734 | .6244 | .4602 | .4908 | .3206 | .2139 | .2383 | .3164 | .1615 | .1814 |
| Ovis2-16B [36] | .1645 | .2573 | .8902 | .7046 | .5407 | .5949 | .2869 | .1963 | .2383 | .3459 | .1853 | .2878 |
| MiniCPM-o-2.6 [53] | .1139 | .2405 | .8312 | .4683 | .3183 | .4001 | .2194 | .1311 | .2376 | .1687 | .0189 | .0210 |
| DeepSeek-VL2-Small [52] | .1139 | .2362 | .6455 | .3586 | .1419 | .1708 | .1814 | .0400 | .0759 | .1181 | .0042 | .0042 |
| Human (Knowledgeable) | .5252 | – | – | – | – | – | – | – | – | .9033 | .9207 | .8535 |
| Human (Random) | .0336 | – | – | – | – | – | – | – | – | .6092 | .7113 | .6457 |
5.1 Experimental Settings
We evaluated both leading closed-source models (accessed via API: gpt-4o-2024-11-20 and gpt-4o-mini-2024-07-18 [1]; claude-3-7-sonnet-20250219, claude-3-5-sonnet-20240620, and claude-3-5-haiku-20241022 [2]; gemini-1.5-pro-002 and gemini-2.0-flash-001 [47]; and step-1o-vision-32k [43]) and open-source models (InternVL2.5-78B-MPO [12], Qwen2.5-VL-72B [3], Ovis2-16B [36], and DeepSeek-VL2 [52]) with our baseline on the MH Benchmark. A single model serves as all agents and functionalities (i.e., the perceiver, the summarizer, and the topic selection, validation, and expansion agents in the multi-agent retriever) via different prompts. We use images solely as visual references because current MLLMs do not support videos as input. Due to input size limitations in most models, all images are resized to 1K resolution for our experiments.
Evaluation Metrics. We use accuracy (Acc.) to measure the correctness of predicted answers $\hat{y}$ against the ground-truth answers $y$. Since the questions in our benchmark follow an open-ended question-answering format, we employ GPT-4o as a judge with few-shot samples [59]. Knowledge consistency measures the agreement between $\hat{\mathcal{I}}$ and $\mathcal{I}$ using precision (Pre.) and recall (Rec.), demonstrating the efficiency of retrieving relevant knowledge. The human evaluation of GPT-4o as a judge and the metric definitions are given in the supplementary material.
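Knowledge consistency can be illustrated, for instance, as precision and recall over edge sets of the retrieved and annotated subgraphs; the exact definition is in the supplementary material, so the computation below is an assumed stand-in.

```python
def knowledge_consistency(retrieved_edges, gold_edges):
    """Precision/recall between the retrieved subgraph and the annotated
    subgraph, computed over edge sets (an illustrative definition)."""
    retrieved, gold = set(retrieved_edges), set(gold_edges)
    hits = len(retrieved & gold)                 # correctly retrieved edges
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

For example, retrieving four edges of which two are annotated (and the annotation contains only those two) gives precision 0.5 and recall 1.0, matching the tendency reported below that models over-retrieve: recall stays high while precision drops.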
5.2 Performance of MLLMs
In addition to the three tasks defined in Section 3.3, we evaluate the models' performance without external knowledge (vanilla). Vanilla + isolates the effect of the model's visual ability. Furthermore, the unaided task includes two variants: 1) the offline variant relies on pre-extracted captions (for all $c$ in the MMKG) generated by an MLLM without access to the query $Q$ or any information obtained during knowledge retrieval; 2) the online variant generates captions for $c$ during knowledge retrieval (i.e., in $U$). Table 2 summarizes the experimental setups. We also include human performance for the vanilla and unaided-online settings. We recruited two voluntary participants: one is knowledgeable about the game, and the other has not played the game series. They first answered the questions only with $Q$ and were then allowed to look into $\mathcal{G}$.
As shown in Table 3, in the vanilla setting, all methods hardly predicted correct answers. GPT-4o achieves the best performance with an accuracy of 0.31, whereas the knowledgeable human attains 0.53. The closed-source models generally outperform open-source ones, implying that the former have richer built-in knowledge about the game. In the vanilla + setting, which provides human captions $s$ instead of images $d$, most models' performance improved, indicating that MH Benchmark requires comprehension of visual context. In the knowledgeable experiment, the models achieved accuracies of roughly 0.9. We thus argue that MH-MMKG provides sufficient knowledge to answer the questions in MH Benchmark.
Figure 4: Performance comparison of GPT-4o, Claude 3.7 Sonnet, and Gemini 1.5 Pro across 6 sub-tasks in MH Benchmark.
[Figure 5 image: one panel per sub-task (I Individual Information, II Action Recognition, III Combo Premonition, IV Condition Awareness, V Proc Effect Insight, VI Cross Monster Analysis), each showing the question, the vanilla answer, the retrieved paths with precision and recall, and the augmented answer.]
Figure 5: Examples of the 6 sub-tasks in the MH Benchmark, each generated by GPT-4o for both the Vanilla Answer and the Augmented Answer using an unaided-online retrieval.
The perceptive, unaided-offline, and unaided-online settings evaluate a model's ability to retrieve relevant knowledge. In the perceptive setting, which uses human captions for both queries and knowledge, all models demonstrated a strong capacity to identify relevant knowledge, significantly improving performance compared to vanilla +. Additionally, the models exhibit high knowledge consistency. In the unaided-offline setting, all models performed worse than in the perceptive setting, suggesting that visual perception plays a crucial role in knowledge retrieval. Nevertheless, they are still better than in the vanilla setting, demonstrating the benefits of knowledge retrieval. The unaided-online setting further challenges the models' visual perception to generate captions with richer context (i.e., $Q$ and the knowledge on the path in $\hat{\mathcal{I}}$) compared to the unaided-offline setting. Our results show that only some models surpassed the offline variant (highlighted in green), yet the improvement is evident: GPT-4o improves by 0.1, and Claude 3.7 Sonnet by 0.07. This suggests that knowing what to do enhances a model's visual perception, strengthens its planning ability, and ultimately improves its reasoning. Additionally, we find that recall is consistently higher than precision, suggesting that models tend to retrieve excess knowledge. In contrast, humans showed higher precision (and both their precision and recall are high compared to MLLMs).
We also analyzed the performance differences in the unaided-online setting across the six sub-tasks for GPT-4o, Claude 3.7 Sonnet, and Gemini 1.5 Pro, as illustrated in Figure 4. GPT-4o achieves the best performance in all sub-tasks except I and VI. It is particularly strong in sub-tasks IV and V, which rely more on fine-grained visual perception. Additionally, sub-task I is the simplest, involving only a single relationship $v$, on which the models generally perform well. For sub-task VI, although the accuracy is high, models fail to find the correct path (they answer from their inherent knowledge instead). These results suggest that visual perception remains a key challenge for the MH Benchmark.
Figure 5 presents six examples (one per sub-task) in the vanilla and unaided-online settings, where all predictions are made by GPT-4o. In the retrieved-path panes, the solid green paths are in both $\mathcal{I}$ and $\hat{\mathcal{I}}$ (i.e., correctly retrieved paths), the dotted green paths are in $\mathcal{I}$ but not in $\hat{\mathcal{I}}$, and the black paths are in $\hat{\mathcal{I}}$ but not in $\mathcal{I}$. The key relationship or essential information for deriving the correct answer is highlighted in orange. We report the precision and recall for each example. GPT-4o retrieves many knowledge paths, capturing all relevant knowledge for prediction, though its precision is low. Interestingly, for challenging cases like sub-task II, GPT-4o successfully recognized detailed visual concepts (e.g., the monster flying backward and the monster clinging to the wall) and generated plausible captions, which require fine-grained visual perception.
5.3 Analysis of Factors Affecting Performance
Impact of Captioning Ability. Table 4 compares captioning performance, where the similarity metric (Sim.) measures the similarity between generated and human captions using GPT-4o (see the supplementary material for details). The table also summarizes the accuracy and knowledge consistency scores. The results indicate that the unaided-online setting consistently yields higher similarity scores than the unaided-offline one, suggesting that awareness of the query $Q$ and the retrieval process enhances captioning performance. The similarity scores also correlate positively with the reasoning accuracy. We additionally evaluated the unaided-offline variant of GPT-4o with its offline captions replaced by ones from the video models InternVideo2.5 [50] and VideoChat-Flash [32]. The results showed lower performance than the original GPT-4o across all metrics, indicating that current video models have limited visual comprehension capability on MH Benchmark.
Table 4: Reasoning and knowledge consistency performance and captioning performance. Improved performance scores due to online captioning are highlighted in green.
| Model | Vis.-Off. | Vis.-On. | Acc. | Pre. | Rec. | Sim. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [1] | ✓ | | .4050 | .2595 | .4416 | .2806 |
| GPT-4o [1] | | ✓ | .5105 | .2756 | .5625 | .2948 |
| Claude 3.7 Sonnet [2] | ✓ | | .3628 | .2775 | .3270 | .2776 |
| Claude 3.7 Sonnet [2] | | ✓ | .4388 | .2911 | .4029 | .3208 |
| Gemini 1.5 Pro [47] | ✓ | | .3164 | .1615 | .2194 | .1608 |
| Gemini 1.5 Pro [47] | | ✓ | .4050 | .2122 | .4585 | .1746 |
| InternVideo2.5 [50] | ✓ | | .3697 | .1960 | .2959 | .0525 |
| VideoChat-Flash [32] | ✓ | | .3445 | .2135 | .2863 | .0644 |
Keyframe Selection. Due to limits on the number of input tokens, current MLLMs cannot handle full videos, so keyframe sampling is necessary. MH-MMKG provides human-selected keyframes to facilitate evaluation on this point. To show the difference between human-selected keyframes and the typically adopted equal-interval sampling, Table 5 summarizes the results for GPT-4o, where sampling is done at two frames per second and the number of frames is capped at ten. The keyframe selection strategy does impact performance: human-selected keyframes seem to provide more informative visual cues, though the difference is not substantial.
Table 5: Impact of keyframe selection (GPT-4o; unaided-offline and unaided-online metrics).

| Keyframes | Off. Acc. | Off. Pre. | Off. Rec. | On. Acc. | On. Pre. | On. Rec. |
| --- | --- | --- | --- | --- | --- | --- |
| Human | .4050 | .2595 | .4416 | .5105 | .2756 | .5625 |
| Sampling | .3725 | .2199 | .3893 | .4840 | .2388 | .5254 |
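The equal-interval baseline (two frames per second, capped at ten frames) can be sketched as an index computation; the uniform-subsampling step used to enforce the cap is our assumption, as the paper does not specify how the cap is applied.

```python
def sample_frame_indices(n_frames, fps, rate=2.0, max_frames=10):
    """Equal-interval keyframe sampling: take `rate` frames per second,
    then cap at `max_frames` by uniform subsampling (capping strategy
    is an assumption for illustration)."""
    step = max(int(round(fps / rate)), 1)        # frames between samples
    idx = list(range(0, n_frames, step))
    if len(idx) > max_frames:
        # keep max_frames indices spread evenly over the clip
        idx = [idx[int(i * (len(idx) - 1) / (max_frames - 1))]
               for i in range(max_frames)]
    return idx
```

For a one-second clip at 30 fps this yields two frames; for a ten-second clip, twenty candidates are reduced to ten evenly spaced frames.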
(a) Using K paths as knowledge.
(b) BFS vs DFS
Figure 6: Ablation experiments for proposed multi-agents search.
The Number of Paths Used in Reasoning. In our experiments, the number of paths in $\hat{\mathcal{I}}$ used for reasoning is limited. All evaluations so far used 5 paths, though this number can affect performance; our baseline uses the first 5 paths found, so shorter paths are preferred. Figure 6(a) shows the relationship between the number of paths and the reasoning accuracy. GPT-4o gives the best performance when 5 paths are used. We also show in the figure the number of queries $Q$ for which at least a certain number of paths are retrieved. GPT-4o tends to retrieve more paths as knowledge (its recall is higher), while Claude 3.7 is more conservative in retrieval (its precision is higher).
BFS versus DFS. The retrieval strategy can also impact performance. BFS finds shorter paths first, while depth-first search (DFS) can find longer paths at an early stage of retrieval. We evaluated BFS (the proposed baseline) and DFS when three paths are used. As shown in Figure 6(b), BFS consistently performs better. This means that our queries can be answered within fewer hops in $\mathcal{G}$, which can be seen as a limitation of MH Benchmark, as it requires only a few steps of reasoning.
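The only difference between the two strategies in a retrieval loop is the frontier discipline: BFS pops the oldest open entity (a queue), while DFS pops the newest (a stack). A minimal illustration, with `neighbors` an adjacency map standing in for expansion over $\mathcal{G}$:

```python
from collections import deque

def traverse(root, neighbors, strategy="bfs"):
    """Order in which entities are explored under BFS vs. DFS.
    neighbors: entity -> list of adjacent entities (illustrative)."""
    frontier = deque([root])
    visited, order = {root}, []
    while frontier:
        # BFS: dequeue the oldest entity; DFS: pop the newest
        e = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(e)
        for e2 in neighbors.get(e, []):
            if e2 not in visited:
                visited.add(e2)
                frontier.append(e2)
    return order
```

Because BFS explores all one-hop neighbors before any two-hop entity, the first few paths it yields are the shortest ones, which is consistent with BFS performing better when only three paths are kept.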
6 Conclusion
This work explores the ability of MLLMs to handle domain-specific tasks by retrieving knowledge from an MMKG. We introduce MH-MMKG and the MH Benchmark as a testbed. Additionally, we propose a baseline with a multi-agent knowledge retriever that allows MLLMs to autonomously access relevant knowledge without requiring additional training. Experimental results highlight the importance of both the visual perception of MLLMs and finding relevant knowledge for reasoning. Our work paves the way for more adaptable and context-aware MLLMs in complex real-world scenarios. Future research may focus on expanding the benchmark size, incorporating a wider range of knowledge types, exploring the potential of video modalities, and developing more advanced knowledge retrieval methods.
Acknowledgement
This work is supported by World Premier International Research Center Initiative (WPI), MEXT, Japan. This work is also supported by JST ACT-X Grant No. JPMJAX24C8, JSPS KAKENHI No. 24K20795, CREST Grant No. JPMJCR20D3, and JST FOREST Grant No. JPMJFR216O.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Anthropic [2024] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- Baumgartner et al. [2020] Matthias Baumgartner, Luca Rossetto, and Abraham Bernstein. Towards using semantic-web technologies for multi-modal knowledge graph construction. In ACM Multimedia, pages 4645–4649, 2020.
- Bollacker et al. [2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.
- Bonomo and Bianco [2025] Mirco Bonomo and Simone Bianco. Visual rag: Expanding mllm visual knowledge without fine-tuning. arXiv preprint arXiv:2501.10834, 2025.
- Caffagni et al. [2024] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In CVPR, pages 1818–1826, 2024.
- Chen et al. [2025] Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, and Hui Xiong. Plan-on-graph: Self-correcting adaptive planning of large language model on knowledge graphs. In AAAI, 2025.
- Chen et al. [2024a] Peng Chen, Pi Bu, Jun Song, Yuan Gao, and Bo Zheng. Can vlms play action role-playing games? take black myth wukong as a study case. arXiv preprint arXiv:2409.12889, 2024a.
- Chen et al. [2022] Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William Cohen. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In EMNLP, pages 5558–5570, 2022.
- Chen et al. [2023] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In EMNLP, 2023.
- Chen et al. [2024b] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
- Chen et al. [2024c] Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, et al. Knowledge graphs meet multi-modal learning: A comprehensive survey. arXiv preprint arXiv:2402.05391, 2024c.
- Cui et al. [2024] Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In WACV, pages 958–979, 2024.
- Edge et al. [2024] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- Fang et al. [2024] Siyuan Fang, Kaijing Ma, Tianyu Zheng, Xinrun Du, Ningxuan Lu, Ge Zhang, and Qingkun Tang. Karpa: A training-free method of adapting knowledge graph as references for large language model's reasoning path aggregation. arXiv preprint arXiv:2412.20995, 2024.
- Ferrada et al. [2017] Sebastián Ferrada, Benjamin Bustos, and Aidan Hogan. Imgpedia: a linked dataset with content-based analysis of wikimedia images. In ISWC, pages 84–93, 2017.
- Gao et al. [2023] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
- Guan et al. [2024] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, pages 14375–14385, 2024.
- Gui et al. [2022] Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander G Hauptmann, Yonatan Bisk, and Jianfeng Gao. Kat: A knowledge augmented transformer for vision-and-language. In NAACL, pages 956–968, 2022.
- Guo et al. [2021] Hao Guo, Jiuyang Tang, Weixin Zeng, Xiang Zhao, and Li Liu. Multi-modal entity alignment in hyperbolic space. Neurocomputing, 461:598–607, 2021.
- He et al. [2024] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering. In NeurIPS, 2024.
- Ishiwatari et al. [2020] Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In EMNLP, pages 7360–7370, 2020.
- Ji et al. [2021] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and S Yu Philip. A survey on knowledge graphs: Representation, acquisition, and applications. TNNLS, 33(2):494–514, 2021.
- Ji et al. [2024] Yixin Ji, Kaixin Wu, Juntao Li, Wei Chen, Mingjie Zhong, Xu Jia, and Min Zhang. Retrieval and reasoning on kgs: Integrate knowledge graphs into large language models for complex question answering. In Findings of EMNLP, pages 7598–7610, 2024.
- Jiang et al. [2023] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. In EMNLP, pages 9237–9251, 2023.
- Jiang et al. [2024] Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Huang Zihe, Jian Guo, and Yuanzhuo Wang. Mm-chatalign: A novel multimodal reasoning framework based on large language models for entity alignment. In Findings of EMNLP, pages 2637–2654, 2024.
- Jin et al. [2024] Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. Large language models on graphs: A comprehensive survey. TKDE, 2024.
- Krishna et al. [2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123:32–73, 2017.
- Lee et al. [2024] Junlin Lee, Yequan Wang, Jing Li, and Min Zhang. Multimodal reasoning with multimodal knowledge graph. In ACL, pages 10767–10782, 2024.
- Li et al. [2023] Qian Li, Cheng Ji, Shu Guo, Zhaoji Liang, Lihong Wang, and Jianxin Li. Multi-modal knowledge graph transformer framework for multi-modal entity alignment. In Findings of EMNLP, pages 987–999, 2023.
- Li et al. [2024a] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024a.
- Li et al. [2024b] Yangning Li, Yinghui Li, Xingyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S Yu, Fei Huang, et al. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent. arXiv preprint arXiv:2411.02937, 2024b.
- Lin and Byrne [2022] Weizhe Lin and Bill Byrne. Retrieval augmented visual question answering with outside knowledge. In EMNLP, pages 11238–11254, 2022.
- Liu et al. [2019] Ye Liu, Hui Li, Alberto Garcia-Duran, Mathias Niepert, Daniel Onoro-Rubio, and David S Rosenblum. Mmkg: multi-modal knowledge graphs. In ESWC, pages 459–474. Springer, 2019.
- Lu et al. [2024] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv:2405.20797, 2024.
- Luo et al. [2024] Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, and Shirui Pan. Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models. arXiv preprint arXiv:2410.13080, 2024.
- Lymperaiou and Stamou [2024] Maria Lymperaiou and Giorgos Stamou. A survey on knowledge-enhanced multimodal learning. Artificial Intelligence Review, 57(10):284, 2024.
- Nguyen et al. [2024] Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, and Luu Anh Tuan. Video-language understanding: A survey from model architecture, model training, and data perspectives. ACL, 2024.
- Peng et al. [2024] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921, 2024.
- Ravi et al. [2023] Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, and Vered Shwartz. Vlc-bert: Visual question answering with contextualized commonsense knowledge. In WACV, pages 1155–1165, 2023.
- Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, pages 146–162. Springer, 2022.
- Stepfun [2024] Stepfun. Step-1o. In https://platform.stepfun.com/, 2024.
- Su et al. [2024] Cheng Su, Jinbo Wen, Jiawen Kang, Yonghua Wang, Yuanjia Su, Hudan Pan, Zishao Zhong, and M Shamim Hossain. Hybrid rag-empowered multi-modal llm for secure data management in internet of medical things: A diffusion-based contract approach. IEEE Internet of Things Journal, 2024.
- Sun et al. [2025] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In ICLR, 2025.
- Talebirad and Nadiri [2023] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Wang et al. [2024a] Bowen Wang, Jiuyang Chang, Yiming Qian, Guoxin Chen, Junhao Chen, Zhouqiang Jiang, Jiahao Zhang, Yuta Nakashima, and Hajime Nagahara. Direct: Diagnostic reasoning for clinical notes via large language models. In NeurIPS, pages 74999–75011, 2024a.
- Wang et al. [2024b] Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024b.
- Wang et al. [2025] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.
- Wang et al. [2024c] Zheng Wang, Shu Teo, Jieer Ouyang, Yongjun Xu, and Wei Shi. M-RAG: Reinforcing large language model performance through retrieval-augmented generation with multiple partitions. In ACL, pages 1966–1978, 2024c.
- Wu et al. [2024] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
- Yao et al. [2024] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- Yin et al. [2023] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, pages 9556–9567, 2024.
- Zeng et al. [2023] Yawen Zeng, Qin Jin, Tengfei Bao, and Wenfeng Li. Multi-modal knowledge hypergraph for diverse image retrieval. In AAAI, pages 3376–3383, 2023.
- Zhao et al. [2024] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024.
- Zhao et al. [2023] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. Retrieving multimodal information for augmented generation: A survey. In Findings of EMNLP, pages 4736–4756, 2023.
- Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS, 36:46595–46623, 2023.
- Zhu et al. [2022] Xiangru Zhu, Zhixu Li, Xiaodan Wang, Xueyao Jiang, Penglei Sun, Xuwu Wang, Yanghua Xiao, and Nicholas Jing Yuan. Multi-modal knowledge graph construction and application: A survey. TKDE, 36(2):715–735, 2022.
Supplementary Material

Contents

- 1 Introduction
- 2 Related Works
- 2.1 Multimodal Retrieval Augmented Generation
- 2.2 Multimodal Knowledge Graph
- 2.3 Retrieve Knowledge on Graph
- 3 Datasets
- 3.1 MH-MMKG
- 3.2 MH Benchmark
- 3.3 Task Definition
- 4 Method
- 4.1 Multi-agents Retriever
- 4.2 Reasoning via Knowledge Augmentation
- 5 Results
- 5.1 Experimental Settings
- 5.2 Performance of MLLMs
- 5.3 Analysis of Factors Affecting Performance
- 6 Conclusion
- 7 Detail of MH Benchmark Construction
- 7.1 MH-MMKG
- 7.2 MH Benchmark
- 8 Experiment Details
- 8.1 Prompt Template for Our Method
- 8.2 Knowledge Consistency Calculation
- 8.3 Human Evaluation of GPT-4o as a Judge
- 8.4 Additional Experiments for MH Benchmark
- 8.5 More Result Samples
7 Detail of MH Benchmark Construction
In this section, we detail the construction of our MH-MMKG and the MH Benchmark.
7.1 MH-MMKG
A total of 22 monsters are incorporated into the graph construction, each represented as a subgraph connected through various relationships, such as species relations and elemental weaknesses. Each subgraph contains rich information crucial for successful conquests, particularly regarding attack strategies, combos, attack phases, and launch conditions. The monsters are: Anjanath, Azure Rathalos, Barroth, Bazelgeuse, Brachydios, Diablos, Frostfang Barioth, Glavenus, Kushala Daora, Legiana, Nergigante, Rathalos, Rathian, Teostra, Tigrex, Uragaan, Zinogre, Pink Rathian, Yian Garuga, Stygian Zinogre, and Radobaan from Monster Hunter: World.
To ensure quality, we hired three experienced Monster Hunter: World players, each with over 200 hours of gameplay experience. They were tasked with gathering relevant monster information from sources such as wikis, YouTube, and Bilibili to construct the graph. Additionally, since each monster has unique characteristics within the game, the structure of each subgraph is tailored accordingly. The entities are classified into 7 types, as shown in Table 6. Most of them are attack actions, making MH-MMKG focus primarily on battles with monsters; we plan to cover more game elements in the future. Some entities have text, images, or videos attached as attributes. Note that all videos and images in MH-MMKG are captured from the Arena field, whereas all visual media for queries in the MH Benchmark are captured from the Wild field. We also show the length statistics of the video clips in Figure 7; most videos are between 1 and 5 seconds long.
Table 6: Types of entity in MH-MMKG.
| Topic Entity | Names of monsters that can serve as root entities for knowledge retrieval. Each entity is accompanied by an image of monster as its attribute. | 22 |
| --- | --- | --- |
| Attack Action | Possible attack movements of a monster, each accompanied by text, images (key frames of the video), or a video as its attribute. Each video is also accompanied by a human-written caption. | 265 |
| Attack Phase | In different phases, a monster will have varying attack patterns, damage, combos, and other attributes. Only some monsters have unique phase settings. Textual context is attached as an attribute. | 20 |
| Element | The element indicates a monsterâs weakened resistance to a specific type of attack. | 9 |
| Weapon | Types of damage for weapons crafted from monster materials. | 10 |
| Props | Various types of game props for interacting with monsters. | 6 |
| Attack Effects | The effects of monster attacks or skills during battle, including ice patches generated on the ground, scratches, and explosions. Textual context is attached as an attribute. | 9 |
There are also 158 kinds of edges, 16 of which are base edges: "has attack action of", "continues with attack action of", "has attack variant of", "has attack phase of", "change attack phase to", "is mostly weaken with", "is weaken with", "is resistant with", "provide materials for", "can be stopped by", "has attack variants of", "generates", "cause", "turns to", "mated pair with", "has subspecies of". Some base edges (mostly the first two) are further combined with specific constraint mechanisms to form 142 variants. Examples of constraints include: "is angry", "hunter step into", "is close to", "stick on the wall", "is knocked by", etc. (We omit the full list for brevity.)
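The combination of a base edge with a constraint can be sketched as follows; the dictionary layout is our own illustration, not the released dataset schema, while the relation and constraint strings follow the paper:

```python
# Illustrative encoding of variant edges: a base relation optionally
# conditioned on a game-state constraint.

def make_variant(base_relation, constraint=None):
    """A variant edge conditions a base relation on a constraint (or none)."""
    return {"relation": base_relation, "condition": constraint}

print(make_variant("continues with attack action of", "is angry"))
print(make_variant("has attack action of"))  # plain base edge, no constraint
```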
<details>
<summary>x8.png Details</summary>

### Visual Description
## Density Plot: Duration Distribution
### Overview
The image depicts a density plot showing the distribution of durations measured in seconds. The plot features a single blue-shaded curve representing the probability density function of the data. The x-axis represents duration in seconds, while the y-axis represents density values.
### Components/Axes
- **X-axis (Duration)**: Labeled "Duration (seconds)" with numerical markers at 0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, and 17.5 seconds.
- **Y-axis (Density)**: Labeled "Density" with values ranging from 0.00 to 0.35 in increments of 0.05.
- **Legend**: No explicit legend is present, but the blue-shaded area corresponds to the primary data series.
- **Grid**: Light gray grid lines are visible in the background for reference.
### Detailed Analysis
1. **Primary Peak**:
- The curve reaches its maximum density of approximately **0.34** at **2.5 seconds**.
- This peak dominates the distribution, indicating the highest probability density occurs at this duration.
2. **Secondary Peak**:
- A smaller secondary peak appears near **6 seconds** with a density of approximately **0.03**.
- This peak is significantly lower than the primary peak, suggesting a less frequent but notable occurrence.
3. **Tails**:
- The left tail (durations < 2.5s) is negligible, with density approaching 0.
- The right tail (durations > 6s) declines gradually, with density dropping below 0.01 after 10 seconds and nearly zero beyond 15 seconds.
### Key Observations
- The distribution is **unimodal** with a dominant peak at 2.5 seconds, though a minor secondary peak at 6 seconds introduces slight bimodality.
- The distribution is **right-skewed**, with a long tail extending toward higher durations.
- The density drops sharply after the primary peak, indicating most data points cluster tightly around 2.5 seconds.
### Interpretation
This density plot suggests that the majority of observed durations are concentrated around **2.5 seconds**, with a mode at this value. The secondary peak at 6 seconds may represent a secondary process or outlier group, though its low density (0.03) indicates it accounts for a small fraction of the data. The right-skewed tail implies rare instances of significantly longer durations, potentially outliers or edge cases in the dataset. The sharp decline after the primary peak highlights the rarity of durations exceeding 6 seconds. This could reflect a system or process with a typical operational time of ~2.5 seconds, with occasional deviations.
</details>
Figure 7: Video clip length statistic.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: Kushala Daora Combat Mechanics
### Overview
The diagram illustrates the combat mechanics of the monster Kushala Daora, detailing its phases (Aerial and Ground) and associated attacks. Arrows indicate relationships between phases, attacks, and phase transitions, with color-coded legends for clarity.
### Components/Axes
- **Legend** (top-right corner):
- **Red arrows**: "has phase of"
- **Blue arrows**: "change phase to"
- **Black arrows**: "has attack of"
- **Key Elements**:
- **Kushala Daora** (left side): Central entity.
- **Phases**:
- **Aerial Phase** (center-top): Connected to Kushala Daora via red arrow.
- **Ground Phase** (center-bottom): Connected to Kushala Daora via red arrow.
- **Attacks**:
- **Aerial Phase Attacks**:
- Aerial Kick
- Aerial Super Breath
- Aerial Triple Breath
- Aerial Crash
- **Ground Phase Attacks**:
- Ground Breath
- Ground Charged Breath
- Leap Attack
- Charge Attack
- **Transitions**:
- Blue arrow from Aerial Phase to Ground Phase (phase change).
- Blue arrow from Ground Phase to Aerial Phase (phase change).
- **Action**:
- "Fly" (left side): Connected to Ground Phase via black arrow.
### Detailed Analysis
- **Phase Relationships**:
- Kushala Daora "has phase of" both Aerial and Ground Phases (red arrows).
- Phases can transition to each other via blue arrows (e.g., Aerial → Ground, Ground → Aerial).
- **Attack Assignments**:
- Each phase has four distinct attacks linked via black arrows.
- No overlap in attacks between phases (e.g., Aerial attacks are exclusive to Aerial Phase).
- **Action Trigger**:
- "Fly" action is tied to the Ground Phase, suggesting it may initiate or enable Ground Phase mechanics.
### Key Observations
1. **Symmetry**: Both phases have an equal number of attacks (4 each).
2. **Phase Flexibility**: The ability to switch phases (blue arrows) implies dynamic combat strategy.
3. **Exclusivity**: Attacks are strictly phase-dependent (no shared attacks between phases).
4. **Action Dependency**: The "Fly" action is uniquely linked to the Ground Phase, possibly indicating a prerequisite or trigger.
### Interpretation
This diagram represents a state machine for Kushala Daora's combat behavior, emphasizing phase-dependent attacks and strategic phase transitions. The red arrows establish ownership of phases, while blue arrows enable tactical shifts between states. The "Fly" action's connection to the Ground Phase suggests it may be a foundational mechanic for accessing Ground Phase abilities. The lack of shared attacks between phases highlights the importance of phase management in combat, requiring players to adapt to the monster's state changes. The diagram's simplicity underscores a clear, rule-based system with no ambiguity in attack-phase relationships.
</details>
Figure 8: Subgraph structure for Kushala Daora.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Flowchart: Brachydios Attack Mechanics
### Overview
The diagram illustrates the attack mechanics of the creature "Brachydios," mapping its primary attacks and their relationships through colored arrows. Each arrow type represents a specific relationship (e.g., "has attack of," "generates," "causes"), with a legend at the bottom clarifying these connections. The flowchart emphasizes how certain attacks spawn variants or trigger chain reactions.
### Components/Axes
- **Nodes**:
- Primary attacks: Slime Reapply, Punch, Jump Attack, Tail Swipe.
- Secondary elements: green slime, orange slime, Explosive Slime, Slime Explosion.
- **Edges**:
- Arrows with distinct colors (black, blue, orange, purple) denote relationships.
- **Legend**:
- Black: "has attack of"
- Blue: "has attack variant"
- Orange: "generates"
- Purple: "causes"
### Detailed Analysis
1. **Brachydios** (central node) connects via black arrows ("has attack of") to:
- Slime Reapply
- Punch
- Jump Attack
- Tail Swipe
2. **Punch** (blue arrow: "has attack variant") branches into:
- Green slime (orange arrow: "generates" â Explosive Slime)
- Orange slime (orange arrow: "generates" â Slime Explosion)
3. **Explosive Slime** (purple arrow: "causes" â Slime Explosion).
4. **Jump Attack** and **Tail Swipe** have no further connections.
### Key Observations
- **Chain Reaction**: Punch → (green/orange slime) → (Explosive Slime/Slime Explosion) demonstrates a cascading effect.
- **Isolated Attacks**: Jump Attack and Tail Swipe lack downstream effects, suggesting they are terminal actions.
- **Color Consistency**: All arrows from Brachydios to primary attacks are black, aligning with the legend.
### Interpretation
The diagram reveals a hierarchical attack system where the Punch attack is pivotal, spawning slime variants that escalate into explosive outcomes. The use of color-coded relationships clarifies the mechanics:
- **Punch** acts as a catalyst, generating slimes that "generate" explosions.
- **Slime Explosion** is the terminal node, caused by Explosive Slime.
- **Tail Swipe** and **Jump Attack** are standalone, possibly indicating simpler or less impactful moves.
This structure suggests Brachydios prioritizes the Punch attack for its ability to trigger complex, high-damage sequences, while other attacks serve as direct, uncomplicated strikes.
</details>
Figure 9: Subgraph structure for Brachydios.
We present some subgraphs to illustrate structural diversity, focusing only on attack actions and their related entities, as the other components resemble Figure 2 in the main paper. The main paper showcases the graph structure of Zinogre, known for its extensive combo attacks. Here, we provide two additional examples: Kushala Daora and Brachydios. Kushala Daora exhibits distinct attack patterns in its Aerial Phase (attacking from the air) and Ground Phase (attacking on the ground), as shown in Figure 8. Certain attacks can transition between these phases, making this information crucial for an MLLM to accurately answer related questions. Brachydios, on the other hand, has attack variations that depend on the color of the slime on its fists or head, as illustrated in Figure 9. The color change alters both the attack variant and its effect, adding another layer of complexity to its combat behavior. MLLMs have to comprehend such complex information to correctly answer the questions in the MH Benchmark.
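The phase-dependent structure described for Kushala Daora can be sketched as a small state machine; the encoding below is our own illustration, and the attack lists are abbreviated from Figure 8:

```python
# Minimal sketch of a phase-dependent subgraph: phase-exclusive attacks
# plus transitions between phases ("change attack phase to").

subgraph = {
    "phases": {
        "Aerial Phase": ["Aerial Kick", "Aerial Super Breath"],
        "Ground Phase": ["Ground Breath", "Charge Attack"],
    },
    "transitions": {"Aerial Phase": "Ground Phase", "Ground Phase": "Aerial Phase"},
}

def possible_attacks(phase):
    """Attacks are strictly phase-exclusive, so retrieval must track the phase."""
    return subgraph["phases"][phase]

print(possible_attacks("Ground Phase"))
```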
7.2 MH Benchmark
To differentiate from MH-MMKG, all visual media for queries are captured from the Wild field. Additionally, we present statistics on the average number of entities and depth of knowledge associated with each query in the MH Benchmark, as shown in Table 7. It can be observed that sub-task I is relatively simple, as it relies solely on the topic node. In contrast, sub-tasks II and VI involve a greater number of steps and deeper analysis, as they pertain to combo recognition and cross-monster comparison, both of which require more complex reasoning.
Table 7: Average number of entities and depth of knowledge for each query in MH Benchmark.
| | Sub-task I | Sub-task II | Sub-task III | Sub-task IV | Sub-task V | Sub-task VI |
| --- | --- | --- | --- | --- | --- | --- |
| Number avg | 1 | 2.339 | 3.535 | 2.4137 | 3.028 | 4.076 |
| Depth avg | 1 | 2.278 | 3.250 | 2.4137 | 2.900 | 3.038 |
Table 8: Prompt for perceiver agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster {monster name}. |
| The given "Question" regarding the battle screen is: {question} |
| Generate a "Description" of the battle scene as your "Response", detailing the monster's limb and body movements, mouth actions, surroundings, and other relevant details. |
| Note that you should not give any assumptions for the "Description". |
| Note that you should directly output your "Response" and do not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 9: Prompt for topic entity selection agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster: {monster name}. |
| The given "Question" regarding the battle screen is: {question} |
| All possible monster names "Options" are structured in a list format as follows: {topic entity} |
| Note that your "Response" is to directly output the name of the monster you are looking for. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. Your "Response": |
Table 10: Prompt for expansion agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| The text description of the battle screen is: {caption}. |
| Based on the battle screen, here is the "Question" you need to answer: {question}. |
| To answer the above question, you are now searching a knowledge graph to find the route towards relevant knowledge. The following contents are the knowledge you found so far (up to current entity {entity}): |
| ***** |
| {memory} |
| ***** |
| You need to select the relevant "Neighbor Entity" that may provide knowledge to answer the question. The relation and condition from current entity "entity" to all "Neighbor Entity" are: |
| ***** |
| {neighbor entity} |
| ***** |
| Your "Response" is directly output the name of all relevant "Neighbor Entity" and separate them directly by ";". |
| If there is no relevant "Neighbor Entity", directly output "None". |
| Note that if the "Neighbor Entity" is an attack action, always choose it (if it is not highly irrelevant). |
| Note that if the "Neighbor Entity" is a phase, you can only choose one. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
8 Experiment Details
In this section, we present the detailed settings of our baseline method, including the prompts for each agent, additional experiments, and more result samples.
8.1 Prompt Template for Our Method
We first present the prompt templates for all agents in the retrieval pipeline. Table 8 shows the prompt for the perceiver agent, which translates input images into text based on the given question. Table 9 provides the prompt for the topic selection agent, responsible for selecting the starting entity for knowledge retrieval from the graph. Table 10 contains the prompt for the expansion agent, which plans the next neighboring entity to search. Table 11 presents the prompt for the validation agent, designed to assess the efficiency of knowledge transfer from the starting entity to the current entity. Finally, Table 12 includes the prompt for the summarizer agent, which synthesizes the retrieved knowledge for final answer generation. Among these placeholders, {monster name}, displayed in blue text, represents additional information given as part of the question. {entity} is the name of the current entity during the search. {question} refers to the input query, while {topic entities} denotes the names of all topic entities. {entity info} is the non-visual additional information for an entity. {caption} is the description generated by the perceiver agent.
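Since these prompts are plain templates with named placeholders, instantiating them reduces to string substitution. The snippet below is a hedged sketch: `PERCEIVER_TEMPLATE` is an abbreviated stand-in for the full Table 8 prompt, not the released code:

```python
# Filling a prompt template for the perceiver agent with str.format.

PERCEIVER_TEMPLATE = (
    "You are a professional Monster Hunter player. "
    'You are playing "Monster Hunter: World".\n'
    "You will receive consecutive video frames displaying the battle screen "
    "with the monster {monster_name}.\n"
    'The given "Question" regarding the battle screen is: {question}\n'
)

prompt = PERCEIVER_TEMPLATE.format(
    monster_name="Zinogre",
    question="Which attack will the monster most likely perform next?",
)
print(prompt)
```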
The {neighbor entity} placeholder holds the neighbor options for the current entity. It is presented in a text format consisting of entity-edge triplets combined with the corresponding constraints or conditions (if any). Here is a neighbor sample for the attack action entity "Straight Ice Breath" of the monster "Frostfang Barioth":
- "Straight Ice Breath" continues with attack action of "Super Fang Slam" (Condition: When hunter hitted by the breath…)
- "Straight Ice Breath" continues with attack action of "Tail Spin" (Condition: When Frostfang Barioth already released two…)
In our prompt, we instruct the model to select relevant neighboring entities while placing greater emphasis on attack action entities, as most tasks are designed around them. For phase entities, we allow the model to select only one, in accordance with the game mechanics.
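As a concrete illustration, the neighbor options above could be rendered from entity-edge triplets roughly as follows. This is a hypothetical sketch under our own assumptions about the data layout; `format_neighbors` is not part of the released code.

```python
def format_neighbors(entity, neighbors):
    """Render neighbor options as entity-edge lines with optional conditions.

    neighbors: list of (relation, target, condition) tuples; condition may be None.
    (Hypothetical layout; the actual graph storage may differ.)
    """
    lines = []
    for relation, target, condition in neighbors:
        line = f'- "{entity}" {relation} of "{target}"'
        if condition is not None:
            line += f" (Condition: {condition})"
        lines.append(line)
    return "\n".join(lines)

# Example mirroring the "Straight Ice Breath" sample above.
text = format_neighbors(
    "Straight Ice Breath",
    [("continues with attack action", "Super Fang Slam",
      "When the hunter is hit by the breath..."),
     ("continues with attack action", "Tail Spin", None)],
)
print(text)
```

The rendered lines are then inserted into the expansion agent's prompt as the {neighbor entity} field.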
The {memory} records the search path from the starting entity to the current entity, including entity names and all relevant information at each step. Below is an example illustrating the conversion of a knowledge path:
$$
\displaystyle\textit{Zinogre}\xrightarrow{\text{phase of}}\textit{Charged Phase}\xrightarrow{\text{attack of}}\textit{Double Slam} \tag{12}
$$
will be converted into:
- "Zinogre": Additional Information: Zinogre has the appearance of a wild wolf and lives in the mountains full of dense trees …
- "Zinogre" has attack phase of "Charged Phase".
- "Charged Phase": Additional Information: Zinogre is charged, the body will be surrounded by electric …
- "Charged Phase" has attack action of "Double Slam".
- "Double Slam": Action Description: Zinogre lowers his head and rubs the ground with…
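The conversion above could be sketched roughly as follows. This is a hypothetical illustration under our own assumptions about the path representation; `path_to_text` is not the released implementation.

```python
def path_to_text(steps, entity_info):
    """Convert a knowledge path into the bullet-style {memory} text.

    steps: list of (head, relation, tail) triplets along the path.
    entity_info: per-entity attribute text (e.g., Additional Information).
    """
    lines, seen = [], set()

    def emit_info(entity):
        # Emit an entity's attribute line once, the first time it appears.
        if entity in entity_info and entity not in seen:
            lines.append(f'- "{entity}": {entity_info[entity]}')
            seen.add(entity)

    for head, relation, tail in steps:
        emit_info(head)
        lines.append(f'- "{head}" has {relation} of "{tail}".')
    emit_info(steps[-1][2])  # attributes of the final (leaf) entity
    return "\n".join(lines)

memory = path_to_text(
    [("Zinogre", "attack phase", "Charged Phase"),
     ("Charged Phase", "attack action", "Double Slam")],
    {"Zinogre": "Additional Information: Zinogre has the appearance of a wild wolf ...",
     "Charged Phase": "Additional Information: Zinogre is charged ...",
     "Double Slam": "Action Description: Zinogre lowers his head ..."},
)
print(memory)
```

Each retrieved path is serialized this way and concatenated to form the {memory} (and later the {knowledge}) field of the prompts.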
Table 11: Prompt for validation agent. Content in [] is used solely for unaided-online experiments.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| The text description of the battle screen is: {caption}. |
| Based on the battle screen, here is the "Question" you need to answer: {question}. |
| To answer the above question, you are now searching a knowledge graph to find the route towards relevant knowledge. The following contents are the knowledge you found so far (up to current entity {entity}): |
| ***** |
| {memory} |
| ***** |
| And here is some information about the current entity: {entity info}. |
| [You will also receive consecutive video frames showing the battle screen with the monster {monster name} as visual information for the current entity {entity}. |
| Make a "Description" (do not be affected by the previous text description of the battle screen for the "Question") for the battle screen as a part of your "Response". The "Description" should include the monster's limb and body movements, mouth, surroundings and other details. |
| Note that you should not give any assumptions for the "Description".] |
| You have to decide whether the visual and text information of this entity, together with the previously found knowledge, is sufficient for answering this "Question". |
| For the sufficiency analysis, your "Answer" is "Yes" or "No". |
| [Directly output your "Response" as the combination of "Answer" and "Description", separating them directly by ";".] |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 12: Prompt for summarizer agent.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames displaying the battle screen with the monster {monster name}. Based on the battle screen, here is the "Question" you need to answer: {question}. |
| Here is the "Knowledge" you retrieved from a knowledge graph for this "Question": |
| ***** |
| {knowledge} |
| ***** |
| Your "Response" is to provide the answer for this "Question" based on the retrieved "Knowledge". |
| Note that you should not give any analysis. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Note that Additional Information is an attribute of an entity (if it exists). Action Description is given as a human-made caption in the Knowledgeable experiments, pre-extracted from the visual attribute (if it exists) in unaided-offline, and dynamically generated from the visual attribute (if it exists) in unaided-online. In particular, [] highlights the content for unaided-online that requires the model to comprehend visual references during validation and output the corresponding description as the temporal visual attribute for the current entity.
As shown in Table 12, the summarizer agent treats all retrieved paths as {knowledge}, using the same conversion strategy as {memory}. Each path is converted into a text description and attached to the query as input.
Table 13 shows the prompt template for the unaided-offline experiments. It is used to pre-extract the visual reference (images or video for MLLMs or video models, respectively, in our experiments) into a text description. This conversion is independent of the query and the search memory.
Note that the prompts for the agent pipeline were developed using InternVL2.5-78B [12], with the expectation that even open-source models, given their instruction-following capabilities, can understand these prompts and generate responses in the required format. This ensures a fair comparison for all closed-source models in the main paper. We further conducted a preliminary prompt robustness analysis for GPT-4o and Claude 3.7 (unaided-online). Our observations show that Claude generally exhibited robust performance across prompt variations, particularly for agents with straightforward instructions such as Perceiver, Topic Selection, and Summarizer. However, GPT-4o exhibited sensitivity to lexical choice. For instance, in the Validation agent, we use the term "sufficient" to decide whether the retrieved knowledge is enough and the retrieval should be stopped. When we replaced it with "necessary," GPT-4o tended to be more cautious during retrieval. This minor change led to drops of .0546 and .0871 in Acc. and Rec., respectively, though with a .0194 improvement in Pre. These findings suggest that prompt robustness is both model-specific and agent-specific.
Table 13: Prompt for offline caption pre-extraction.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| You will receive consecutive video frames showing the battle screen as visual information for {entity}. |
| Make a "Description" for the battle screen as your "Response". The "Description" should include the monster's limb and body movements, mouth, surroundings and other details. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 14: Prompt for accuracy calculation using GPT-4o as a judge.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| Here is a "Question" that needs to be answered: {question}. |
| There are also two answers for this "Question": |
| Answer 1: {answer gt}. |
| Answer 2: {answer pred}. |
| Your "Response" is to decide whether the content of these two answers is similar. |
| If similar, directly output "Yes". |
| If not similar, directly output "No". |
| Note that you may ignore the format difference. |
| Ignore the monster name prefix, e.g., Zinogre Leap Attack and Leap Attack have the same meaning. |
| Here are some samples for deciding similarity: |
| Sample 1: |
| "Question": Tell me what is the specific name of attack action that Zinogre is performing? |
| "Answer 1": Static Charge |
| "Answer 2": Thunder Charge B |
| "Response": No |
| Sample 2: |
| "Question": Start with counterattack, Zinogre released the attack action shown in the input battle screen. Tell me what is the next attack action? |
| "Answer 1": Zinogre Back Slam |
| "Answer 2": Back Slam |
| "Response": Yes |
| Sample 3: |
| "Question": What attack action Brachydios is unleashing? |
| "Answer 1": Brachydios is unleashing the Brachydios Ground Slime Explosion attack |
| "Answer 2": Ground Slime Explosion |
| "Response": Yes |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
Table 15: Prompt for similarity calculation between generated and human-made caption using GPT-4o as a judge.
| You are a professional Monster Hunter player. You are playing "Monster Hunter: World". |
| --- |
| Here are two text descriptions of a monster attack action. |
| Your "Response" is to decide whether the content of these two text descriptions is similar. |
| You should focus on the details of movement and some key information that can help you discriminate the action. |
| If similar, directly output "Yes". |
| If not similar, directly output "No". |
| The first description is {truth}. |
| The second description is {generated}. |
| Note that you should not output any information other than your "Response". |
| Now, start to complete your task. |
| Your "Response": |
8.2 Knowledge Consistency Calculation
As defined in the main paper, the model's final output is a retrieved subgraph, denoted as $\hat{\mathcal{I}}$. We consider each path from the root entity to a leaf entity as a unique knowledge instance and represent the set of such paths as $\hat{\mathcal{L}}$. The knowledge consistency is computed between $\hat{\mathcal{L}}$ and the ground-truth knowledge paths $\mathcal{L}$ using a one-to-one matching approach.
The recall and precision of retrieved knowledge paths are defined as follows:
$$
\text{Recall}=\frac{|\hat{\mathcal{L}}\cap\mathcal{L}|}{|\mathcal{L}|} \tag{13}
$$
$$
\text{Precision}=\frac{|\hat{\mathcal{L}}\cap\mathcal{L}|}{|\hat{\mathcal{L}}|} \tag{14}
$$
where $\hat{\mathcal{L}}\cap\mathcal{L}$ represents the set of correctly retrieved knowledge paths. Recall measures the proportion of ground-truth knowledge paths successfully retrieved by the model, while precision measures the proportion of retrieved paths that are correct.
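In code, the one-to-one matching reduces to a set intersection when each path is represented as a hashable tuple of entity names. The following is a minimal sketch under that assumption; the function name and example paths are our own, not the released evaluation code.

```python
def knowledge_consistency(retrieved_paths, gt_paths):
    """Recall/precision between retrieved and ground-truth knowledge paths.

    Each path is a tuple of entity names from root to leaf, so identical
    paths match one-to-one via set intersection.
    """
    retrieved, gt = set(retrieved_paths), set(gt_paths)
    correct = retrieved & gt  # correctly retrieved knowledge paths
    recall = len(correct) / len(gt) if gt else 0.0
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical example: one ground-truth path, two retrieved paths.
gt = [("Zinogre", "Charged Phase", "Double Slam")]
pred = [("Zinogre", "Charged Phase", "Double Slam"),
        ("Zinogre", "Normal Phase", "Leap Attack")]
rec, pre = knowledge_consistency(pred, gt)  # rec = 1.0, pre = 0.5
```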
8.3 Human Evaluation of GPT-4o as a Judge
Tables 14 and 15 present the templates for using GPT-4o as a judge [59] to assess result accuracy (Acc.) and caption similarity (Sim.). For accuracy evaluation, we prompt GPT-4o to compare the similarity between the ground-truth answer {answer gt} and the generated answer {answer pred}. Additionally, we provide three few-shot examples as references for the model.
For caption similarity assessment, GPT-4o directly compares the human-written caption {truth} with the model-generated caption {generated}. To further evaluate GPT-4o's judging performance, we conducted a human experiment. As shown in Table 16, two knowledgeable players independently evaluated 200 randomly selected samples of GPT-4o's judgments across all experiments for each model. A judgment was considered correct if both evaluators agreed with it. Our findings indicate that, while there are some variations across models, GPT-4o demonstrates high overall judgment accuracy (0.926). Although the caption similarity score is lower, it remains sufficiently high for such a subjective task. Overall, the results show that using GPT-4o as a judge is highly feasible.
Table 16: Human evaluation of GPT-4o judgment accuracy. Each model's generated answers and captions are evaluated on 200 randomly selected samples by two knowledgeable players.
| Model | Answer (Acc.) | Caption (Sim.) |
| --- | --- | --- |
| GPT-4o [1] | 0.925 | 0.865 |
| Claude 3.7 Sonnet [2] | 0.900 | 0.840 |
| Ovis2-16B [36] | 0.955 | 0.810 |
| Average | 0.926 | 0.838 |
8.4 Additional Experiments for MH Benchmark
In Table 17, we present the impact of incorporating the monster's name (Name) and additional information (Extra) as part of the input question $q$. The metric Top. represents the accuracy of the model in selecting the correct topic entity as the retrieval root. We observe that removing the monster's name leads to a significant performance drop due to incorrect root entity selection (low Top.).
Additional information refers to contextual hints, such as a monster being angry, which players can infer from the game's text. These details are generally too subtle for MLLMs to capture from images. Removing only the additional information also results in an obvious performance drop, indicating that such visually independent cues are essential for the model to generate the correct answer. Interestingly, with additional information alone, Top. improves over the setting without either Name or Extra.
Table 17: Impact of including the monster name (Name) and extra information (Extra) in the question. ✓ means such information is included.
| Name | Extra | Unaided-Online Acc. | Pre. | Rec. | Top. |
| --- | --- | --- | --- | --- | --- |
| | | .2731 | .1251 | .2413 | .5210 |
| | ✓ | .3781 | .2080 | .4434 | .7365 |
| ✓ | | .4075 | .2120 | .4636 | 1 |
| ✓ | ✓ | .5105 | .2756 | .5625 | 1 |
Table 18 reports the average retrieval time (in seconds), the number of agent calls (n rounds), and the per-call response time, in the format mean ± std. Experiments were conducted using GPT-4o and Gemini 2.0 Flash via API, and InternVL2.5 on a local GPU server with two A6000 GPUs. The results reveal efficiency as a limitation of the current agent pipeline. More results will be included.
Table 18: Average computational cost per sample.
| Model | Retrieval time (s) | n rounds | Per-call time (s) |
| --- | --- | --- | --- |
| GPT-4o | 92.46 ± 68.93 | 7.20 ± 5.01 | 10.92 ± 9.70 |
| Gemini 2.0 Flash | 17.15 ± 10.92 | 11.04 ± 9.42 | 1.12 ± 0.91 |
| InternVL | 57.06 ± 41.78 | 9.32 ± 3.58 | 7.33 ± 6.95 |
We also perform ablation studies to assess the impact of using cross-model combinations for two key agents, the Summarizer (knowledge utilization) and the Validation agent (knowledge retrieval), while keeping the other agents fixed. Table 19 shows Acc. results across GPT-4o, Claude 3.7, and InternVL2.5, with diagonal values representing the results of the original single-model pipeline. Replacing the Summarizer yields little change between GPT-4o and Claude, indicating that performance gains stem more from the quality of the retrieved knowledge than from summarization strength (in InternVL's column, where better knowledge is available, the improvement is more evident than along the rows, shown in green). In contrast, using a weaker model (InternVL) for Validation causes a sharp performance drop (in red), underscoring the importance of this role. Yet, upgrading only the Validation agent in the InternVL pipeline brings limited benefit, suggesting that the other retrieval-stage agents also have a considerable effect.
Table 19: Ablation for cross-models agent pipeline.
| Pipeline \ Agent | Summarizer: GPT-4o | Summarizer: Claude | Summarizer: InternVL | Validation: GPT-4o | Validation: Claude | Validation: InternVL |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | .5105 | .4994 | .4864 | .5105 | .4716 | .4128 |
| Claude | .4510 | .4338 | .4086 | .5052 | .4338 | .3676 |
| InternVL | .3876 | .3624 | .3080 | .3413 | .3225 | .3080 |
8.5 More Result Samples
This section presents randomly selected examples of answers generated by various models.
Figure 10 shows a sample of continued attack action recognition for "Glavenus". Both GPT-4o and Claude 3.7 output a wrong answer, although GPT-4o retrieves the path towards the true knowledge, showing that the models lack the ability to comprehend the knowledge.
Figure 11 shows a sample of attack action recognition for "Bazelgeuse". Despite some differences in their responses, both GPT-4o and Gemini 1.5 Pro generate the correct answer. GPT-4o finds more paths for its knowledge augmentation.
Figure 12 shows a sample of attack action recognition for "Barroth". Both GPT-4o and Claude 3.7 generate the correct answer; however, GPT-4o's answer is clearer, showing better instruction-following ability.
<details>
<summary>x11.png Details</summary>

Figure content (transcribed): "III: Combo Premonition" sample for Glavenus, with the prompt "… screen, Glavenus seem to have continues action within this attack action. Describe the continues action." GPT-4o retrieves the paths "Heated Tailspin", "Slam Slice Tail Scrape", and "Sword Swing" (Recall: 1, Precision: 0.33); Claude 3.7 Sonnet retrieves only "Heated Tailspin" (Recall: 0, Precision: 0). The augmented answers are "Glavenus performs the 'Heated Tailspin' attack…" (GPT-4o) and "After Glavenus initiates its Heated Tailspin attack…" (Claude 3.7 Sonnet).
</details>
Figure 10: A sample of "Glavenus" continued attack recognition.
<details>
<summary>x12.png Details</summary>

Figure content (transcribed): "II: Attack Recognition" sample with the query "What will Bazelgeuse do after flying in this attack action?" GPT-4o retrieves paths including "Body Slam → Normal Phase → Carpet Bombing" and "Charge Attack → Normal Phase → Shoulder Charge" (Recall: 1, Precision: 0.25); Gemini 1.5 Pro retrieves "Air Crush → Normal Phase → Carpet Bombing" (Recall: 1, Precision: 0.5). The augmented answers are "Bazelgeuse will likely perform the 'Carpet Bombing' attack, dropping scales from the air as it glides over the battlefield." (GPT-4o) and "It will continue to drop more scales as part of its Carpet Bombing attack." (Gemini 1.5 Pro).
</details>
Figure 11: A sample of "Bazelgeuse" attack action recognition.
<details>
<summary>x13.png Details</summary>

Figure content (transcribed): attack recognition sample with the query "Tell me what is the specific name of attack action that Barroth is performing?" GPT-4o retrieves "Shoulder Charge → Head Slam", "Mud Attack", and "Rush" (Recall: 1, Precision: 0.25); Claude 3.7 Sonnet retrieves "Shoulder Charge → Head Slam" and "Rush" (Recall: 1, Precision: 0.33). The augmented answers are "Head Slam." (GPT-4o) and "Based on the image, the Barroth appears to be performing the 'Barroth Head Slam' attack. The monster is in a stationary position with its…" (Claude 3.7 Sonnet).
</details>
Figure 12: A sample of "Barroth" attack action recognition.