# RynnEC: Bringing MLLMs into Embodied World
**Authors**: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li
¹DAMO Academy, Alibaba Group  ²Hupan Lab  ³Zhejiang University  *Equal contribution
(November 18, 2025)
Abstract
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric-video-based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Robotic Task Planning & Visual Question Answering
### Overview
The image depicts a diagram illustrating a robotic task planning and visual question answering system. It showcases a Large Language Model (LLM) integrating visual information from a "Visual Encoder" and "Mask Encoder" to respond to questions about a scene. The scene appears to be a kitchen setting with various objects like a window, a pan, a teddy bear, a pillow, and cleaning supplies. The diagram is structured around a series of steps (Step 1-Step 10) representing questions posed to the system, along with the system's responses. A heatmap at the bottom visualizes attention weights.
### Components/Axes
The diagram consists of the following key components:
* **Visual Encoder:** Processes visual information from the scene.
* **Mask Encoder:** Processes mask information (object segmentation) from the scene.
* **Large Language Model (LLM):** Integrates visual and mask information to generate responses.
* **Mask Decoder:** Decodes the mask token.
* **Question Prompts (Step 1-Step 10):** Textual questions posed to the system.
* **System Responses (A):** The LLM's answers to the questions.
* **Heatmap:** A 10x10 grid visualizing attention weights, with color intensity indicating the strength of attention. The axes are not explicitly labeled, but represent the spatial dimensions of the scene being analyzed.
* **Scene Image:** A photograph of a kitchen scene with various objects.
### Detailed Analysis or Content Details
**Question & Answer Pairs** (the figure's visual mask tokens do not survive transcription; they are rendered below as <object> for referenced regions and <mask> for segmentation outputs):
* **Step 1:** "How many panes of glass are there on the window?" A: "Three"
* **Step 2:** "How wide is the pad on the table? Can it be placed on the bookshelf without falling?" A: "The width of <object> is 20 cm, and it can be placed on the bookshelf."
* **Step 3:** "Where are the teddy bear and the pillow located?" A: "The teddy bear is <mask> and the pillow is <mask>"
* **Step 4:** "Do my feet need to move in the process of moving the teddy bear onto the pillow?" A: "<object> needs to be moved 2.1 meters onto <object>"
* **Step 5:** "Storage box on the bottom bookshelf?" A: "<mask>"
* **Step 6:** "Where is the degreasing spray bottle?" A: "<mask> is the degreasing spray bottle"
* **Step 7:** "I am going to clean the kitchen windowsill. Which of the two objects, <object> or <object>, needs to be removed?" A: "<object> needs to be removed"
* **Step 8:** "Where is the soy sauce?" A: "<mask> is the soy sauce"
* **Step 9:** "What is the function of <object>?" A: "This object is a wok, which is used for stir-frying."
* **Step 10:** "How far is <object> from me? Can I reach it without moving?" A: "<object> is 1.3 meters away from me, and my arm is only 0.8 meters long, so I need to walk up to it to pick it up."
**Heatmap Analysis:**
The heatmap displays attention weights. The color scale ranges from dark blue (low attention) to red (high attention). The heatmap appears to be centered around the objects in the scene.
* **Row 1:** Weights are highest in columns 1-3, then decrease.
* **Row 2:** Weights are highest in columns 2-4, then decrease.
* **Row 3:** Weights are highest in columns 3-5, then decrease.
* **Row 4:** Weights are highest in columns 4-6, then decrease.
* **Row 5:** Weights are highest in columns 5-7, then decrease.
* **Row 6:** Weights are highest in columns 6-8, then decrease.
* **Row 7:** Weights are highest in columns 7-9, then decrease.
* **Row 8:** Weights are highest in columns 8-10, then decrease.
* **Row 9:** Weights are highest in columns 9-10, then decrease.
* **Row 10:** Weights are highest in column 10, then decrease.
**Task Descriptions:**
* **Task 1:** "You need to first stick window stickers onto each pane of glass. Then, tidy up the table by placing the pad on the bookshelf and putting the teddy bear on the pillow. Finally, use the storage box on the bottom bookshelf to organize the smaller items on the table."
* **Task 2:** "I need to stir-fry; please pour some soy sauce into the pan, turn on the heat, and cover it with a lid. Then, use the degreasing spray bottle to clean the kitchen windowsill."
### Key Observations
* The system demonstrates the ability to answer questions about object locations, quantities, and functions within a visual scene.
* The heatmap suggests a diagonal attention pattern, with the most-attended region shifting steadily across the scene.
* The system can reason about distances and reachability for robotic manipulation.
* Some responses consist of mask outputs rather than text (e.g., Step 5), so they appear empty in transcription.
* The system can generate multi-step task instructions.
### Interpretation
This diagram illustrates a sophisticated robotic system capable of integrating visual perception with natural language understanding. The LLM acts as a central reasoning engine, leveraging information from both the visual and mask encoders to answer questions and plan tasks. The heatmap provides insight into the system's attention mechanism, revealing how it focuses on different parts of the scene to extract relevant information. The sequential attention pattern observed in the heatmap suggests the system may be scanning the scene in a systematic manner. The ability to reason about distances and reachability is crucial for robotic manipulation tasks. The incomplete responses highlight potential limitations of the system, possibly related to the complexity of the scene or the ambiguity of the questions. Overall, the diagram demonstrates a promising approach to building intelligent robots that can interact with the world in a more natural and intuitive way. The system is likely being used for research into embodied AI, where robots learn to perform tasks by interacting with their environment and responding to human instructions. The use of both visual and mask information suggests the system is capable of not only recognizing objects but also understanding their spatial relationships and boundaries.
</details>
Figure 1: RynnEC is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks. It can accept inputs interwoven from video, region masks, and text, and produce output in the form of text or masks based on the question. RynnEC is capable of addressing a diverse range of object and spatial questions within embodied contexts and plays a significant role in indoor embodied tasks.
1 Introduction
In recent years, Multi-modal Large Language Models (MLLMs) Wu et al. [2023], Zhang et al. [2024a] have experienced rapid development, leading to the emergence of models such as Gemini Team et al. [2024] and GPT-4o OpenAI et al. [2024] that can handle image and even video inputs. These MLLMs are attracting increasing attention from researchers due to their powerful contextual understanding Doveh et al. [2025] and generalization Zhang et al. [2024c] abilities. Researchers in embodied intelligence are also beginning to explore the use of MLLMs as the brains of robots Han et al. [2025b], Jin et al. [2024], enabling them to perceive the real world through visual inputs like humans do. However, the current mainstream MLLMs are trained on extensive internet images and lack the foundational visual cognition to match the physical world Dang et al. [2025], Yuan et al. [2025b].
Some works have begun exploring how MLLMs can be applied to ego-centric embodied scenarios. Models like Exo2Ego Zhang et al. [2025b] and EgoLM Hong et al. [2025] enhance the understanding of ego-centric dynamic environment interactions. SpatialVLM Chen et al. [2024a] and SpatialRGPT Cheng et al. [2024a] focus on addressing spatial understanding challenges within embodied contexts. However, these approaches are challenging to directly implement in physical robots to perform complex tasks. The main limitations are as follows:
1. Lack of flexible visual interaction: In complex embodied scenarios, relying solely on textual communication is prone to ambiguity or vagueness. Direct visual interaction references, such as masks or points, can more accurately and flexibly index entities within a scene, facilitating precise task execution.
1. Insufficient detailed understanding of objects: During task execution, objects typically serve as the smallest operational units, making comprehensive and detailed understanding of objects crucial. As illustrated in Task 1 Step 1 in Fig. 1, recognizing the number of panes in a window is essential to determine the quantity of window decals needed.
1. Absence of video-based coherent spatial awareness: For humans, spatial cognition arises from continuous visual perception Pasqualotto and Proulx [2012]. Current methods in spatial intelligence Zhang et al. [2025c], Xu et al. [2025] primarily focus on single or discrete images, lacking the capacity for spatial understanding in high-continuity videos. For example, in Task 1 Step 4 in Fig. 1, the absolute distance between the teddy bear and the pillow requires a spatial scale concept derived from the entire video to be properly inferred.
Thus, we propose RynnEC, an embodied cognitive MLLM designed to enhance robotic understanding of the physical world. As illustrated in Fig. 1, RynnEC is a large video understanding model whose visual encoder and foundational parameters are derived from VideoLLaMA3 Zhang et al. [2025a]. To enable flexible visual interaction, we incorporate an encoder and decoder specifically for region masks in videos, allowing RynnEC to achieve precise instance-level comprehension and grounding.
Within this framework, RynnEC is designed to perform diverse cognitive tasks in embodied scenarios. We categorize embodied cognitive abilities into two essential components: object cognition and spatial cognition. Object cognition necessitates MLLMs’ understanding of object attributes, quantities, and their relationships with the environment, alongside accurate object grounding. Spatial cognition is further divided into world-centric and ego-centric perspectives. World-centric spatial cognition requires the model to grasp absolute scales and relative positions within scenes, as exemplified by object size estimations in Task 1 Step 2 (Fig. 1). Ego-centric spatial cognition connects the robot’s physical embodiment with the world, thereby assisting in behavioral decisions. For example, as depicted in Fig. 1, the reachability estimation in Task 2 Step 3 and the orientation estimation in Task 2 Step 5 assist the robot in clearly defining its relationship with interactive objects. Equipped with enhanced object and spatial reasoning, RynnEC supports more efficient execution of complex, real-world robotic tasks.
Regrettably, the development of embodied cognition models has been slow due to a lack of ego-centric videos and high-quality annotations. Efforts such as Multi-SpatialMLLM Xu et al. [2025], Spatial-MLLM Wu et al. [2025a], and SpaceR Ouyang et al. [2025] leverage open-source datasets with comprehensive 3D point clouds and annotations to generate training data. However, in an era of scarce 3D annotations Hou et al. [2025], Lyu et al. [2024], this approach cannot achieve rapid and cost-effective expansion of data scale. Hence, we propose a data generation pipeline that transforms ego-centric RGB videos into embodied cognition question-answering datasets. This pipeline begins with instance segmentation from videos and diverges into two branches: one generating object cognition data and the other producing spatial cognition data. Ultimately, data from both branches are integrated into a comprehensive embodied cognition dataset. From over 200 households, we collect more than 20,000 egocentric videos. A subset from ten households is manually verified and balanced to create RynnEC-Bench, a fine-grained embodied cognition benchmark encompassing 22 tasks in object and spatial cognition.
Extensive experiments demonstrate that RynnEC significantly outperforms both general OpenAI et al. [2024], Bai et al. [2025], Zhu et al. [2025] and task-specific Yuan et al. [2025a, c], Team et al. [2025] MLLMs in cognitive abilities within embodied scenarios, showcasing scalable application potential. Additionally, we observe notable advantages in multi-task training with RynnEC and identify preliminary signs of emergence in more challenging embodied cognition tasks. Finally, we highlight the potential of RynnEC in facilitating robots to undertake large-scale, long-range tasks.
2 Related Work
2.1 MLLMs for Video Understanding
Early MLLMs primarily relied on sparse sampling and simple connectors, such as MLPs Lin et al. [2023], Ataallah et al. [2024], Maaz et al. [2023] and Q-Formers Zhang et al. [2023], Li et al. [2024b], to integrate visual representation with large language models. Subsequently, to tackle the problem of long video understanding, Zhang et al. [2024b] directly expanded the context window of language models, while Zhang et al. [2024d] introduced pooling in the spatial and temporal dimensions to compress the number of video tokens. As the need for more fine-grained understanding emerged, some studies (VideoRefer Yuan et al. [2025c], DAM Lian et al. [2025] and PAM Lin et al. [2025]) employed region-level feature encoders enabling video MLLMs to accept masked inputs and comprehend the semantic features of objects within the masks. Although these video MLLMs have demonstrated superior capabilities in high-level semantic capture and temporal modeling, they lack robust physical-world comprehension in egocentric embodied scenarios.
2.2 Embodied Scene Understanding Benchmarks
Some studies Ren et al. [2024a], Li et al. [2024c], Han et al. [2025a] have begun to explore leveraging MLLMs to assist robots in solving embodied tasks. However, determining whether these MLLMs possess the ability to understand and interact with the physical world is challenging. Consequently, several benchmarks have emerged to evaluate the capability of MLLMs to perceive the physical world. OpenEQA Majumdar et al. [2024] and IndustryEQA Li et al. [2025a] focus on several key competencies in home and industrial settings, respectively, with manually designed open-vocabulary questions. VSI-Bench Yang et al. [2025c] centers on assessing the spatial cognitive abilities of MLLMs. STI-Bench Li et al. [2025b] introduces more complex kinematic (e.g., velocity) problems. ECBench Dang et al. [2025] systematically categorizes embodied cognitive abilities into static environments, dynamic environments, and overcoming hallucinations, offering a comprehensive evaluation across 30 sub-competencies. While these benchmarks encompass a wide range of abilities, they are unable to assess more fine-grained, region-level understanding capabilities in embodied scenarios. Compared to purely textual question-answering, region-level visual interaction can more accurately refer to targets in the complex real world.
2.3 Improving MLLMs for Embodied Cognition
The aforementioned embodied benchmarks have highlighted the cognitive limitations of current MLLMs in embodied scenarios. Consequently, some studies have started to investigate diverse strategies for enhancing MLLMs’ understanding of the physical world. GPT4Scene Qi et al. [2025] improves MLLMs’ consistent global scene understanding by explicitly adding instance marks between video frames. SAT Ray et al. [2024] explores multi-frame dynamic spatial reasoning in simulated environments. Spatial-MLLM Wu et al. [2025a], Multi-SpatialMLLM Xu et al. [2025], and SpaceR Ouyang et al. [2025] leverage 3D datasets with detailed annotations (e.g., ScanNet Yeshwanth et al. [2023]) to construct the suite of spatial-intelligence tasks introduced in VSI-Bench. In contrast, our data generation pipeline based on RGB videos yields more realistic and scalable training data. More importantly, RynnEC is designed not just to handle selected capabilities in embodied scenarios, but to cover a broad swath of the world cognition required for embodied task execution under a single paradigm.
3 Methodology
RynnEC is a robust video embodied cognition model capable of processing and outputting various video object proposals. This enables it to flexibly address embodied questions about objects and space. Due to a paucity of research in this domain, we comprehensively present the construction process of RynnEC from four perspectives: data generation (Sec. 3.1), evaluation framework establishment (Sec. 3.2), model architecture (Sec. 3.3), and training (Sec. 3.4).
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Visual Representation of a Multi-Stage Object and Spatial QA Pipeline
### Overview
The image depicts a diagram illustrating a pipeline for generating object and spatial Question Answering (QA) systems. The pipeline takes video instance segmentation as input and progresses through keyframe extraction, object description generation, and spatial reasoning to ultimately produce answers to questions about objects and their spatial relationships. The diagram is divided into three main sections: Video Instance Segmentation, Keyframe of Objects, and Generate Spatial QA. Each section is visually separated by colored backgrounds (blue, orange, and teal respectively).
### Components/Axes
The diagram doesn't have traditional axes, but it features several key components and labels:
* **Video Instance Segmentation:** Labeled "40s" indicating a time duration. Shows a series of video frames.
* **Keyframe of Objects:** Contains images of a footstool, labeled "Qwen2.5-VL" and "Qwen3". Includes a "Prompt" box and a text description.
* **Generate Object QA:** Contains images of a child and a sofa. Includes "Object Referring Expression" label.
* **Extract Object Name:** Labeled "Grounding DINO" and "Segment Anything". Shows a series of video frames.
* **MASt3R-SLAM + Mask 2D to 3D:** A 3D point cloud representation with "Start Pos" and "End Pos" markers.
* **Ground Level Calibration:** A 3D representation of the footstool with X, Y, and Z axes labeled.
* **Template:** Labeled "Template"
* **Spatial Cognition Question:** Labeled "Spatial Cognition Question"
* **QA Examples:** Several question-answer pairs are provided within the "Generate Object QA" and "Generate Spatial QA" sections.
### Detailed Analysis or Content Details
**Video Instance Segmentation:**
This section shows a sequence of video frames, suggesting the input to the pipeline is a video stream. The "40s" label indicates the video duration being processed.
**Keyframe of Objects:**
* Two images of a footstool are shown, labeled "Qwen2.5-VL" and "Qwen3".
* A "Prompt" box is present, likely representing the input to a language model.
* The text description associated with "Qwen2.5-VL" reads: "The object is a small footstool. It has a rectangular shape with rounded corners and is upholstered in a dark-colored material, likely leather or a leather-like fabric."
* Two QA examples are provided:
* Q: "Is the <object> currently being exposed to sunlight?" A: "Yes"
* Q: "How many legs does the <object> have?" A: "4"
**Extract Object Name:**
* Labeled "Grounding DINO" and "Segment Anything".
* Shows a series of video frames.
* Labeled "one second interval"
**MASt3R-SLAM + Mask 2D to 3D:**
* A 3D point cloud representation of a scene is shown, with red lines indicating the trajectory of a camera or sensor.
* "Start Pos" and "End Pos" markers are visible, indicating the beginning and end points of the trajectory.
**Ground Level Calibration:**
* A 3D representation of the footstool is shown, with X, Y, and Z axes labeled. This suggests a calibration process to align the 3D model with the real world.
**Generate Spatial QA:**
* Two QA examples are provided:
* (Egocentric): Q: "Which of the two objects, object1 or object2, is closer to me?" A: "<object1>"
* (World-Centric): Q: "What is the difference in height above the ground between <object1> and <object2>?" A: "1.2 meters"
### Key Observations
* The pipeline appears to be designed for understanding objects and their spatial relationships within a video scene.
* The use of 3D reconstruction ("MASt3R-SLAM + Mask 2D to 3D") suggests the system aims to create a geometric understanding of the environment.
* The QA examples demonstrate the system's ability to answer both egocentric (relative to the viewer) and world-centric (absolute) spatial questions.
* The "Prompt" box in the "Keyframe of Objects" section suggests the use of a language model to generate object descriptions.
### Interpretation
This diagram illustrates a sophisticated system for visual question answering, going beyond simple object recognition to incorporate spatial reasoning. The pipeline leverages video instance segmentation to identify objects, extracts keyframes for detailed analysis, and uses 3D reconstruction to understand the spatial layout of the scene. The generated object descriptions and spatial QA capabilities suggest the system could be used for applications such as robotic navigation, virtual assistants, and scene understanding. The inclusion of both egocentric and world-centric questions indicates a focus on providing contextually relevant answers. The pipeline appears to be modular, with each stage performing a specific task, allowing for potential improvements and customization. The use of "Grounding DINO" and "Segment Anything" suggests the use of state-of-the-art segmentation models. The overall design suggests a system capable of complex scene understanding and interaction.
</details>
Figure 2: Embodied Cognition Question-Answer (QA) Data Generation Pipeline: First, objects within the scene are segmented from the video. Subsequently, object and spatial QA pairs are generated via two distinct branches.
3.1 Embodied Cognition Data Generation
Our embodied cognition dataset construction (Fig. 2) begins with egocentric video collection and instance segmentation. One branch employs a human-in-the-loop streaming generation approach to construct various object cognition QA pairs. The other branch utilizes a monocular dense 3D reconstruction method and diverse question templates to generate spatial cognition task QA pairs.
3.1.1 Video Collection and Instance Segmentation
Our egocentric video collection encompasses more than 200 houses, with approximately 100 videos per house. To ensure video quality, we require a resolution of at least 1080p and a frame rate of no less than 30 fps, using a gimbal to maintain shooting stability. To achieve diversity among different video trajectories, each house is divided into multiple zones, with filming trajectories categorized into single-zone, dual-zone, and tri-zone types. Cross-zone filming enhances diversity by altering the sequence of traversed zones. Additionally, we randomly vary lighting conditions and camera height under different trajectories. We require that each video includes both vertical and horizontal rotations, as well as at least two close-ups of objects, simulating the variable field of view in robotic task execution. Ultimately, we collect 20,832 egocentric videos of indoor movement. To control video length, these videos are segmented every 40 seconds.
Previous works Luo et al. [2025], Wang et al. [2024] adopted a strategy of designing separate data generation processes for each task type, leading to limited data reusability and continuity. We aim to create a lineage among different types of foundational data to reduce unnecessary redundancy in data generation. Therefore, this paper proposes a mask-centric embodied cognition QA generation pipeline. This pipeline initiates with the generation of object masks from video instance segmentation within a scene. First, Qwen2.5-VL Bai et al. [2025] observes the raw video and outputs an object list containing the names of all entity categories in the scene. Utilizing this object list, Grounding DINO 1.5 Ren et al. [2024b] detects objects in key frames at one-second intervals. SAM2 Ravi et al. [2024] assists in segmenting and tracking the objects detected by Grounding DINO 1.5 during the intervening one-second intervals. To ensure consistency of instance IDs, the tracking results of old instances are compared with the segmentation results of newly detected instances at key frames. If an instance is found to have overlapping masks (IoU > 0.5), it retains the ID of the old tracking instance. Due to the performance limitations of Grounding DINO 1.5, newly detected object instances may have already appeared in preceding frames yet been missed. Thus, SAM2 conducts a reverse four-second instance tracking for each new object in key frames, thereby achieving full-lifecycle instance tracking. In total, we obtain 1.14 million video instance masks from all the egocentric videos.
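The key-frame ID-matching step above can be sketched as a simple IoU test between the tracker's carried-over masks and the detector's newly segmented masks. This is a minimal illustration: the 0.5 threshold follows the text, while the data structures and greedy matching order are assumptions.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def assign_instance_ids(tracked, detected, iou_thresh=0.5, next_id=0):
    """Match newly detected masks against tracked masks at a key frame.

    tracked:  dict {instance_id: boolean mask} carried over by the tracker
    detected: list of boolean masks from the new-frame detector
    Returns a list of (instance_id, mask) pairs and the updated next_id.
    """
    results = []
    for mask in detected:
        best_id, best_iou = None, 0.0
        for inst_id, t_mask in tracked.items():
            iou = mask_iou(mask, t_mask)
            if iou > best_iou:
                best_id, best_iou = inst_id, iou
        if best_iou > iou_thresh:
            results.append((best_id, mask))   # keep the old tracking ID
        else:
            results.append((next_id, mask))   # treat as a new instance
            next_id += 1
    return results, next_id
```

In the full pipeline, unmatched detections would additionally trigger the reverse four-second tracking described above before a new ID is finalized.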
3.1.2 Object QA Generation
In this work, we generate three types of object-related tasks: object captioning, object comprehension QA, and referring video object segmentation. For each instance, we first divide all frames containing the instance into eight equal parts in chronological order. Within each frame group, an instance key frame is selected based on two factors: the size of the instance in the frame and the distance between the instance center and the frame center. Consequently, each instance is associated with eight instance key frames, featuring good instance visibility and diverse viewing angles. Half of these frames have the instance cropped out using a mask, while the other four highlight the instance using a red bounding box and background dimming technique. The final set of object cue images is displayed within the blue box in Fig. 2.
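The key-frame selection described above can be sketched as a per-frame score combining instance size and centeredness, maximized within each of the eight chronological groups. The exact weighting is not specified in the text, so the scoring function below is an illustrative assumption.

```python
import numpy as np

def keyframe_score(mask: np.ndarray, area_weight: float = 0.5) -> float:
    """Score one frame for an instance: larger and more centered is better.

    mask: boolean (H, W) instance mask in that frame. The weighting between
    size and centeredness is an assumed, illustrative choice.
    """
    h, w = mask.shape
    area = mask.sum() / (h * w)                      # normalized instance size
    if area == 0:
        return 0.0
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean() / h, xs.mean() / w            # instance center in [0, 1]
    dist = np.hypot(cy - 0.5, cx - 0.5)              # distance to frame center
    centeredness = 1.0 - dist / np.hypot(0.5, 0.5)   # 1 at center, 0 at corner
    return area_weight * area + (1 - area_weight) * centeredness

def pick_keyframes(frames_masks, n_groups=8):
    """Split a chronological list of (frame_idx, mask) pairs into n_groups
    chunks and pick the best-scoring frame in each, yielding diverse views."""
    groups = np.array_split(np.arange(len(frames_masks)), n_groups)
    picks = []
    for g in groups:
        if len(g) == 0:
            continue
        best = max(g, key=lambda i: keyframe_score(frames_masks[i][1]))
        picks.append(frames_masks[best][0])
    return picks
```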
Due to the limitation of SAM2 in consistent object tracking in egocentric videos, the same instance may be assigned multiple IDs if it appears intermittently in the video. We employ an object category filtering method that limits each video to a maximum of two instances per object category, thereby minimizing duplicate instances. The presence of multiple video segments per house leads to repeated occurrences of certain salient objects, causing a pronounced long-tail distribution. We downsample frequently occurring object categories to prevent an extreme object distribution. After the aforementioned filtering, the cue image sets of retained instances are input into Qwen2.5-VL Bai et al. [2025], generating object captions and object comprehension QA through various prompts. Notably, within object comprehension QA, the counting task is unique and requires specially designed prompts. Subsequently, based on each instance's caption and QAs, Qwen3 Yang et al. [2025a] generates two types of referring expressions: simple referring expressions and situational referring expressions. Simple referring expressions identify objects through a combination of features such as spatial location and category. Situational referring expressions establish a task scenario, requiring the model to infer the instance needed by the user within this context. Each type of QA undergoes manual filtering post-output to ensure data quality. Detailed prompts are provided in Appendix A.2.
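The two filtering steps above (a per-video cap per category, then global downsampling of frequent categories) might be sketched as follows. The cap values and instance record fields are assumptions for illustration.

```python
import random
from collections import defaultdict

def cap_per_category(instances, per_video_cap=2, global_cap=500, rng=None):
    """instances: list of dicts with 'video', 'category', 'id' keys.

    First cap instances per category within each video (guarding against
    duplicate IDs from intermittent tracking), then downsample globally
    frequent categories to flatten the long-tail distribution.
    """
    rng = rng or random.Random(0)
    kept, per_video = [], defaultdict(int)
    for inst in instances:
        key = (inst["video"], inst["category"])
        if per_video[key] < per_video_cap:
            per_video[key] += 1
            kept.append(inst)
    by_cat = defaultdict(list)
    for inst in kept:
        by_cat[inst["category"]].append(inst)
    out = []
    for cat, group in by_cat.items():
        if len(group) > global_cap:
            group = rng.sample(group, global_cap)  # random downsampling
        out.extend(group)
    return out
```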
3.1.3 Spatial QA Generation
Unlike object QA, spatial QA requires more precise 3D information concerning the global scene context. Therefore, we utilize MASt3R-SLAM Murai et al. [2025] to reconstruct 3D point clouds from RGB videos and obtain camera extrinsic parameters. Subsequently, by projecting 2D pixel points to 3D coordinates, the segmentation of each instance in the video can be mapped onto the point cloud. However, it is important to note that the world coordinate system established by MASt3R-SLAM for the 3D point cloud is not aligned with the floor. Therefore, the Random Sample Consensus (RANSAC) Fischler and Bolles [1981] algorithm is implemented to identify inlier points for plane fitting through ten iterative executions. In each iteration, the detected planar surface and its inliers are removed from the point cloud for subsequent plane detection. Given that the initial camera pose was approximately horizontal but not perpendicular to the ground, the ground plane is selected based on minimal angular deviation between its normal vector and the initial camera Y-axis orientation. The point cloud is then aligned to ensure orthogonality between the world coordinate Z-axis and the detected ground plane.
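A minimal numpy sketch of this ground-alignment step follows: iterative RANSAC plane detection with inlier removal, then selecting the plane whose normal deviates least from the initial camera Y axis. Iteration counts and distance thresholds are illustrative assumptions, and the final rotation that makes the world Z-axis orthogonal to the ground is omitted.

```python
import numpy as np

def ransac_plane(points, n_iters=200, thresh=0.02, rng=None):
    """Fit a single plane n.x + d = 0 with RANSAC; return (unit normal, d,
    boolean inlier mask) for the candidate with the most inliers."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_inliers, best_plane = None, None
    for _ in range(n_iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:          # degenerate (collinear) sample, skip
            continue
        n = n / norm
        d = -n @ p[0]
        inliers = np.abs(points @ n + d) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane[0], best_plane[1], best_inliers

def find_ground_plane(points, camera_y, n_planes=10):
    """Detect up to n_planes planes, removing each plane's inliers before
    the next round, and keep the plane whose normal is closest in angle to
    the initial camera Y axis."""
    camera_y = camera_y / np.linalg.norm(camera_y)
    remaining = points.copy()
    best, best_cos = None, -1.0
    for _ in range(n_planes):
        if len(remaining) < 3:
            break
        n, d, inliers = ransac_plane(remaining)
        cos = abs(n @ camera_y)
        if cos > best_cos:
            best, best_cos = (n, d), cos
        remaining = remaining[~inliers]   # remove this plane for next round
    return best
```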
The RynnEC dataset encompasses 10 fundamental spatial abilities, each of which is further divided into quantitative and qualitative variants. We construct spatial QA in a template-based manner. Diverse QA templates are designed according to the characteristics of each task, and the missing attributes within the templates (e.g., distance, height) are calculated from the 3D point cloud. We denote each instance in the format <Object X>. Furthermore, to obtain purely textual spatial QA pairs, we replace <Object X> with the simple referring expressions generated in the object QA pipeline above. These texts are then further refined and diversified using GPT-4o, resulting in the final natural-language spatial QA data. With training on these data, RynnEC is able to answer spatial questions in various input forms. Examples of the generated spatial QAs are illustrated in Fig. 2, and more examples as well as detailed templates are provided in Appendix A.3.
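In code, the template-filling step might look like the following sketch. The template wording, task names, and the `attrs` structure are assumptions for illustration; computing the attribute values from the point cloud is omitted.

```python
TEMPLATES = {
    # Illustrative templates in the paper's <Object X> notation; the real
    # template set and measured attributes are far more extensive.
    "distance_quantitative": (
        "What is the distance between <Object {a}> and <Object {b}>?",
        "{dist:.1f} meters",
    ),
    "height_qualitative": (
        "Which is higher above the ground, <Object {a}> or <Object {b}>?",
        "<Object {winner}>",
    ),
}

def make_spatial_qa(task, a, b, attrs):
    """Fill one QA template with attributes computed from the 3D point cloud.

    attrs holds the measurements a template needs, e.g.
    {"distance": 1.23} or {"height": {1: 0.4, 2: 1.6}}.
    """
    q_tmpl, a_tmpl = TEMPLATES[task]
    question = q_tmpl.format(a=a, b=b)
    if task == "distance_quantitative":
        return question, a_tmpl.format(dist=attrs["distance"])
    if task == "height_qualitative":
        winner = a if attrs["height"][a] > attrs["height"][b] else b
        return question, a_tmpl.format(winner=winner)
    raise ValueError(f"unknown task: {task}")
```

For the purely textual variant, the `<Object X>` placeholders in the output would then be substituted with the simple referring expressions before GPT-4o paraphrasing.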
Building on insights from prior works Wu et al. [2025a], Ouyang et al. [2025], we recognize that spatial cognition tasks are highly challenging. Therefore, in addition to constructing a large-scale video-based spatial QA dataset, we also develop a relatively simpler image-based spatial QA dataset. This combination of tasks with varying levels of difficulty is intended to improve learning efficiency and enhance model robustness. Specifically, we collect 500k indoor images from 39k houses. Leveraging the single-image-to-3D reconstruction and calibration methods from SpatialRGPT Cheng et al. [2024a], we obtain the 3D spatial relationships between objects in each image. We then select tasks from the video-based spatial cognition set that can also be addressed via single images, and design corresponding QA templates. The format of the image-based spatial QA is kept consistent with that of the video-based spatial QA.
3.2 RynnEC-Bench
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Cognition Bench - RynnEC
### Overview
The image depicts a circular diagram titled "Cognition Bench - RynnEC". The diagram is segmented into various cognitive abilities, arranged radially around a central point. Each segment is associated with a question, an image, and an answer, demonstrating how the cognitive ability is applied in a real-world scenario. The diagram is surrounded by questions and answers related to visual reasoning and spatial understanding.
### Components/Axes
The diagram is structured as a circular wheel divided into segments representing different cognitive abilities. These segments include:
* **Direct Referring**
* **Situational Referring**
* **Referring Object Segmentation**
* **Historical Spatial Cognition**
* **Ego-Centric Spatial Cognition**
* **Spatial Cognition**
* **World-Centric Spatial Cognition**
* **Positional Relationship**
* **Object Properties Cognition**
* **Category**
* **Color**
* **Material**
* **Shape**
* **State**
* **Position**
* **Function**
* **Surface Detail**
* **Size**
* **Counting**
Each segment is associated with a question, an image, and an answer. The questions are positioned around the outer edge of the diagram, and the answers are placed near the corresponding questions. The images are located within the segments.
### Detailed Analysis or Content Details
Let's analyze the questions and answers surrounding the diagram:
1. **Q1:** If I want to travel and need to carry a lot of clothes, which item should I take?
* **Image:** A person holding a suitcase.
* **A1:** It is a suitcase.
2. **Q2:** Where is the silver suitcase with a black bag on top?
* **Image:** A scene with a silver suitcase and a black bag.
* **A:** It is.
3. **Q3:** What is the color of âť“?
* **Image:** A red object.
* **A:** The âť“ is red.
4. **Q4:** What is the function of âť“?
* **Image:** An ironing board.
* **A:** It is to provide a flat, heat-resistant surface for efficiently ironing clothes and removing wrinkles.
5. **Q5:** How many white clothes are near âť“?
* **Image:** A scene with clothes.
* **A:** 3.
6. **Q6:** Upon making a 90-degree left turn, how will âť“ be oriented with respect to you? âť“ will located at 11 o'clock direction.
* **Image:** A person walking.
* **A:** Is.
7. **Q7:** Is âť“ on your left front or left rear?
* **Image:** A person walking.
* **A:** Left rear.
8. **Q8:** How far have you walked in total?
* **Image:** A person walking.
* **A:** 2.3m.
9. **Q9:** Among the three objects âť“ âť“ and âť“ which one is the tallest?
* **Image:** Three objects of varying heights.
* **A:** Reaches the greatest height.
10. **Q10:** What is the approximate height of âť“?
* **Image:** An object.
* **A:** It is 0.6m tall.
11. **Q11:** Which is closer to âť“ âť“ or âť“?
* **Image:** Three objects.
* **A:** Is closer.
The central circle reads "Cognition Bench" and "RyynEC". The segments are color-coded, with shades of orange, yellow, and green.
### Key Observations
The diagram demonstrates a range of cognitive abilities, from basic object recognition (color, shape) to more complex spatial reasoning (historical, ego-centric, world-centric). The questions and answers are designed to test the ability to apply these cognitive skills in practical scenarios. The use of images alongside the questions and answers suggests a focus on visual reasoning. The questions are diverse, covering object properties, spatial relationships, and historical context.
### Interpretation
The "Cognition Bench - RyynEC" diagram appears to be a framework for evaluating and understanding different aspects of human cognition, particularly those related to visual perception and spatial reasoning. The arrangement of cognitive abilities in a circular format suggests that these abilities are interconnected and work together to enable intelligent behavior. The questions and answers serve as examples of how these abilities are used in everyday life. The diagram could be used as a tool for assessing cognitive function, developing AI systems, or simply for gaining a deeper understanding of how the human mind works. The use of question marks (âť“) in several questions suggests a need for visual input to answer the questions, emphasizing the importance of visual information in these cognitive processes. The diagram highlights the importance of both static object properties (color, shape) and dynamic spatial relationships (position, distance) in cognitive processing. The inclusion of "Historical Spatial Cognition" suggests an ability to recall and reason about past spatial experiences.
</details>
Figure 3: Overview of embodied cognition dimensions in RynnEC-Bench. RynnEC-Bench includes two subsets: object cognition and spatial cognition, evaluating a total of 22 embodied cognitive abilities.
As this work is the first to propose a comprehensive set of fine-grained embodied video tasks, a robust evaluation framework for assessing MLLMs’ overall capabilities in this domain is currently lacking. To address this, we propose RynnEC-Bench, which evaluates fine-grained embodied understanding models from the perspectives of object cognition and spatial cognition in open-world scenarios. Fig. 3 provides a detailed illustration of the capability taxonomy in RynnEC-Bench.
3.2.1 Capability Taxonomy
Object cognition is divided into two tasks: object properties cognition and referring object segmentation. During embodied task execution, robots often require a clear understanding of key objects’ functions, locations, quantities, surface details, relationships with the surrounding environment, etc. Accordingly, the object properties cognition task constructs questions comprehensively and meticulously along these aspects. In robotic manipulation and navigation, identifying operation instances and target instances is an essential step, and precise instance segmentation in videos is the most direct way to indicate the positions of these key objects. The referring object segmentation task is therefore categorized into direct referring and situational referring problems. Direct referring problems involve only combinations of descriptions of the instance, while situational referring problems are set within a scenario, requiring MLLMs to reason in order to identify the target object.
Spatial cognition requires MLLMs to derive a 3D spatial awareness from egocentric video. We categorize it into ego-centric and world-centric spatial cognition. Ego-centric spatial cognition maintains awareness of agent-environment spatial relations and supports spatial reasoning and mental simulation; by temporal scope, we consider past, present, and future cases. World-centric spatial cognition focuses on understanding the 3D layout and scale of the physical world, which we further evaluate in terms of size, distance, and positional relations.
3.2.2 Data Balance
The videos in RynnEC-Bench are collected from ten houses that do not overlap with those in the training set. When evaluating object cognition, we observe substantial variation in object-category distributions across houses, making results highly sensitive to which houses are sampled. To mitigate this bias and better reflect real-world deployment, we introduce a physical-world-based evaluation protocol. We first define a taxonomy of 12 coarse and 119 fine-grained indoor object categories. Using GPT-4o, we then estimate an empirical category-frequency distribution by parsing 500,000 indoor images from 39,000 houses; given the scale, this serves as a close approximation to real-world indoor object frequencies. Finally, we perform frequency-proportional sampling so that the object-category distribution in RynnEC-Bench closely matches the empirical distribution, enabling a more objective and realistic evaluation. Specifically, counting questions with answers of 1 or 2 are reduced by 50% to achieve a more balanced difficulty distribution. All QA pairs in RynnEC-Bench are further subjected to meticulous human screening to ensure high quality. Additional implementation details are available in Appendix B.
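The frequency-proportional sampling step can be sketched as follows (a minimal Python sketch; `proportional_sample` and the toy category pools are hypothetical, and rounding and quota conflicts are handled more carefully in practice):

```python
import random

def proportional_sample(pool_by_category, freq, n_total, seed=0):
    """Draw QA items so that category proportions follow the empirical
    category-frequency distribution `freq` (category -> probability)."""
    rng = random.Random(seed)
    sampled = []
    for cat, p in freq.items():
        pool = pool_by_category.get(cat, [])
        # Quota proportional to real-world frequency, capped by pool size.
        k = min(round(p * n_total), len(pool))
        sampled.extend(rng.sample(pool, k))
    return sampled
```

With `freq` estimated from the 500,000 parsed indoor images, the resulting benchmark distribution tracks real-world object frequencies.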
3.2.3 Evaluation Framework
The questions are categorized into three types based on the nature of their answers: numerical questions, textual questions, and segmentation questions. For numerical questions such as distance and direction estimation, we compute scores directly with task-specific formulas. For scale-related questions, Mean Relative Accuracy (MRA) Yang et al. [2025c], Everingham et al. [2010] is used. Specifically, given a model’s prediction $\hat{y}$ , ground truth $y$ , and a confidence threshold $\theta$ , the prediction is considered correct if the relative error, defined as $|\hat{y}-y|/y$ , is less than $1-\theta$ . As accuracy at a single confidence threshold only considers relative error within a narrow scope, MRA averages the relative accuracy across a range of confidence thresholds $\mathcal{C}=\{0.5,\,0.55,\,...,\,0.95\}$ :
$$
MRA=\frac{1}{|\mathcal{C}|}\sum_{\theta\in\mathcal{C}}\mathbb{I}\Bigg(\frac{|\hat{y}-y|}{y}<1-\theta\Bigg) \tag{1}
$$
where $\mathbb{I}(\cdot)$ is the indicator function. For angle-related questions, MRA is not suitable due to the cyclic nature of angular measurements; we therefore design a rotational accuracy (RoA) metric.
$$
RoA=1-\min\Bigg(\frac{\min\big(|\hat{y}-y|,\,360-|\hat{y}-y|\big)}{90},\,1\Bigg) \tag{2}
$$
RoA assigns a nonzero score only when the angular difference is less than 90 degrees, ensuring consistency in task difficulty across different settings.
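Eqs. 1 and 2 can be implemented directly (a minimal Python sketch; the function names are ours):

```python
def mra(pred, gt, thresholds=(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95)):
    """Mean Relative Accuracy (Eq. 1): average of the indicator that the
    relative error |pred - gt| / gt falls below 1 - theta."""
    return sum(abs(pred - gt) / gt < 1 - t for t in thresholds) / len(thresholds)

def roa(pred, gt):
    """Rotational accuracy (Eq. 2): linear credit for angular error below
    90 degrees, using the cyclic distance on [0, 360)."""
    diff = abs(pred - gt)
    diff = min(diff, 360 - diff)  # shorter way around the circle
    return 1 - min(diff / 90, 1)
```

For example, a 20-degree error between a prediction of 350° and a ground truth of 10° yields an RoA of about 0.78, while any error of 90° or more scores 0.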
Textual questions are further categorized into close-ended and open-ended questions. For the close-ended part, we prompt GPT-4o to assign a straightforward binary score of either 0 or 1. For the open-ended part, answers are scored by GPT-4o on a scale from 0 to 1 in increments of 0.2. This question-type-adaptive evaluation approach enables the metrics of RynnEC-Bench to be both precise and consistent.
For segmentation evaluation, prior work Yuan et al. [2025a], Yan et al. [2024] typically reports the $\mathcal{J}\&\mathcal{F}$ measure, combining region-overlap ( $\mathcal{J}$ ) and boundary-accuracy ( $\mathcal{F}$ ) scores. However, the conventional frame-averaged $\mathcal{J}\&\mathcal{F}$ treats empty frames (i.e., frames with no ground-truth mask) in a binary manner: if any predicted mask appears, the frame score is set to 0; otherwise it is set to 1. This evaluation method fails to account for the actual size of erroneous masks in empty frames, which can have a significant impact on embodied segmentation tasks. To address this, we propose the Global IoU metric, defined as
$$
\overline{\mathcal{J}}=\frac{\sum_{i=1}^{N}|\mathcal{S}_{i}\cap\mathcal{G}_{i}|}{\sum_{i=1}^{N}|\mathcal{S}_{i}\cup\mathcal{G}_{i}|}, \tag{3}
$$
where $N$ is the total number of video frames, $\mathcal{S}_{i}$ denotes the predicted segmentation mask for frame $i$ , and $\mathcal{G}_{i}$ denotes the ground truth mask for frame $i$ . For the boundary accuracy metric $\overline{\mathcal{F}}$ , we compute the average only over non-empty frames. The mean of $\overline{\mathcal{J}}$ and $\overline{\mathcal{F}}$ , denoted as $\overline{\mathcal{J}}\&\overline{\mathcal{F}}$ , provides an accurate reflection of segmentation quality, especially in egocentric videos where the target object appears in relatively few frames.
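Eq. 3 can be sketched as follows (masks represented as per-frame sets of foreground pixel coordinates; the helper name is ours):

```python
def global_iou(pred_masks, gt_masks):
    """Global IoU (Eq. 3): pool intersection and union pixel counts over all
    frames, so spurious predictions on empty frames are penalized by their
    area rather than by a binary per-frame score."""
    inter = sum(len(s & g) for s, g in zip(pred_masks, gt_masks))
    union = sum(len(s | g) for s, g in zip(pred_masks, gt_masks))
    return inter / union if union else 1.0
```

A two-pixel prediction on a frame whose ground truth is one pixel, plus a one-pixel prediction on an empty frame, yields 1/3 rather than the 0-or-1 per-frame scores of the conventional metric.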
<details>
<summary>x4.png Details</summary>

### Visual Description
Four-stage training pipeline diagram. Each stage pairs an example QA with the active components: Stage 1 (Mask Alignment), Stage 2 (Object Understanding), and Stage 3 (Spatial Understanding) use Vision Encoder → Region Encoder → LLM; Stage 4 (Referring Segmentation) adds LoRA fine-tuning of the LLM and a Mask Decoder that produces segmentation masks (e.g., answering "Can you segment the brown teddy bear located on the table in this video?").
</details>
Figure 4: Training paradigm of RynnEC. The model is trained in four progressive stages: 1) Mask Alignment, 2) Object Understanding, 3) Spatial Understanding, and 4) Referring Segmentation.
3.3 RynnEC Architecture
RynnEC consists of three core components: a foundational vision-language model for basic multimodal comprehension, a region-aware encoder for fine-grained object-centric representation learning, and an adaptive mask decoder for video segmentation tasks. Notably, the latter two modules are designed as plug-and-play components with independent parameter spaces, ensuring architectural flexibility and modular extensibility.
Foundational Vision-Language Model. We utilize VideoLLaMA3-Image Zhang et al. [2025a] as the foundational vision-language model for RynnEC, which consists of three main modules: a vision encoder, a projector, and a large language model (LLM). For the vision encoder, we use VL3-SigLIP-NaViT Zhang et al. [2025a], which leverages an any-resolution vision tokenization strategy to flexibly encode images of varying resolutions. As the LLM, we employ Qwen2.5-1.5B-Instruct Yang et al. [2024] and Qwen2.5-7B-Instruct Yang et al. [2024], enabling scalable trade-offs between performance and computational cost.
Region Encoder. Egocentric videos often feature cluttered scenes with similar objects that are difficult to distinguish using linguistic cues alone. To address this, we introduce a dedicated object encoder for representing specific objects, which facilitates more precise cross-modal alignment during training and enables intuitive, fine-grained user interaction at inference time. Following Yuan et al. [2024, 2025c], we use a simple yet efficient MaskPooling for object tokenization, followed by a two-layer projector that aligns object features with the LLM embedding space. During training, object masks spanning multiple frames of a video are used to obtain accurate representations. At inference, the encoder operates flexibly with either single-frame or multi-frame object masks.
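MaskPooling amounts to averaging the patch features covered by the object mask before projection into the LLM embedding space (a minimal NumPy sketch with illustrative shapes; the real projector is a learned two-layer MLP, omitted here):

```python
import numpy as np

def mask_pool(features, mask):
    """Average the features of patches covered by the object mask.
    features: (H, W, C) patch features; mask: (H, W) boolean mask."""
    return features[mask].mean(axis=0)  # -> (C,)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # toy 4x4 grid of 8-dim patch features
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                # object covers a 2x2 patch region
obj_token = mask_pool(feats, mask)   # one pooled token per object
```

Multi-frame masks simply extend the pooled region across frames before averaging.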
Mask Decoder. Accurate object localization is critical for egocentric video understanding. To incorporate robust visual grounding capabilities without degrading the model’s pretrained performance, we fine-tune the LLM with LoRA. Our mask decoder is based on the architecture of SAM2 Ravi et al. [2024], which has demonstrated strong generalization and rich prior knowledge in purely visual segmentation tasks. Given a video and an instruction, we adopt a special [SEG] token to trigger mask generation for the corresponding visual region. To facilitate this process, an additional linear layer aligns the [SEG] token with SAM2’s feature space.
3.4 Training and Inference
As illustrated in Fig. 4, RynnEC is trained using a progressive four-stage pipeline: 1) Mask Alignment, 2) Object Understanding, 3) Spatial Understanding, and 4) Referring Segmentation. The first three stages are designed to incrementally enhance fine-grained, object-centric understanding, while the final stage focuses on equipping the model with precise object-level segmentation capabilities. This curriculum-based approach ensures gradual integration of visual, spatial, and grounding knowledge without overfitting to a single task. The datasets used in each stage are summarized in Tab. 1. The details of each training stage are as follows:
1) Mask Alignment. The goal of this initial stage is to encourage the model to attend to region-specific tokens rather than relying solely on global visual features. We fine-tune both the region encoder and the LLM on a large-scale object-level captioning dataset, where each caption is explicitly aligned with a specific object mask. This alignment training conditions the model to associate object-centric embeddings with corresponding linguistic descriptions, laying the foundation for localized reasoning in later stages.
2) Object Understanding. In this stage, the focus shifts to enriching the model’s egocentric object knowledge, encompassing attributes such as color, shape, material, size, and functional properties. The region encoder and the LLM are jointly fine-tuned to integrate this object-level information more effectively into the cross-modal embedding space. This stage lays the foundation for spatial understanding.
3) Spatial Understanding. Building on the previous stage, this phase equips the model with spatial reasoning abilities, enabling it to understand and reason about the relative positions and configurations of objects within a scene. We train on the large-scale spatial QA data we generated, together with the previous stage’s data and general VQA, to preserve instruction-following ability.
4) Referring Segmentation. In the final stage, we integrate the Mask Decoder module after the LLM to endow the model with fine-grained referring segmentation capabilities. The LLM is fine-tuned via LoRA to minimize interference with its pretrained reasoning abilities. The training data includes not only segmentation-specific datasets but also samples from earlier stages to mitigate catastrophic forgetting. This multi-task mixture ensures that segmentation performance is improved without sacrificing the model’s object and spatial understanding.
Table 1: Datasets used at four training stages. IM and OM indicate whether the task involves the input mask and output mask, respectively.
| Training Stage | Task | IM | OM | # Samples | Datasets |
| --- | --- | --- | --- | --- | --- |
| Mask Alignment (Stage-1) | General Mask Captioning | ✓ | ✗ | 1.17M | RefCOCO Yu et al. [2016], Mao et al. [2016], VideoRefer-Caption Yuan et al. [2025c], DAM Lian et al. [2025], Osprey-Caption Yuan et al. [2024], MDVP-Data Lin et al. [2024], HC-STVG Tang et al. [2021] |
| | Scene Instance Captioning | ✓ | ✗ | 0.14M | RynnEC-Caption |
| Object Understanding (Stage-2) | Basic Properties QA | ✓ | ✗ | 1.49M | RynnEC-Object |
| | Object-Centric Counting | ✓ | ✗ | 0.25M | RynnEC-Counting |
| Spatial Understanding (Stage-3) | Our Stage-2 | ✓ | ✗ | 0.30M | RynnEC-Object, RynnEC-Counting |
| | Spatial QA | ✓ | ✗ | 0.60M | RynnEC-Spatial (Image), RynnEC-Spatial (Video) |
| | | ✗ | ✗ | 0.54M | VLM-3R-Data Fan et al. [2025] |
| | General VQA | ✗ | ✗ | 0.74M | LLaVA-OV-SI Li et al. [2024a], LLaVA-Video Zhang et al. [2024e], ShareGPT-4o-video Chen et al. [2024b], VideoGPT-plus Maaz et al. [2024], FineVideo Farré et al. [2024], CinePile Rawal et al. [2024], ActivityNet Caba Heilbron et al. [2015], YouCook2 Zhou et al. [2018], LLaVA-SFT Liu et al. [2023] |
| Referring Segmentation (Stage-4) | Our Stage-2 & Stage-3 | ✓ | ✗ | 0.60M | RynnEC-Object, RynnEC-Counting, RynnEC-Spatial |
| | General Segmentation | ✗ | ✓ | 0.32M | ADE20K Zhou et al. [2017], COCOStuff Caesar et al. [2018], Mapillary Neuhold et al. [2017], PACO-LVIS Ramanathan et al. [2023], PASCAL-Part Chen et al. [2014] |
| | Embodied Segmentation | ✗ | ✓ | 0.31M | RynnEC-Segmentation |
| | General VQA | ✗ | ✗ | 0.80M | LLaVA-OV-SI Li et al. [2024a], LLaVA-Video Zhang et al. [2024e], ShareGPT-4o-video Chen et al. [2024b], VideoGPT-plus Maaz et al. [2024], FineVideo Farré et al. [2024], CinePile Rawal et al. [2024], ActivityNet Caba Heilbron et al. [2015], YouCook2 Zhou et al. [2018], LLaVA-SFT Liu et al. [2023] |
4 Experiments
Table 2: Main evaluation results on RynnEC-Bench. We evaluate in two major categories: Object Cognition and Spatial Cognition. DR and SR represent Direct Referring and Situational Referring. PR represents Positional Relationship.
| Model | Overall Mean | Obj. Properties | Seg. (DR) | Seg. (SR) | Obj. Mean | Ego (His.) | Ego (Pres.) | Ego (Fut.) | Size | Dis. | PR | Spa. Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Generalist MLLMs | | | | | | | | | | | | |
| GPT-4o OpenAI et al. [2024] | 28.3 | 41.1 | — | — | 33.9 | 13.4 | 22.8 | 6.0 | 24.3 | 16.7 | 36.1 | 22.2 |
| GPT-4.1 OpenAI et al. [2024] | 33.5 | 45.9 | — | — | 37.8 | 17.2 | 27.6 | 6.1 | 35.9 | 30.4 | 45.7 | 28.8 |
| Seed1.5-VL Guo et al. [2025] | 34.7 | 52.1 | — | — | 42.8 | 8.2 | 27.7 | 4.3 | 32.9 | 19.1 | 27.9 | 26.1 |
| Gemini-2.5 Pro Comanici et al. [2025] | 45.5 | 64.0 | — | — | 52.7 | 9.3 | 36.7 | 8.1 | 47.0 | 29.9 | 69.3 | 37.8 |
| Open-source Generalist MLLMs | | | | | | | | | | | | |
| VideoLLaMA3-7B Zhang et al. [2025a] | 27.3 | 36.7 | — | — | 30.2 | 5.1 | 26.8 | 1.2 | 30.0 | 19.0 | 34.9 | 24.1 |
| InternVL3-78B Zhu et al. [2025] | 29.0 | 45.3 | — | — | 37.3 | 9.0 | 31.8 | 2.2 | 10.9 | 30.9 | 26.0 | 20.0 |
| Qwen2.5-VL-72B Bai et al. [2025] | 36.4 | 54.2 | — | — | 44.7 | 11.3 | 24.8 | 7.2 | 27.2 | 22.9 | 83.7 | 27.4 |
| Open-source Object-Level MLLMs | | | | | | | | | | | | |
| DAM-3B Lian et al. [2025] | 15.6 | 22.2 | — | — | 18.3 | 2.8 | 14.1 | 1.3 | 28.7 | 6.1 | 18.3 | 12.6 |
| VideoRefer-VL3-7B Yuan et al. [2025c] | 32.9 | 44.1 | — | — | 36.3 | 5.8 | 29.0 | 6.1 | 38.1 | 30.7 | 28.8 | 29.3 |
| Referring Video Object Segmentation MLLMs | | | | | | | | | | | | |
| Sa2VA-4B Yuan et al. [2025a] | 4.9 | 5.9 | 35.3 | 14.8 | 9.4 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| VideoGlaMM-4B Munasinghe et al. [2025] | 9.0 | 16.4 | 5.8 | 4.2 | 14.4 | 4.1 | 4.7 | 1.4 | 0.8 | 0.0 | 0.3 | 3.2 |
| RGA3-7B Wang et al. [2025] | 10.5 | 15.2 | 32.8 | 23.4 | 17.5 | 0.0 | 5.5 | 6.1 | 1.2 | 0.9 | 0.0 | 3.0 |
| Open-source Embodied MLLMs | | | | | | | | | | | | |
| RoboBrain-2.0-32B Team et al. [2025] | 24.2 | 25.1 | — | — | 20.7 | 8.8 | 34.1 | 0.2 | 37.2 | 30.4 | 3.6 | 28.0 |
| RynnEC-2B | 54.4 | 59.3 | 46.2 | 36.9 | 56.3 | 30.1 | 47.2 | 23.8 | 67.4 | 31.2 | 85.8 | 52.3 |
| RynnEC-7B | 56.2 | 61.4 | 45.3 | 36.1 | 57.8 | 40.9 | 50.2 | 22.3 | 67.1 | 39.2 | 89.7 | 54.5 |
4.1 Implementation Details
4.1.1 Training
Here we briefly introduce the implementation details of each training stage. For all stages, we adopt a cosine learning rate scheduler with a warm-up ratio of 0.03. The maximum token length is set to 16384, while the maximum length for vision tokens is 8192. In Stage 1, both the vision encoder and the LLM are initialized with pretrained weights from VideoLLaMA3-Image. During this stage, we train the LLM, the projector, and the region encoder with learning rates of $1\times 10^{-5}$ , $1\times 10^{-5}$ , and $4\times 10^{-5}$ , respectively. In Stages 2 and 3, the learning rates for the LLM, projector, and region encoder are adjusted to $4\times 10^{-5}$ , $1\times 10^{-5}$ , and $1\times 10^{-5}$ , respectively. In the final stage, the LLM is fine-tuned with LoRA using the same learning rates as in Stage 3, and the mask decoder is trained with a learning rate of $4\times 10^{-5}$ .
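The warm-up and cosine decay can be sketched as follows (a minimal Python sketch; `lr_at` is our name, and decay to zero is an assumption):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_ratio=0.03):
    """Cosine learning-rate schedule with linear warm-up over the first
    warmup_ratio fraction of steps; decays to zero thereafter (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps          # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay
```

With `total_steps=1000` and `base_lr=4e-5`, the rate ramps up over the first 30 steps, peaks at `4e-5`, and decays smoothly toward zero.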
4.1.2 Evaluation
We present a comprehensive evaluation of five MLLM categories on RynnEC-Bench, including both general-purpose models and those fine-tuned for region-level understanding and segmentation. For models that do not accept direct region-based inputs, we highlight target objects with bounding boxes drawn in the video; multiple objects are distinguished by boxes of different colors, which are referenced in the question prompt. We observe that general-purpose MLLMs are incapable of localizing objects in videos; thus, only specialist models fine-tuned for this ability are evaluated on the RynnEC-Bench segmentation subset. To ensure a consistent evaluation protocol, videos are sampled at 1 fps up to a maximum of 30 frames. If 1-fps sampling exceeds this limit, all frames containing the target object are kept, and the remaining budget is filled via uniform sampling from the rest of the video.
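This sampling protocol can be sketched as follows (a minimal Python sketch; the helper and its index conventions are ours, and real timestamps come from the video container):

```python
def sample_frames(n_frames, fps, target_frames, max_frames=30):
    """Sample at 1 fps; if over budget, keep every frame containing the
    target and fill the remainder by (approximately) uniform sampling."""
    candidates = list(range(0, n_frames, fps))   # 1-fps frame indices
    if len(candidates) <= max_frames:
        return candidates
    kept = [f for f in candidates if f in target_frames]
    rest = [f for f in candidates if f not in target_frames]
    budget = max(0, max_frames - len(kept))
    stride = max(1, len(rest) // max(1, budget))
    filler = rest[::stride][:budget]             # uniform over the rest
    return sorted(kept + filler)
```

A 100-second video at 30 fps yields 100 candidate frames; if the target appears in 5 of them, those 5 are kept and 25 more are drawn uniformly from the remaining 95.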
<details>
<summary>x5.png Details</summary>

### Visual Description
Two radar charts comparing Gemini-2.5-Pro, Qwen2.5-VL-72B, VideoRefer-VL3-7B, RoboBrain-2.0-32B, RGA3-7B, and RynnEC-7B (ours). (a) Object cognition axes: situational seg., direct seg., counting, size, surface, function, position, state, shape, material, color, category. (b) Spatial cognition axes: trajectory review, relative position, absolute position, object distance, object size, object height, spatial imagery, movement imagery, egocentric distance, egocentric direction. RynnEC-7B is the most balanced performer across both charts; counting and movement imagery are weak points for all models, and RoboBrain-2.0-32B consistently lags.
</details>
Figure 5: More granular assessments of object cognition and spatial cognition. We compare the best-performing MLLM from each category with our RynnEC-7B.
4.2 Embodied Cognition Evaluation
4.2.1 Main Results
Tab. 2 presents the evaluation results of our RynnEC model and five categories of related MLLMs on RynnEC-Bench. Although RynnEC contains only 7B parameters, it demonstrates robust embodied cognitive abilities, outperforming even the most advanced proprietary model, Gemini-2.5 Pro Comanici et al. [2025], by 10.7 points. Moreover, RynnEC achieves balanced and superior performance across tasks: it scores 61.4 on object cognition, with the ability to both understand and segment objects, and 54.5 on spatial cognition, 44.2% higher than Gemini-2.5 Pro. To support resource-constrained settings, we also present a 2B-parameter RynnEC that delivers markedly lower inference latency at near-parity performance (a drop of less than 2 percentage points), enabling on-device deployment for embodied applications. The following sections detail the performance of each type of MLLM on RynnEC-Bench.
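The quoted margins follow from simple relative-improvement arithmetic. A quick sanity-check sketch using only the scores stated in this section (RynnEC's overall score is inferred here as 45.5 + 10.7 = 56.2 from the stated margin; the spatial scores are those given above):

```python
def point_gap(a: float, b: float) -> float:
    """Absolute score difference in points."""
    return round(a - b, 1)

def relative_gain(a: float, b: float) -> float:
    """Percentage by which score a exceeds score b."""
    return round(100.0 * (a - b) / b, 1)

# Scores stated in Sec. 4.2.1: overall (inferred) and spatial cognition.
rynnec_overall, gemini_overall = 56.2, 45.5
rynnec_spatial, gemini_spatial = 54.5, 37.8

print(point_gap(rynnec_overall, gemini_overall))      # 10.7
print(relative_gain(rynnec_spatial, gemini_spatial))  # 44.2
```

The same helper reproduces the 25% margin of Gemini-2.5 Pro over the best open-source generalist model (45.5 vs. 36.4).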
Proprietary Generalist MLLMs
Among the four leading proprietary generalist MLLMs evaluated, Gemini-2.5 Pro establishes a clear lead with an overall score of 45.5. This represents a substantial performance margin of 25% over the best open-source generalist MLLM and 38.3% over the premier open-source object-level MLLM. Even more notably, it achieves a remarkable score of 37.8 in the notoriously difficult domain of spatial cognition. This finding provides compelling evidence that spatial awareness can emerge as a byproduct of extensive training on video comprehension tasks.
Open-source Generalist MLLMs
Qwen2.5-VL-72B Bai et al. [2025] exhibits outstanding performance, achieving a score of 36.4 and surpassing GPT-4.1 OpenAI et al. [2024]. This suggests that, in specialized capabilities such as embodied cognition, the gap between open-source and proprietary MLLMs has been significantly narrowed. Furthermore, we observe that Qwen2.5-VL and InternVL3 Zhu et al. [2025] demonstrate superior performance in positional relationship (PR) and distance perception tasks, respectively, even outperforming Gemini-2.5 Pro. Such pronounced differences in various aspects of spatial cognition may be attributed to the distribution of training data.
Open-source Object-Level MLLMs
These MLLMs are capable of accepting region masks as input, enabling more direct localization of target objects and facilitating finer-grained object perception. VideoRefer-VL3-7B Yuan et al. [2025c] is a model fine-tuned from the base model VideoLLaMA3-7B Zhang et al. [2025a]. As shown in Tab. 2, VideoRefer-VL3-7B consistently outperforms VideoLLaMA3-7B in both object cognition and spatial cognition tasks. This demonstrates that, in embodied scenarios, integrating mask understanding within the model is superior to explicit visual prompting.
Referring Video Object Segmentation MLLMs
Recently, several studies have applied MLLMs to object segmentation tasks while retaining the original multimodal understanding capabilities of MLLMs. However, the best-performing model, RGA3-7B Wang et al. [2025], achieves only 15.2 points on the object properties task. Although these MLLMs can still address some general video understanding tasks, their task generalization ability is significantly diminished following segmentation training. In contrast, our RynnEC model, which is specifically designed for embodied scenarios, maintains strong object and spatial understanding capabilities even after segmentation training.
Open-source Embodied MLLMs
With the growing demand for highly generalizable cognitive abilities in embodied intelligence, a number of studies have begun to develop MLLMs specifically tailored for embodied scenarios. A representative model is RoboBrain-2.0 Team et al. [2025], which achieves only 24.2, worse even than general-purpose video models such as VideoLLaMA3-7B. There are two primary reasons for this: (1) Loss of object cognition: embodied MLLMs typically emphasize spatial perception and task planning, but tend to overlook detailed object understanding. (2) Lack of fine-grained perceptual capability: in egocentric videos, RoboBrain-2.0 shows limited ability to interpret region-level features.
4.2.2 Object Cognition
Fig. 5 (a) presents a more comprehensive, multi-dimensional evaluation of RynnEC’s object property cognition. Since most object property cognition abilities are encompassed by general video understanding skills, Gemini-2.5-Pro performs strongly across competencies. However, because embodied MLLMs must typically run on edge devices, the inference latency of such large-scale models becomes a bottleneck. With only 7B parameters, RynnEC achieves object property cognition comparable to Gemini-2.5-Pro in most categories; notably, for attributes such as surface details, object state, and object shape, even RynnEC-2B surpasses all other MLLMs. Moreover, most MLLMs lack video object segmentation capabilities, whereas dedicated segmentation MLLMs often sacrifice understanding ability. RynnEC, while maintaining strong comprehension, achieves 30.9% and 57.7% improvements over state-of-the-art segmentation MLLMs on direct referring and situational referring object segmentation tasks, respectively.
4.2.3 Spatial Cognition
Fig. 5 (b) demonstrates RynnEC’s spatial cognition capabilities through more fine-grained tasks. As spatial abilities have not been formally defined or systematically explored in previous work, different MLLMs only exhibit strengths in a limited set of specific skills. Overall, spatial cognition abilities such as Spatial Imagery, Movement Imagery, and Trajectory Review are typically absent in prior MLLMs. In contrast, RynnEC possesses a more comprehensive set of spatial abilities, which can facilitate embodied agents in developing spatial awareness within complex environments.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of VideoLLaMA3-7B and RynnEC-7B
### Overview
This bar chart compares the performance scores of two models, VideoLLaMA3-7B and RynnEC-7B, across several categories related to spatial reasoning and scene understanding. The chart displays the average score for each model on each category.
### Components/Axes
* **X-axis:** Categories - "Abs. Dist.", "Route Plan", "Rel. Dir. Hard", "Rel. Dist.", "Rel. Dir. Medium", "Rel. Dir. Easy", "Obj. Count", "Obj. Size", "Room Size", "Appear. Order".
* **Y-axis:** Score - ranging from 0 to 70.
* **Models:**
* VideoLLaMA3-7B (represented by a light purple color) with an average score of 35.8.
* RynnEC-7B (represented by a dark purple color) with an average score of 45.8.
* **Legend:** Located at the top of the chart, clearly indicating the color correspondence for each model.
### Detailed Analysis
The chart consists of 10 pairs of bars, one for each model per category. The values are as follows:
* **Abs. Dist. (Absolute Distance):** VideoLLaMA3-7B: 23.5, RynnEC-7B: 25.4
* **Route Plan:** VideoLLaMA3-7B: 25.4, RynnEC-7B: 32.0
* **Rel. Dir. Hard (Relative Direction - Hard):** VideoLLaMA3-7B: 38.7, RynnEC-7B: 30.0
* **Rel. Dist. (Relative Distance):** VideoLLaMA3-7B: 42.9, RynnEC-7B: 39.4
* **Rel. Dir. Medium (Relative Direction - Medium):** VideoLLaMA3-7B: 44.2, RynnEC-7B: 46.3
* **Rel. Dir. Easy (Relative Direction - Easy):** VideoLLaMA3-7B: 51.9, RynnEC-7B: 45.2
* **Obj. Count (Object Count):** VideoLLaMA3-7B: 53.5, RynnEC-7B: 41.9
* **Obj. Size (Object Size):** VideoLLaMA3-7B: 58.5, RynnEC-7B: 42.2
* **Room Size:** VideoLLaMA3-7B: 54.9, RynnEC-7B: 27.1
* **Appear. Order (Appearance Order):** VideoLLaMA3-7B: 42.7, RynnEC-7B: 31.4
**Trends:**
* According to the legend averages (45.8 vs. 35.8), RynnEC-7B outperforms VideoLLaMA3-7B overall.
* The per-category gap between the two models is most pronounced in "Obj. Count", "Obj. Size", and "Room Size".
* "Rel. Dir. Hard" runs against the overall trend.
### Key Observations
* RynnEC-7B's higher average score indicates superior overall spatial performance.
* The largest single-category differences ("Obj. Size", "Room Size") suggest the two models differ most in estimating object and room scale.
* "Rel. Dir. Hard" is an outlier relative to the easier relative-direction settings.
### Interpretation
The data suggests that RynnEC-7B is the more robust model for spatial reasoning and scene understanding, as evidenced by its higher average score. The models are evaluated on their ability to understand spatial relationships, object properties, and scene layouts; the sizable differences on "Obj. Count", "Obj. Size", and "Room Size" indicate that the two models diverge most on object characteristics and spatial scale, while the "Rel. Dir. Hard" outlier points to differing strengths in complex directional reasoning. The average scores in the legend (35.8 and 45.8) confirm the overall trend of RynnEC-7B outperforming its base model, and the chart provides a clear quantitative comparison of the two models' strengths and weaknesses across aspects of spatial understanding.
</details>
| Models | VSI-Bench |
| --- | --- |
| Qwen2.5-VL-7B Bai et al. [2025] | 35.9 |
| InternVL3-8B Zhu et al. [2025] | 42.1 |
| GPT-4o OpenAI et al. [2024] | 43.6 |
| Magma-8B Yang et al. [2025b] | 12.7 |
| Cosmos-Reason1-7B Azzolini et al. [2025] | 25.6 |
| VeBrain-8B Luo et al. [2025] | 26.3 |
| RoboBrain-7B-1.0 Ji et al. [2025] | 31.1 |
| RoboBrain-7B-2.0 Team et al. [2025] | 36.1 |
| M2-Reasoning-7B AI et al. [2025] | 42.3 |
| ViLaSR Wu et al. [2025b] | 45.4 |
| RynnEC-7B | 45.8 |
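The overall comparison in the table can be reproduced mechanically; a minimal sketch that ranks the listed models by their VSI-Bench score:

```python
# VSI-Bench overall scores as listed in the table above.
vsi_scores = {
    "Qwen2.5-VL-7B": 35.9,
    "InternVL3-8B": 42.1,
    "GPT-4o": 43.6,
    "Magma-8B": 12.7,
    "Cosmos-Reason1-7B": 25.6,
    "VeBrain-8B": 26.3,
    "RoboBrain-7B-1.0": 31.1,
    "RoboBrain-7B-2.0": 36.1,
    "M2-Reasoning-7B": 42.3,
    "ViLaSR": 45.4,
    "RynnEC-7B": 45.8,
}

# Sort models from best to worst by score.
ranking = sorted(vsi_scores, key=vsi_scores.get, reverse=True)
print(ranking[0], vsi_scores[ranking[0]])  # RynnEC-7B 45.8
```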
Figure 6: Performance on VSI-Bench Yang et al. [2025c]. Left: per-subtask comparison with VideoLLaMA3, the base model of our RynnEC. Right: overall comparison with generalist MLLMs and embodied MLLMs without explicit 3D encoding.
4.3 Generalization and Scalability
To investigate the generalizability of RynnEC, we conduct experiments on VSI-Bench Yang et al. [2025c], a purely textual spatial intelligence benchmark. As shown in Fig. 6, RynnEC-7B consistently surpasses VideoLLaMA3-7B across almost all capability dimensions. Notably, RynnEC is trained with a mask-centric spatial awareness paradigm, whereas all tasks in VSI-Bench involve purely textual spatial reasoning. This demonstrates that spatial awareness need not be constrained by the modality of representation, and spatial reasoning abilities can be effectively transferred across modalities. Further observation reveals substantial performance gains of RynnEC on the Route Planning task, despite this task not being included during training. This indicates that the navigation performance of embodied agents is currently constrained by foundational spatial perception capabilities, such as the understanding of direction, distance, and spatial relationships. Only with robust foundational spatial cognition can large embodied models achieve superior performance in high-level planning and decision-making tasks. Compared to other embodied MLLMs of comparable size, RynnEC-7B also achieves a leading score of 45.8.
Certain tasks, such as object segmentation and movement imagery, remain significant challenges for RynnEC. We hypothesize that the suboptimal performance on these tasks stems primarily from insufficient training data. To validate this, we conduct an empirical analysis of data scalability across task categories: as the data volume increases progressively from 20% to 100%, the model’s performance on all tasks improves steadily. This observation motivates further expansion of the dataset to enhance RynnEC’s spatial reasoning capabilities. Notably, however, the marginal gains diminish as data volume grows, indicating diminishing returns to scale; investigating strategies to enhance data diversity so as to sustain this scaling behavior remains a critical open challenge for future research.
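The diminishing marginal gains described above can be made concrete. A purely illustrative sketch (the scores below are hypothetical placeholders, not measurements from the paper) that computes per-step gains as training data grows from 20% to 100%:

```python
# Hypothetical task scores at increasing fractions of training data.
# Illustrative numbers only; see the paper's scaling analysis for real values.
data_fraction = [0.2, 0.4, 0.6, 0.8, 1.0]
scores = [30.0, 38.0, 42.5, 45.0, 46.2]

# Marginal gain per additional 20% of data.
marginal_gains = [b - a for a, b in zip(scores, scores[1:])]
print([round(g, 1) for g in marginal_gains])  # [8.0, 4.5, 2.5, 1.2]

# Each step adds less than the previous one: diminishing returns to scale.
assert all(g2 < g1 for g1, g2 in zip(marginal_gains, marginal_gains[1:]))
```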
<details>
<summary>x7.png Details</summary>

### Visual Description
\n
## Task Sequence: Robot Navigation and Object Manipulation
### Overview
The image presents a sequence of eight screenshots depicting a robot performing a series of tasks in an indoor environment. Each screenshot illustrates a step in the task, accompanied by questions and action descriptions related to robot perception and control. The tasks involve navigating to specific locations, identifying objects, and manipulating them.
### Components/Axes
Each screenshot contains the following elements:
* **Image:** A first-person view from the robot's perspective.
* **Task Title:** A brief description of the overall task. (e.g., "Task 1: Put the basketball in the white box beside the tennis racket.")
* **Step Number:** Indicates the current step in the task sequence. (e.g., "Step 1: Walk up to the basketball and pick it up")
* **Question 1 (Q):** A question about the environment, requiring the robot to identify an object. (e.g., "Q: Where is the basketball?")
* **Answer 1 (A):** The correct answer to the question. (e.g., "A: The basketball is <object>.")
* **Question 2 (Q):** A question about the location of an object relative to the robot. (e.g., "Q: Is <object> above or below the table?")
* **Answer 2 (A):** The correct answer to the question. (e.g., "A: <object> is below the table.")
* **Action:** A description of the robot's action for that step. (e.g., "Action: go straight ahead and turn left.")
* **Highlighting:** Red arrows indicate the path the robot takes, and blue outlines highlight the target object.
### Detailed Analysis or Content Details
**Task 1: Put the basketball in the white box beside the tennis racket.**
* **Step 1:** Robot is facing a basketball. Q: Where is the basketball? A: The basketball is <object>. Q: Is <object> above or below the table? A: <object> is below the table. Action: go straight ahead and turn left.
* **Step 2:** Robot is approaching the basketball. Q: Where is the white box beside the tennis racket? A: The white box beside the tennis racket is directly behind me. Action: Turn around.
* **Step 3:** Robot is walking towards the white box. Q: How far is the <object> from me? A: <object> is 1.2 meters away from me. Action: Walk straight ahead for 1 meter.
* **Step 4:** Robot is placing the basketball in the white box. Q: How wide is <object>? I am holding a basketball. A: The width of <object> is approximately 0.22 meters; it can hold the basketball. Action: Carefully place the basketball in the white box.
**Task 2: Reduce the number of plates on the dining table to five, and place the removed plates to the left of the laptop.**
* **Step 1:** Robot is facing the dining table. Q: Where is the dining table? A: The dining table is <object>. Q: How many plates are there on the table? A: six. Action: Turn right to view the entire dining table.
* **Step 2:** Robot is looking at the dining table with plates. Q: How many plates are there on the table? A: six. Action: Since we need to leave 5 plates behind, we need to pick up one plate.
* **Step 3:** Robot is approaching a plate. Q: Which plate is the closest to me? A: <object> is the closest to me. Action: Pick up <object>.
* **Step 4:** Robot is turning towards the laptop. Q: In what direction is the laptop located relative to me, and how far away is it? A: The laptop is at my one o'clock position and 4.5 meters away from me. Action: Turn right by 90 degrees and then go straight for 4 meters.
### Key Observations
* The tasks involve a combination of navigation, object recognition, and manipulation.
* The questions and answers suggest a system for the robot to reason about its environment and actions.
* The use of relative positioning ("above," "below," "directly behind") indicates the robot's ability to understand spatial relationships.
* The action descriptions are simple and direct, suggesting a low-level control interface.
* The highlighting (arrows and outlines) provides visual guidance for the robot's actions.
### Interpretation
The image demonstrates a robotic system performing everyday tasks in a structured environment by combining visual perception, spatial reasoning, and explicit actions. The question-answer pairs suggest a knowledge representation that lets the robot reason about the relationships between objects and its own actions, and the step sequence highlights the core challenges of robotic task execution: accurate object recognition, precise navigation, and careful manipulation. The robot's ability to answer questions about its surroundings reflects the situational awareness crucial for task completion, while the highlighting and simple action descriptions keep the control interface legible. The tasks themselves resemble common household chores, suggesting potential domestic applications: the system likely uses visual perception to identify objects and estimate distances, and a control policy to execute actions autonomously in a human-like manner.
</details>
Figure 7: An example of RynnEC assisting robots in performing long-horizon tasks. The robot accomplishes the two designated tasks within the RoboTHOR simulator Deitke et al. [2020]. RynnEC facilitates fine-grained environmental cognition throughout task execution.
4.4 Embodied Application
Recently, some works Cheng et al. [2024b], Xiong et al. [2024] have leveraged MLLMs as the "brain" to assist robots in planning tasks, perceiving environments, and making decisions. However, current MLLMs lack key capabilities such as spatial awareness, fine-grained perception, and instance localization, which restricts these applications to limited and simple tasks. As illustrated in Fig. 7, RynnEC demonstrates the potential to assist robots in accomplishing long-horizon tasks within complex environments. From two real-time tasks performed by the robot equipped with RynnEC, we observe the following roles that RynnEC plays in task execution: (1) Fine-grained object localization and understanding enable robots to more quickly identify target objects and assess their states; (2) Direction and distance perception of targets improves navigation efficiency and precision; (3) Spatial scale estimation empowers robots to perform more delicate manipulations; (4) Counting ability facilitates the completion of tasks requiring mathematical reasoning. It is important to emphasize that the role of RynnEC in embodied tasks is far from limited to these examples. We hope that more researchers will integrate RynnEC models into robotic systems across a wide range of tasks, thereby advancing embodied intelligence toward more valuable real-world applications.
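The four roles above suggest how a cognition model like RynnEC could slot into a robot's control loop. The sketch below is a hypothetical integration: `RynnECClient`, `Observation`, `query`, and `plan_step` are invented names for illustration with canned answers standing in for model inference; they are not the released API.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Egocentric video frames plus an optional region mask."""
    frames: list = field(default_factory=list)
    mask: object = None

class RynnECClient:
    """Hypothetical wrapper around a region-aware video MLLM."""

    def query(self, obs: Observation, question: str) -> str:
        # A real system would run MLLM inference on the frames/mask;
        # canned answers (taken from Fig. 7) are used here for illustration.
        canned = {
            "Where is the basketball?": "below the table",
            "How far is the target?": "1.2 meters",
            "How many plates are on the table?": "six",
        }
        return canned.get(question, "unknown")

def plan_step(brain: RynnECClient, obs: Observation) -> str:
    """Use a fine-grained cognition query to pick the next action."""
    distance = brain.query(obs, "How far is the target?")
    if distance == "1.2 meters":
        return "walk straight ahead for 1 meter"
    return "search for the target"

brain = RynnECClient()
action = plan_step(brain, Observation())
print(action)  # -> walk straight ahead for 1 meter
```

In a deployed system, the distance and counting queries would feed a navigation planner and a manipulation policy, mirroring roles (2) and (4) above.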
5 Conclusion and Future Works
In this paper, we introduce RynnEC, a video MLLM for embodied cognition. Through the architectural design of a region encoder and a mask decoder, RynnEC achieves flexible, fine-grained visual interaction, and it demonstrates robust object and spatial cognition at a compact size. To address the scarcity of annotated scene data, we employ a data generation pipeline that relies solely on RGB videos. Furthermore, to fill the gap in fine-grained embodied cognition benchmarks, we propose RynnEC-Bench, which covers 22 categories of object and spatial cognitive abilities. During training, RynnEC progressively integrates diverse skills through a four-stage capability injection process. Importantly, we advocate that fine-grained, video-based visual understanding is key to achieving generalizable cognition in the physical world. RynnEC will enable robots to accomplish more precise cognitive tasks, thereby advancing the practical development of embodied intelligence.
We regard RynnEC as a foundational step toward developing a general embodied intelligence model. Looking ahead, we plan to further advance RynnEC along two primary directions.
- Enhancing Reasoning Capabilities: Robust visual reasoning is essential for solving any complex embodied task. An important research direction is how to effectively integrate RynnEC’s diverse abilities to perform joint reasoning, thereby enabling the resolution of higher-level embodied problems.
- Unified Perception and Planning Framework: Recent studies Team et al. [2025] have started to explore training unified embodied intelligence models that combine perception and planning. However, these approaches are limited in their ability to facilitate fine-grained, video-based visual interactions. In the future, we aim to endow RynnEC with more flexible planning abilities and integrate it with VLA models to form a closed-loop embodied system.
References
- AI et al. [2025] Inclusion AI, Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, et al. M2-reasoning: Empowering mllms with unified general and spatial reasoning. arXiv preprint arXiv:2507.08306, 2025.
- Ataallah et al. [2024] Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.
- Azzolini et al. [2025] Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. https://arxiv.org/abs/2502.13923.
- Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015.
- Caesar et al. [2018] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.
- Chen et al. [2024a] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024a.
- Chen et al. [2024b] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems, 37:19472–19495, 2024b.
- Chen et al. [2014] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1971–1978, 2014.
- Cheng et al. [2024a] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision language models. arXiv preprint arXiv:2406.01584, 2024a.
- Cheng et al. [2024b] Kai Cheng, Zhengyuan Li, Xingpeng Sun, Byung-Cheol Min, Amrit Singh Bedi, and Aniket Bera. Efficienteqa: An efficient approach for open vocabulary embodied question answering. arXiv preprint arXiv:2410.20263, 2024b.
- Comanici et al. [2025] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- Dang et al. [2025] Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, and Lidong Bing. Ecbench: Can multi-modal foundation models understand the egocentric world? a holistic embodied cognition benchmark. arXiv preprint arXiv:2501.05031, 2025.
- Deitke et al. [2020] Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. Robothor: An open simulation-to-real embodied ai platform, 2020.
- Doveh et al. [2025] Sivan Doveh, Shaked Perek, M Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. Towards multimodal in-context learning for vision and language models. In European Conference on Computer Vision, pages 250–267. Springer, 2025.
- Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- Fan et al. [2025] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025.
- Farré et al. [2024] Miquel Farré, Andi Marafioti, Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. Finevideo. https://huggingface.co/datasets/HuggingFaceFV/finevideo, 2024.
- Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
- Guo et al. [2025] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.
- Han et al. [2025a] Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, et al. Multimodal fusion and vision-language models: A survey for robot vision. arXiv preprint arXiv:2504.02477, 2025a.
- Han et al. [2025b] Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, et al. Multimodal fusion and vision-language models: A survey for robot vision. arXiv preprint arXiv:2504.02477, 2025b.
- Hong et al. [2025] Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, and Lingni Ma. Egolm: Multi-modal language model of egocentric motions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5344–5354, 2025.
- Hou et al. [2025] Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, and Lihua Zhang. Bloomscene: Lightweight structured 3d gaussian splatting for crossmodal scene generation. arXiv preprint arXiv:2501.10462, 2025.
- Ji et al. [2025] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025.
- Jin et al. [2024] Shiyu Jin, Jinxuan Xu, Yutian Lei, and Liangjun Zhang. Reasoning grasping via multimodal large language model. arXiv preprint arXiv:2402.06798, 2024.
- Li et al. [2024a] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
- Li et al. [2024b] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024b.
- Li et al. [2024c] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024c.
- Li et al. [2025a] Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, and Yu Kong. Industryeqa: Pushing the frontiers of embodied question answering in industrial scenarios. arXiv preprint arXiv:2505.20640, 2025a.
- Li et al. [2025b] Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765, 2025b.
- Lian et al. [2025] Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning. arXiv preprint arXiv:2504.16072, 2025.
- Lin et al. [2023] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- Lin et al. [2024] Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024.
- Lin et al. [2025] Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025.
- Liu et al. [2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
- Luo et al. [2025] Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025.
- Lyu et al. [2024] Ruiyuan Lyu, Jingli Lin, Tai Wang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, and Jiangmiao Pang. Mmscan: A multi-modal 3d scene dataset with hierarchical grounded language annotations. Advances in Neural Information Processing Systems, 37:50898–50924, 2024.
- Maaz et al. [2023] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Maaz et al. [2024] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding. arXiv preprint arXiv:2406.09418, 2024.
- Majumdar et al. [2024] Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024.
- Mao et al. [2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
- Munasinghe et al. [2025] Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19036–19046, 2025.
- Murai et al. [2025] Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025.
- Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE international conference on computer vision, pages 4990–4999, 2017.
- OpenAI et al. [2024] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, and Alan Hayes. Gpt-4o system card, 2024.
- Ouyang et al. [2025] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025. https://arxiv.org/abs/2504.01805.
- Pasqualotto and Proulx [2012] Achille Pasqualotto and Michael J Proulx. The role of visual experience for the neural basis of spatial cognition. Neuroscience & Biobehavioral Reviews, 36(4):1179–1187, 2012.
- Qi et al. [2025] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.
- Ramanathan et al. [2023] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023.
- Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- Rawal et al. [2024] Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024.
- Ray et al. [2024] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.
- Ren et al. [2024a] Allen Z Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, and Dorsa Sadigh. Explore until confident: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941, 2024a.
- Ren et al. [2024b] Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024b.
- Tang et al. [2021] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021.
- Team et al. [2025] BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025.
- Team et al. [2024] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, and Damien Vincent. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- Wang et al. [2025] Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, and Stratis Gavves. Object-centric video question answering with visual grounding and referring. arXiv preprint arXiv:2507.19599, 2025.
- Wang et al. [2024] Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19757–19767, 2024.
- Wu et al. [2025a] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025a.
- Wu et al. [2023] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu. Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), pages 2247–2256. IEEE, 2023.
- Wu et al. [2025b] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025b.
- Xiong et al. [2024] Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jiaming Liu, Ruiping Wang, and Hao Dong. Autonomous interactive correction mllm for robust robotic manipulation. In 8th Annual Conference on Robot Learning, 2024.
- Xu et al. [2025] Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015, 2025.
- Yan et al. [2024] Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision, pages 98–115. Springer, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- Yang et al. [2025a] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
- Yang et al. [2025b] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025b.
- Yang et al. [2025c] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025c.
- Yeshwanth et al. [2023] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes, 2023.
- Yu et al. [2016] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European conference on computer vision, pages 69–85. Springer, 2016.
- Yuan et al. [2025a] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025a.
- Yuan et al. [2024] Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024.
- Yuan et al. [2025b] Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, and Yueting Zhuang. Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world?, 2025b. https://arxiv.org/abs/2506.05287.
- Yuan et al. [2025c] Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025c.
- Zhang et al. [2025a] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. Videollama 3: Frontier multimodal foundation models for image and video understanding, 2025a. https://arxiv.org/abs/2501.13106.
- Zhang et al. [2024a] Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024a.
- Zhang et al. [2023] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
- Zhang et al. [2025b] Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, and Liqiang Nie. Exo2ego: Exocentric knowledge guided mllm for egocentric video understanding. arXiv preprint arXiv:2503.09143, 2025b.
- Zhang et al. [2025c] Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976, 2025c.
- Zhang et al. [2024b] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024b.
- Zhang et al. [2024c] Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, and Peng Cui. On the out-of-distribution generalization of multimodal large language models. arXiv preprint arXiv:2402.06599, 2024c.
- Zhang et al. [2024d] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024d.
- Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
- Zhou et al. [2018] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Zhu et al. [2025] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, and Jiahao Wang. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. https://arxiv.org/abs/2504.10479.
\beginappendix
A Implementation Details for Data Pipeline
A.1 Instance Segmentation
As described in Section 3.1.1, instance segmentation and tracking in videos require a three-stage collaborative process. The first stage involves the extraction of an object list, which should comprehensively include the names of all objects present in the video scene. After evaluating multiple approaches, we find that directly leveraging Qwen2.5-VL to extract object category names from video frames achieves the highest efficiency and accuracy. Specifically, we uniformly sample 16 frames from each video, dividing them into two groups: even-numbered frames and odd-numbered frames. Each group is then processed independently by Qwen2.5-VL to generate a list of object category names. The prompt used to guide the extraction of the object list is in Tab. 3.
System Prompt: Please analyze the image sequence captured as I move through an indoor environment and provide a concise list of major distinct physical objects that can be detected and segmented in these scenes. You need to pay attention to the following points: (1) Focus on tangible items such as furniture, appliances, and tools. Avoid nouns that denote locations and rooms like "kitchen" or "bedroom". (2) Limit the list to a maximum of 20 objects, and avoid including specific components or parts of these objects. (3) Ensure the list does not have duplicates. Your output must be a series of nouns separated by semicolons.
Table 3: Prompts for object list extraction.
During our experimentation, we observe that Qwen2.5-VL occasionally produces repeated instances of the same object name or phrases sharing the same object name as a prefix. To address this, we apply a post-processing step to remove duplicate and semantically similar phrases from the model outputs, thereby ensuring the diversity and conciseness of the object list. The final object list is obtained by taking the union of the results from the odd-numbered and even-numbered frame groups, yielding a more comprehensive and robust set of detected objects. Furthermore, generic scene-level categories such as "wall" and "floor" are explicitly excluded from the final object list, as they are not considered relevant instances for downstream instance-level tracking and segmentation tasks.
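The grouping and merging logic above can be sketched as follows. This is a minimal illustration of the post-processing step, not the exact implementation; `merge_object_lists` and the exclusion set are assumptions for the sketch.

```python
def merge_object_lists(odd_list, even_list):
    """Union the object names from the odd- and even-frame groups,
    dropping duplicates, phrases that share an already-kept name as a
    prefix (e.g. keep "chair", skip "chair leg"), and scene-level
    categories that are irrelevant for instance-level tracking."""
    excluded = {"wall", "floor", "ceiling"}  # assumed exclusion set
    merged = []
    for name in odd_list + even_list:
        name = name.strip().lower()
        if name in excluded:
            continue
        if any(name == kept or name.startswith(kept + " ") for kept in merged):
            continue
        merged.append(name)
    return merged
```

In practice, each group's raw list would come from a Qwen2.5-VL call on its 8 frames, and the merged result feeds the downstream segmentation and tracking stages.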
A.2 Object QA Generation
We generate three categories of object-related tasks: object caption, object comprehension QA, and referring video object segmentation. The pre-annotation prompts for object caption and object comprehension QA are presented in Tab. 4. Both tasks take as input a set of keyframes in which the target object is highlighted; the only difference lies in the task-specific instruction prompts.
The referring video object segmentation task requires generating unique referring expressions for objects. We aggregate the QAs generated in the previous stage for each object, representing various attributes of the object. Subsequently, Qwen3 utilizes these QAs to generate both direct referring expressions and situational referring expressions. The specific prompt is shown in Tab. 5.
Crop Image Prompt: The above four images show a crop of the object we need to describe.
Bbox Image Prompt: The four images above highlight the target object with a red bounding box and dimmed background.
Task Prompt:
Caption Task: Please provide a detailed description of the specified object, focusing on its color, material, shape, state, position, function, surface detail, and other information. (1) Stick to a narrative format for descriptions, avoiding list-like itemizations. (2) Just output the information you are sure of; if you output the wrong information you will be fired.
Comprehension QA Task: I need you to generate a series of question pairs for me about this object, using <object> to represent the object in the question pairs. You can focus on its category, color, material, shape, state, position, function, surface detail, size, and other information. Output example: Question: What color is the <object>? Answer: Mainly red, with some blue as decoration. Notes: (1) The object in all images is the same; QA should focus solely on it, without referencing specific images. (2) Ask as many questions as needed; the more details, the better. (3) Prioritize reasoning and spatial understanding questions over simple ones. (4) You can ask questions about the target object by associating it with the surrounding objects (e.g., comparison, spatial relationship, functional relationship, quantitative relationship, etc.).
# The text prompts above are interleaved with the image lists and sent to the MLLM:
messages = [{"role": "user", "content":
    [{"type": "text", "text": crop_image_prompt}]
    + crop_image_list
    + [{"type": "text", "text": bbox_image_prompt}]
    + bbox_image_list
    + [{"type": "text", "text": task_prompt}]}]
Table 4: Prompts for object caption and comprehension QA generation. Separate textual instructions are provided for the cropped images and the images highlighting the object via bounding boxes, respectively.
System Prompt: You are analyzing indoor objects. Given a series of QAs about a single object (marked as <object>), use the information to generate two referring expressions that uniquely identify it. The two expressions should be:
- One simple referring expression, using attributes such as category, color, material, spatial location, or function.
- One situational referring expression, involving contextual reasoning and diverse sentence structures.
Input Example: Question: What is the primary function of the <object>? Answer: The <object> is primarily used for holding writing instruments like pens and pencils. (Additional QA pairs continue in a similar fashion—omitted for brevity.)
Output Example: [simple expression] The cylindrical light brown pen holder on the top shelf of the desk. [complex expression] If I finish writing with a pencil, where is the best place to store it?
Table 5: Prompt for object referring expression generation.
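The per-object QA aggregation described above can be sketched as a simple message-assembly step. The helper name and message schema below are illustrative assumptions, not the exact implementation used with Qwen3.

```python
def build_referring_prompt(system_prompt, qa_pairs):
    """Format an object's accumulated QA pairs into the system/user
    messages sent to the LLM for referring-expression generation."""
    qa_text = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs)
    return [
        {"role": "system", "content": system_prompt},  # prompt from Tab. 5
        {"role": "user", "content": qa_text},          # aggregated QAs
    ]
```

The model's reply then contains one simple and one situational referring expression per object, which are parsed out for the segmentation task.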
A.3 Spatial QA Generation
As outlined in Section 3.1.3, we adopt a template-based approach for generating spatial QA. Specifically, we define 14 core spatial abilities and create a total of 30 distinct templates, with each template encompassing at least three different question structures. Some examples of QA templates are provided in Listing 1.
Listing 1: Template examples for Spatial QA generation.
camera_distance_questions = [
    "How far have you walked in total?",
    "What is the total distance you have covered walking?",
    "What is the overall distance you have walked?"
]
closer_to_camera_questions = [
    "Which is closer to you, [A] or [B]?",
    "Between [A] and [B], which one is nearer to you?",
    "Which one is closer to you, [A] or [B]?"
]
closest_to_camera_questions = [
    "Which is closest to you, [A] or [B] or [C]?",
    "Among [A], [B], and [C], which one is nearer to you?",
    "Which of [A], [B], or [C] is closest to you?"
]
future_direction_camera_questions = [
    "After you turn 90 degrees to the left, where will [A] be in relation to you?",
    "If you turn left by 90 degrees, in which direction will [A] be positioned?",
    "Upon making a 90-degree left turn, how will [A] be oriented with respect to you?"
]
future_direction_camera_rotate_questions = [
    "How many degrees clockwise do you need to turn to face the direction of [A]?",
    "To face towards [A], how many degrees should you rotate in a clockwise direction?",
    "What degree of clockwise rotation is necessary for you to face [A]'s direction?"
]
distance_questions_3 = [
    "Which of the three objects, [A], [B], or [C], is closest to you?",
    "Among [A], [B], and [C], which object is nearest to you?",
    "Between [A], [B], and [C], which one is the closest to you?",
]
height_from_ground_questions = [
    "What is the height difference above ground level between [A] and [B]?",
    "How much higher or lower is [A] compared to [B] above the ground?",
    "By what amount does the elevation of [A] differ from that of [B]?"
]
center_distance_questions = [
    "What is the distance between the centers of [A] and [B]?",
    "How far apart are the centers of [A] and [B]?",
    "What is the separation between the central points of [A] and [B]?"
]
tall_choice_questions_3 = [
    "Among the three objects [A], [B], and [C], which one is the tallest?",
    "Which of the three objects [A], [B], and [C] is tallest?",
    "Out of the three objects [A], [B], and [C], which one is the tallest?",
]
above_predicate_questions = [
    "Is [A] above [B]?",
    "Does [A] appear over [B]?",
    "Can you confirm if [A] is positioned above [B]?",
]
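A template bank like the ones above is turned into a concrete question by picking one of its question structures and filling the `[A]`/`[B]`/`[C]` slots with object names. The sketch below illustrates this instantiation step; the function name and the inline template bank are assumptions for the example.

```python
import random

closer_to_camera = [  # abbreviated copy of a template bank from Listing 1
    "Which is closer to you, [A] or [B]?",
    "Between [A] and [B], which one is nearer to you?",
]

def instantiate(template_bank, objects, seed=None):
    """Choose one question structure at random and substitute object
    names into the [A]/[B]/[C] placeholders."""
    rng = random.Random(seed)
    question = rng.choice(template_bank)
    for slot, name in zip(["[A]", "[B]", "[C]"], objects):
        question = question.replace(slot, name)
    return question
```

The ground-truth answer for each instantiated question is computed from the reconstructed 3D geometry rather than generated by a model.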
B Details of RynnEC-Bench Construction
As described in Section 3.2.2, we adjust the object distribution in the object properties understanding evaluation set of RynnEC-Bench based on real-world object category frequencies. The detailed object categorization strategy is presented in Tab. 6. We classify common indoor objects into 12 coarse categories and 119 fine-grained categories. Objects not falling into any of these predefined categories are assigned to an "other" class. A function-centered taxonomy is adopted: objects with similar appearances but distinct functional roles are categorized separately.
To construct this evaluation set, we follow a two-stage process. First, an initial, oversized pool of 20,000 question-answer (QA) pairs is randomly generated without distributional constraints. Following this, we downsample this pool to a target size of 10,000 pairs. The sampling is performed according to the real-world object distribution outlined previously. Specifically, we calculate the target number of samples for each object category by multiplying its frequency in the distribution by the total target size (10,000). The final dataset is then constructed by drawing the calculated number of QA pairs for each category from the initial 20,000-pair pool. This stratified sampling strategy ensures that the final evaluation set’s composition accurately mirrors the specified real-world object frequencies.
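The stratified downsampling described above can be sketched as follows. This is a minimal illustration under assumed data structures (a QA pool of dicts with a `category` field and a frequency table), not the exact construction code.

```python
import random
from collections import defaultdict

def stratified_sample(qa_pool, category_freq, target_size, seed=0):
    """Downsample an oversized QA pool so that per-category counts
    match a real-world frequency distribution."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for qa in qa_pool:
        by_cat[qa["category"]].append(qa)
    sampled = []
    for cat, freq in category_freq.items():
        quota = round(freq * target_size)  # frequency x target size
        pool = by_cat.get(cat, [])
        sampled.extend(rng.sample(pool, min(quota, len(pool))))
    return sampled
```

In the actual pipeline the pool holds 20,000 pairs, the target size is 10,000, and the frequency table is the real-world object distribution of Section 3.2.2.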
| Category | Fine-Grained Classes |
| --- | --- |
| Furniture | Bed, Chair, Sofa, Table, Nightstand, Cabinet, Shelf, Headboard, Wardrobe, Drawer, Wall, Door, Window, Mirror, Hanger, Hook, Handle, Hinge, Railing, Radiator, Light Switch |
| Appliances & Electronics | Outlet, Refrigerator, Washing Machine, Air Conditioner, Monitor, Television, Control Panel, Fan, Speaker, Lamp, Charger, Router, Cable, Oven, Toaster, Microwave, Water Heater, Range Hood, Remote Control |
| Kitchenware & Tableware | Spice Jar, Pot, Kettle, Cup, Jar, Bowl, Spoon, Knife, Plate, Chopping board, Chopstick, Stove, Rice Cooker |
| Containers | Bag, Box, Basket, Bucket, Bottle, Trash Can, Can, Lid, Ashtray |
| Bathroom & Cleaning | Faucet, Sink, Toilet, Toilet Seat, Toilet Lid, Shower, Bathtub, Mop, Broom, Brush, Sponge, Towel, Toothbrush, Toothpaste, Comb, Soap, Toilet Paper, Hose, Razor, Hair Dryer |
| Textiles & Bedding | Quilt, Blanket, Carpet, Curtain, Pillow, Cushion, Mattress |
| Stationery & Office Supplies | Book, Clock, Calendar, Pen, Sharpener, Scissors, Calculator, Mouse, Mousepad, Keyboard, Laptop Panel, Tablet Computer |
| Decor & Art | Plant, Painting, Picture, Poster, Label, Calendar, Vase |
| Daily Necessities | Phone, Hat, Slipper, Shoe, Umbrella, Headphones, Glove |
| Food | Fruit, Vegetable |
| Clothing | Shirt, Pants, Dress, Skirt, Coat, Shorts, Socks, Underwear |
| Fitness & Recreation | Treadmill, Dumbbells, Piano, Toy |
Table 6: Object category taxonomy.
C Qualitative examples
In Section 4.2.1 and Table 2, we show that our model can handle different types of object and spatial cognition tasks. In this section, we present more detailed qualitative examples of the model's individual abilities.
C.1 Object Cognition
- Properties (Figures 8, 9). The model discerns a wide range of object properties, including physical attributes such as size, color, and surface details, as well as functional affordances.
- Segmentation (Figure 8). The system performs both simple and situational referring expression segmentation, enabling it to isolate target objects in the scene based on natural language queries.
C.2 Spatial Cognition
- Trajectory Review (Figure 8). The model perceives the distance traversed by its own camera, allowing for a review of its past trajectory.
- Egocentric Direction (Figure 9). It successfully determines the direction of objects relative to its own perspective.
- Egocentric Distance (Figure 9). The system is capable of estimating the egocentric distance from itself to surrounding objects in the environment.
- Movement Imagery (Figure 8). A key capability is the imagination of prospective movements, allowing the model to reason about future paths.
- Spatial Imagery (Figure 8). The model demonstrates an ability for spatial imagination, such as inferring the layout of unseen areas.
- Object Size (Figure 9). Its spatial understanding extends to estimating the absolute sizes of objects and performing relative size comparisons between them.
- Object Height (Figure 8). Similarly, the model predicts and compares the heights of different objects.
- Object Distance (Figure 9). The system accurately gauges the distance between multiple objects within the scene (i.e., inter-object distance).
- Absolute Position (Figure 8). The model can ascertain the absolute positional relationships between objects.
- Relative Position (Figure 9). Furthermore, it demonstrates a robust understanding of the relative positions of objects with respect to one another.
<details>
<summary>figures/append-qual/qual-all-1.png Details</summary>

### Visual Description
\n
## Image Analysis: Object Cognition & Spatial Reasoning Questions with Visuals
### Overview
The image presents a grid of twelve panels, each containing a photograph and a question-answer pair related to object cognition and spatial reasoning. Each panel focuses on a specific dimension of understanding the scene depicted in the image. The questions are formatted as "Dimension: [Dimension Name] – [Question Type]: [Question Text]" and are followed by an answer labeled "A: [Answer Text]". The image appears to be designed for evaluating a system's ability to understand visual scenes.
### Components/Axes
The image is structured as a 3x4 grid. Each cell contains:
* **Image:** A photograph depicting a common indoor scene (office/kitchen area).
* **Question:** A text-based question related to the image, categorized by a "Dimension" (e.g., Object Cognition – Color, Spatial Cognition – Absolute Position).
* **Answer:** A text-based answer to the corresponding question.
The dimensions explored are:
* Object Cognition – Color
* Object Cognition – Category
* Object Cognition – Shape
* Spatial Cognition – Spatial Imagery
* Spatial Cognition – Absolute Position
* Object Cognition – Function
* Object Cognition – State
* Object Cognition – Material
* Spatial Cognition – Movement Imagery
* Spatial Cognition – Object Height
* Object Cognition – Size
* Spatial Cognition – Position
* Spatial Cognition – Trajectory Review
* Object Cognition – Object Segmentation
### Detailed Analysis / Content Details
Here's a transcription of the questions and answers from each panel:
1. **Dimension: Object Cognition – Color**
Q: What is the primary color of `<object0>`?
A: The object is primarily light brown.
2. **Dimension: Object Cognition – Category**
Q: What category does `<object0>` belong to?
A: The object is a piece of furniture, specifically a small wooden table or stand.
3. **Dimension: Object Cognition – Shape**
Q: What is the shape of `<object0>`?
A: The object has a rectangular shape with a flat top and open sides.
4. **Dimension: Spatial Cognition – Spatial Imagery**
Q: Positioned at `<object0>` with your view directed towards `<object1>`, in which direction is `<object2>` situated?
A: Right front.
5. **Dimension: Spatial Cognition – Absolute Position**
Q: Which one is above, `<object0>` or `<object1>`?
A: `<object0>`.
6. **Dimension: Object Cognition – Function**
Q: What is the function of `<object0>`?
A: The object is used for storage.
7. **Dimension: Object Cognition – State**
Q: What can be inferred about the state of the `<object0>`?
A: The object appears to be in a stationary state, not currently in use.
8. **Dimension: Object Cognition – Material**
Q: What material is `<object1>` likely made of?
A: `<object1>` is likely made of plastic.
9. **Dimension: Spatial Cognition – Movement Imagery**
Q: After you turn 90 degree to the left, where will `<object1>` be in relation to you?
A: `<object1>` will situate at the 6 o’clock direction from me.
10. **Dimension: Spatial Cognition – Object Height**
Q: How much higher or lower is `<object1>` compared to `<object0>` above the ground?
A: 1.03 meters.
11. **Dimension: Object Cognition – Size**
Q: How does the size of the `<object0>` compare to the other items on the table?
A: The object is relatively small compared to the other items on the table, such as the coffee machine and the trash bin.
12. **Dimension: Spatial Cognition – Position**
Q: What is the position of `<object0>`?
A: The object is mounted on the wall above the countertop.
13. **Dimension: Spatial Cognition – Trajectory Review**
Q: How far did you walk?
A: 1.83 meters.
14. **Dimension: Object Cognition – Object Segmentation**
Q: If I want to drink water, which object should I look for?
A: `<object0>`.
### Key Observations
* The questions cover a broad range of visual understanding capabilities, from basic object properties (color, shape, material) to more complex spatial reasoning (position, movement, trajectory).
* The use of `<object0>`, `<object1>`, `<object2>` suggests a system that can identify and track objects within the scene.
* The answers are generally concise and specific, indicating a focus on precise understanding.
* The numerical values (1.03 meters, 1.83 meters) suggest the system can estimate distances and heights.
### Interpretation
This image represents a test suite for evaluating a computer vision or AI system's ability to understand and reason about visual scenes. The questions are designed to assess the system's performance across different dimensions of visual intelligence. The consistent format and clear question-answer pairs suggest a structured evaluation process. The inclusion of both object-level and spatial reasoning questions indicates a holistic approach to assessing visual understanding. The use of placeholders like `<object0>` implies the system is capable of object detection and referencing. The questions are not simply about *seeing* the objects, but about *understanding* their relationships and properties within the scene. The answers provided suggest a system that can perform basic spatial calculations and infer object functions and states. The overall goal appears to be to develop a system that can "see" and "understand" the world in a way that is similar to human perception.
</details>
Figure 8: Visualization of question answering examples. Part 1 out of 2.
<details>
<summary>figures/append-qual/qual-all-3.png Details</summary>

### Visual Description
\n
## Image Analysis: Question-Answer Pairs with Scene Images
### Overview
The image presents a series of question-answer pairs, each accompanied by a photograph of a domestic scene. The questions relate to spatial cognition and object recognition, probing understanding of distances, sizes, shapes, positions, and functions of objects within the scenes. Each question is labeled with "Dimension: [Cognitive Area]".
### Components/Axes
The image is structured as a grid of question-answer blocks. Each block contains:
* **Dimension Label:** Indicates the cognitive area being tested (e.g., Spatial Cognition, Object Cognition).
* **Question (Q):** A specific question about the scene.
* **Answer (A):** The provided answer to the question.
* **Scene Image:** A photograph illustrating the context for the question.
### Detailed Analysis / Content Details
Here's a transcription of each question-answer pair, along with observations about the accompanying image:
**1. Dimension: Spatial Cognition—Egocentric Distance**
* Q: What is the distance between me and <object0>?
* A: 1.63m.
* Image: A person is seated at a table with a laptop and a mug. <object0> appears to be the mug.
**2. Dimension: Spatial Cognition—Object Distance**
* Q: What is the distance between <object0> and <object1>?
* A: It is 1.25 meters.
* Image: Same scene as above. <object0> and <object1> appear to be the mug and the laptop, respectively.
**3. Dimension: Object Cognition—Surface Detail**
* Q: What's the surface detail of <object0>?
* A: The surface of <object0> is smooth and reflective.
* Image: Same scene. <object0> is the laptop screen.
**4. Dimension: Spatial Cognition—Object Size**
* Q: How tall is <object1>?
* A: It is 1.02 meters.
* Image: Same scene. <object1> is the person.
**5. Dimension: Spatial Cognition—Egocentric Direction**
* Q: Is <object0> on your left front or right front in the last frame?
* A: Left front.
* Image: Same scene. <object0> is the mug.
**6. Dimension: Spatial Cognition—Relative Position**
* Q: Is <object0> directly above <object1>?
* A: No, they are at the same height.
* Image: Same scene. <object0> is the mug, and <object1> is the laptop.
**7. Dimension: Object Cognition—Shape**
* Q: What is the shape of <object0>?
* A: The object has a classic teddy bear shape with a round head and body, and limbs.
* Image: A child is holding a teddy bear. <object0> is the teddy bear.
**8. Dimension: Spatial Cognition—Relative Position**
* Q: Is <object0> between <object1> and <object2>?
* A: No.
* Image: A child is holding a teddy bear. <object0> is the teddy bear, <object1> is the child's hand, and <object2> is the child's arm.
**9. Dimension: Spatial Cognition—Object Size**
* Q: What is the size of <object0> compared to the iPad on the desk?
* A: The object is larger in size compared to the iPad on the desk.
* Image: A child is holding a teddy bear next to an iPad on a desk. <object0> is the teddy bear.
**10. Dimension: Object Cognition—Category**
* Q: What category does <object0> belong to?
* A: The object is a stuffed toy, specifically a teddy bear.
* Image: A child is holding a teddy bear. <object0> is the teddy bear.
**11. Dimension: Object Cognition—Object Segmentation**
* Q: If I want to check the current weather and time sitting at the desk, where should I look?
* A: (no textual answer; as a segmentation question, the response appears to be rendered as a mask in the figure)
* Image: A person is seated at a desk with a laptop.
**12. Dimension: Object Cognition—Function**
* Q: What is the function of <object0>?
* A: The object provides water, which can be used for drinking or cooking.
* Image: A person is seated at a table with a mug. <object0> is the mug.
**13. Dimension: Spatial Cognition—Egocentric Distance**
* Q: Among <object0>, <object1>, and <object2>, which one is nearer to you?
* A: <object0>.
* Image: A person is seated at a table with a mug, a laptop, and a phone. <object0> is the mug, <object1> is the laptop, and <object2> is the phone.
### Key Observations
* The questions consistently use placeholders like `<object0>`, `<object1>`, and `<object2>`, indicating a system for object identification within the images.
* The answers provide quantitative data (distances in meters) and qualitative descriptions (surface details, shapes, functions).
* The scenes depict everyday environments, suggesting the task is to assess understanding of common spatial relationships and object properties.
* The questions cover a range of cognitive abilities, including distance estimation, shape recognition, and functional understanding.
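The QA format observed above can be sketched as a small record type: each entry pairs a cognitive dimension with a question whose `<objectN>` placeholders refer to annotated regions. The snippet below is a minimal illustration, not RynnEC-Bench's actual schema; the field names (`dimension`, `question`, `answer`, `regions`) and the `resolve_placeholders` helper are assumptions for exposition.

```python
import re

# Hypothetical shape of one benchmark record (field names are assumptions).
record = {
    "dimension": "Spatial Cognition—Egocentric Distance",
    "question": "What is the distance between me and <object0>?",
    "answer": "1.63m.",
    "regions": {"object0": "the mug"},  # placeholder -> referenced region
}

def resolve_placeholders(question: str, regions: dict) -> str:
    """Replace each <objectN> placeholder with its region description.

    Unresolved placeholders are left as-is so partial annotations
    remain visible in the rendered question.
    """
    return re.sub(
        r"<(object\d+)>",
        lambda m: regions.get(m.group(1), m.group(0)),
        question,
    )

resolved = resolve_placeholders(record["question"], record["regions"])
# -> "What is the distance between me and the mug?"
```

In the actual model, such placeholders would be resolved against region features from the mask encoder rather than text descriptions; the sketch only shows how the question template and region references compose.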
### Interpretation
This image represents a dataset for evaluating AI or human performance in visual understanding and spatial reasoning. The questions test the ability to perceive and interpret relationships between objects in a scene, as well as to understand the properties of those objects. The use of precise measurements (distances in meters) suggests a focus on accurate spatial perception, and the mix of spatial and object-based questions indicates a holistic assessment of visual cognition. The consistent format and use of placeholders suggest this is part of a larger, automated evaluation system. Block 11, labeled Object Segmentation, has no textual answer; its response is most plausibly given as a segmentation mask in the figure rather than as text.
</details>
Figure 9: Visualization of question answering examples. Part 2 out of 2.