# Kimi-VL Technical Report
**Authors**: Kimi Team
Abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities —all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains.
Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks.
Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities (84.4 on MMBench, 83.2 on V* and 52.8 on ScreenSpot-Pro). With only around 3B activated parameters, it sets a new standard for efficient yet capable multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Scatter Plot: Model Performance vs. Parameters
### Overview
This image presents a scatter plot comparing the performance of various models on a MathVision Pass@1 task against the number of activated parameters they utilize. Each model is represented by a data point, and a trend line is fitted to a subset of the models. The plot aims to illustrate the relationship between model size (parameter count) and mathematical reasoning ability.
### Components/Axes
* **X-axis:** Activated Parameters (B) - Scale ranges from approximately 3 to 75 Billion parameters.
* **Y-axis:** MathVision Pass@1 - Scale ranges from approximately 25 to 65.
* **Data Points:** Represent individual models.
* **Trend Line:** A dashed red line attempting to show the correlation between parameters and performance for a subset of models.
* **Legend:** Implicitly defined by the labels next to each data point.
### Detailed Analysis
The following data points are visible, with approximate values read from the plot:
* **Kimi-VL-A3B-Thinking-2506 (Purple Star):** Approximately (3, 35.5).
* **Kimi-VL-A3B-Thinking (Purple Star):** Approximately (3, 33).
* **DeepSeek-VL2-44.5B (Dark Blue Circle):** Approximately (7, 27).
* **Llama-3.2-11B-Inst. (Dark Blue Circle):** Approximately (11, 27.5).
* **Gemma-3-4B-IT (Orange Circle):** Approximately (11, 30).
* **Qwen-2.5-VL-3B (Orange Circle):** Approximately (11, 29).
* **Gemma-3-12B-IT (Orange Circle):** Approximately (33, 33).
* **Qwen-2.5-VL-32B (Red Circle):** Approximately (33, 35).
* **Qwen-2.5-VL-72B (Red Circle):** Approximately (73, 36).
* **QVQ-72B-Preview (Red Circle):** Approximately (73, 52).
* **QVQ-Max-Preview (Red Circle):** Approximately (73, 54).
* **Qwen-2.5-VL-7B (Orange Circle):** Approximately (11, 31).
The trend line (dashed red) connects the following points: Gemma-3-4B-IT, Gemma-3-12B-IT, Qwen-2.5-VL-32B, Qwen-2.5-VL-72B. The line shows a generally upward trend, indicating that as the number of activated parameters increases, the MathVision Pass@1 score tends to increase as well.
### Key Observations
* **Outliers:** Kimi-VL-A3B-Thinking-2506 and Kimi-VL-A3B-Thinking show relatively high performance with a small number of parameters compared to other models.
* **Trend:** The trend line suggests a positive correlation between model size and performance, but the correlation is not strong, as evidenced by the scatter of points around the line.
* **Clustering:** Models with similar parameter counts tend to cluster together, particularly in the 10-12B range.
* **QVQ Models:** The QVQ models (QVQ-72B-Preview and QVQ-Max-Preview) demonstrate the highest performance, but also require the largest number of parameters.
### Interpretation
The data suggests that increasing the number of activated parameters generally improves performance on the MathVision Pass@1 task. However, the relationship is not linear, and there is significant variation among models with similar parameter counts. The Kimi models stand out as achieving high performance with relatively few parameters, suggesting a potentially more efficient architecture or training methodology. The QVQ models represent the state-of-the-art in terms of performance, but at the cost of significantly increased computational resources. The trend line provides a rough estimate of the expected performance gain for a given increase in parameters, but it should be interpreted with caution due to the scatter in the data. The plot highlights the trade-off between model size, performance, and computational cost in the context of mathematical reasoning.
</details>
Figure 1: Comparison between Kimi-VL-Thinking-2506 and frontier open-source VLMs, including short-thinking VLMs (e.g. the Gemma-3 and Qwen2.5-VL series) and long-thinking VLMs (QVQ-72B/Max-Preview), on the MathVision benchmark. Our model achieves strong multimodal reasoning with just 2.8B activated LLM parameters.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Multimodal Model Performance Comparison
### Overview
The image presents a bar chart comparing the performance of several multimodal large language models (LLMs) across various benchmark datasets. The chart displays scores for Kimi-VL-A3B, Qwen2.5-VL-7B, DeepSeek-VL2, GPT-4o-mini, Llama-3.2-11B-Inst, and Gemma-3-12B-IT on benchmarks categorized as GENERAL, OCR, MULTI-IMAGE, LONG VIDEO, LONG DOC, and AGENT.
### Components/Axes
* **X-axis:** Represents the benchmark datasets: MMMU (val), MMBench-EN-v1.1, InfoVQA, BLINK, LongVideoBench, Video-MME (w/o sub), MMLongBench-Doc, ScreenSpot-Pro, OSWorld (Pass@1).
* **Y-axis:** Represents the performance score, ranging from 0 to 90 (approximately). The scale is not explicitly labeled, but can be inferred from the values displayed.
* **Legend:** Located at the top of the chart, identifies each model with a corresponding color:
* Kimi-VL-A3B (Blue)
* Qwen2.5-VL-7B (Orange)
* DeepSeek-VL2 (Gray)
* GPT-4o-mini (Green)
* Llama-3.2-11B-Inst (Purple)
* Gemma-3-12B-IT (Teal)
* **Category Headers:** "GENERAL", "OCR", "MULTI-IMAGE", "LONG VIDEO", "LONG DOC", and "AGENT" are positioned above their respective benchmark groups.
### Detailed Analysis
Here's a breakdown of the performance scores for each model on each benchmark, with approximate values:
**GENERAL**
* **MMMU (val):** Kimi-VL-A3B: 57.6, Qwen2.5-VL-7B: 51.1, DeepSeek-VL2: 60, GPT-4o-mini: 48, Llama-3.2-11B-Inst: 59.8, Gemma-3-12B-IT: ~56.
* **MMBench-EN-v1.1:** Kimi-VL-A3B: 83.1, Qwen2.5-VL-7B: 79.6, DeepSeek-VL2: 77.1, GPT-4o-mini: 74.6, Llama-3.2-11B-Inst: 82.6, Gemma-3-12B-IT: 65.8.
**OCR**
* **InfoVQA:** Kimi-VL-A3B: 83.2, Qwen2.5-VL-7B: 78.1, DeepSeek-VL2: 57.9, GPT-4o-mini: 43.8, Llama-3.2-11B-Inst: ~80, Gemma-3-12B-IT: 34.6.
**MULTI-IMAGE**
* **BLINK:** Kimi-VL-A3B: 57.3, Qwen2.5-VL-7B: 56.4, DeepSeek-VL2: 53.6, GPT-4o-mini: 39.8, Llama-3.2-11B-Inst: ~50.3, Gemma-3-12B-IT: ~50.
**LONG VIDEO**
* **LongVideoBench:** Kimi-VL-A3B: 64.5, Qwen2.5-VL-7B: 58.2, DeepSeek-VL2: 51.5, GPT-4o-mini: 45.5, Llama-3.2-11B-Inst: ~60, Gemma-3-12B-IT: ~48.
* **Video-MME (w/o sub):** Kimi-VL-A3B: 67.8, Qwen2.5-VL-7B: 65.1, DeepSeek-VL2: 64.8, GPT-4o-mini: 46, Llama-3.2-11B-Inst: ~60, Gemma-3-12B-IT: ~50.
**LONG DOC**
* **MMLongBench-Doc:** Kimi-VL-A3B: 35.1, Qwen2.5-VL-7B: 29.6, DeepSeek-VL2: 29, GPT-4o-mini: 21.3, Llama-3.2-11B-Inst: 13.8, Gemma-3-12B-IT: ~20.
**AGENT**
* **ScreenSpot-Pro:** Kimi-VL-A3B: 34.5, Qwen2.5-VL-7B: 29, DeepSeek-VL2: ~10, GPT-4o-mini: ~10, Llama-3.2-11B-Inst: ~20, Gemma-3-12B-IT: 0.8.
* **OSWorld (Pass@1):** Kimi-VL-A3B: 8.2, Qwen2.5-VL-7B: 5, DeepSeek-VL2: 5, GPT-4o-mini: 2.5, Llama-3.2-11B-Inst: ~5, Gemma-3-12B-IT: ~2.5.
### Key Observations
* **Kimi-VL-A3B** consistently performs well across most benchmarks, often achieving the highest scores.
* **Qwen2.5-VL-7B** generally performs second best, but lags behind Kimi-VL-A3B.
* **DeepSeek-VL2** shows moderate performance, generally falling in the middle range.
* **GPT-4o-mini** exhibits lower scores, particularly on the InfoVQA and LONG DOC benchmarks.
* **Llama-3.2-11B-Inst** shows variable performance, with strong results on some benchmarks (MMBench-EN-v1.1) and weaker results on others (MMLongBench-Doc).
* **Gemma-3-12B-IT** consistently shows the lowest performance across most benchmarks.
* The performance gap between models is most pronounced on the LONG DOC and AGENT benchmarks.
### Interpretation
The chart demonstrates a clear hierarchy in the performance of these multimodal LLMs. Kimi-VL-A3B emerges as the leading model, excelling in a broad range of tasks. The results suggest that model architecture, training data, and model size all contribute to performance differences. The lower scores on LONG DOC and AGENT benchmarks may indicate challenges in processing long-form content or complex reasoning tasks. The significant performance gap between models highlights the ongoing research and development efforts in the field of multimodal AI. The data suggests that while progress has been made, there is still considerable room for improvement, particularly in areas requiring advanced reasoning and long-context understanding. The variation in performance across different benchmarks also suggests that no single model is universally superior; the optimal choice depends on the specific application and requirements.
</details>
Figure 2: Highlights of Kimi-VL performance across a wide range of benchmarks: general (MMMU, MMBench), OCR (InfoVQA), multi-image (BLINK), long video (LongVideoBench, Video-MME), long document (MMLongBench-Doc), and agent (ScreenSpot-Pro and OSWorld). Detailed results are presented in Table 3.
1 Introduction
With the rapid advancement of artificial intelligence, human expectations for AI assistants have transcended traditional language-only interactions, increasingly aligning with the inherently multimodal nature of our world. To meet these expectations, new generations of natively multimodal models, such as GPT-4o \parencite{openai2024gpt4ocard} and Google Gemini \parencite{geminiteam2024gemini15unlockingmultimodal}, have emerged with the capability to seamlessly perceive and interpret visual inputs alongside language processing. Most recently, advanced multimodal models, pioneered by the OpenAI o1 series \parencite{o12024} and Kimi k1.5 \parencite{team2025kimi}, have further pushed these boundaries by incorporating deeper and longer reasoning on multimodal inputs, thereby tackling more complex problems in the multimodal domain.
Nevertheless, the development of large VLMs in the open-source community has significantly lagged behind their language-only counterparts, particularly in scalability, computational efficiency, and advanced reasoning capabilities. While the language-only model DeepSeek R1 \parencite{deepseekai2025deepseekr1incentivizingreasoningcapability} has already leveraged the efficient and more scalable mixture-of-experts (MoE) architecture and facilitated sophisticated long chain-of-thought (CoT) reasoning, most recent open-source VLMs, e.g. Qwen2.5-VL \parencite{bai2025qwen25vltechnicalreport} and Gemma-3 \parencite{gemmateam2025gemma3technicalreport}, continue to rely on dense architectures and do not support long-CoT reasoning. Early explorations into MoE-based vision-language models, such as DeepSeek-VL2 \parencite{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels} and Aria \parencite{li2024ariaopenmultimodalnative}, exhibit limitations in other crucial dimensions. Architecturally, both models still adopt relatively traditional fixed-size vision encoders, hindering their adaptability to diverse visual inputs. From a capability perspective, DeepSeek-VL2 supports only a limited context length (4K), while Aria falls short in fine-grained visual tasks. Additionally, neither supports long-thinking abilities. Consequently, there remains a pressing need for an open-source VLM that effectively integrates structural innovation, stable capabilities, and enhanced reasoning through long thinking.
In light of this, we present Kimi-VL, a vision-language model for the open-source community. Structurally, Kimi-VL consists of our Moonlight \parencite{liu2025muonscalablellmtraining} MoE language model with only 2.8B activated (16B total) parameters, paired with a 400M-parameter native-resolution MoonViT vision encoder. In terms of capability, as illustrated in Figure 2, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.). Specifically, it features the following exciting abilities:
1) Kimi-VL is smart: its text ability is comparable to that of efficient text-only LLMs; even without long thinking, Kimi-VL is already competitive on multimodal reasoning and multi-turn agent benchmarks, e.g., MMMU, MathVista, OSWorld.
2) Kimi-VL processes long contexts: it effectively tackles long-context understanding over various multimodal inputs within its 128K context window, far ahead of similar-scale competitors on long video benchmarks and MMLongBench-Doc.
3) Kimi-VL perceives clearly: it shows all-round competitive ability against existing efficient dense and MoE VLMs in various vision-language scenarios: visual perception, visual world knowledge, OCR, high-resolution OS screenshots, etc.
Furthermore, with long-CoT activation and reinforcement learning (RL), we introduce the long-thinking version of Kimi-VL, Kimi-VL-Thinking, which substantially improves performance on more complex multimodal reasoning scenarios. Despite its small scale, Kimi-VL-Thinking offers compelling performance on hard reasoning benchmarks (e.g., MMMU, MathVision, MathVista), outperforming many state-of-the-art VLMs of even larger sizes. We further release an improved version of the thinking model, Kimi-VL-Thinking-2506. The improved version achieves even better performance on these reasoning benchmarks while retaining or improving performance on common visual perception and understanding scenarios, e.g. high-resolution perception (V*), OS grounding, and video and long-document understanding.
2 Approach
2.1 Model Architecture
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Mixture-of-Experts (MoE) Language Decoder & MoonViT Pipeline
### Overview
This diagram illustrates a pipeline combining a Mixture-of-Experts (MoE) language decoder with MoonViT, a native-resolution vision transformer, for processing visual data. The pipeline takes long videos or small images as input, processes them through MoonViT, projects the features using an MLP (Multi-Layer Perceptron), and then feeds them into the MoE language decoder. A screenshot of a mobile phone UI is also included, seemingly representing the output or application of the system.
### Components/Axes
The diagram is segmented into several key components:
* **Mixture-of-Experts (MoE) Language Decoder:** Located in the top-left, this section depicts the architecture of the MoE decoder. It includes "MoE FFN" (Feed Forward Network), "Attention Layer", and a "Router" connecting to "Non-shared Experts" and "Shared Experts". The Router has arrows pointing to multiple expert blocks, labeled "X N".
* **MoonViT:** Positioned in the center, this component represents the vision transformer model. It's labeled "(Native-resolution)".
* **MLP Projector:** A rectangular block between MoonViT and the MoE decoder.
* **Input Data:** Two input sources are shown: "LONG VIDEO" (stacked video frames) and "SMALL IMAGE".
* **Fine-Grained:** A section displaying a detailed image of a person.
* **Screenshot:** A UI screenshot of a mobile phone, labeled "UI SCREENSHOT", is on the right.
* **Arrows & Dimensions:** Arrows indicate the data flow, with pixel dimensions labeled along the arrows (e.g., "20px", "50px", "270px", "480px", "59px", "1113px", "100px", "672px", "1731px", "800px").
* **Text:** Several text blocks are present, including "What can you interpret from...", "2a-b", "Text? That is the exciting competition going on", and "OCR (SPECIAL ASPECT RATIO)".
### Detailed Analysis or Content Details
**1. MoE Language Decoder:**
* The MoE decoder consists of an Attention Layer, an MoE FFN, and a Router.
* The Router directs input to both "Non-shared Experts" and "Shared Experts".
* The "X N" notation suggests a variable number of experts (N).
**2. MoonViT:**
* The MoonViT model operates at "Native-resolution".
**3. Input Data Flow:**
* **Long Video:** A stack of video frames (approximately 270px high and 480px wide) is fed into MoonViT via a 50px arrow.
* **Small Image:** A single image (dimensions not explicitly stated, but implied to be smaller than the video frames) is also fed into MoonViT via a 20px arrow.
* **Fine-Grained Image:** A detailed image of a person (approximately 59px high and 1113px wide) is shown, likely representing the output of MoonViT or an intermediate representation.
**4. MLP Projector:**
* The MLP projector connects MoonViT to the MoE decoder.
**5. Screenshot:**
* The screenshot displays a mobile phone UI with various app icons.
* The top of the screen shows "33°" and "11" (likely temperature and time).
* Visible app icons include: Calendar, Safari, App Store, Camera, Photos, Clock, Music, and others.
* The screenshot is approximately 1731px high and 800px wide.
**6. Text Blocks:**
* "What can you interpret from..." - A prompt or question.
* "2a-b" - A mathematical expression or label.
* "Text? That is the exciting competition going on" - A statement about the current research landscape.
* "OCR (SPECIAL ASPECT RATIO)" - Indicates Optical Character Recognition is being used, potentially with a focus on handling varying aspect ratios.
### Key Observations
* The diagram highlights a multi-modal approach, combining visual processing (MoonViT) with language modeling (MoE decoder).
* The use of MoE suggests a focus on scalability and efficiency in the language decoder.
* The inclusion of pixel dimensions indicates a concern for computational resources and model size.
* The screenshot suggests the system is intended for use on mobile devices.
* The "OCR (SPECIAL ASPECT RATIO)" label suggests the system may be designed to process text from images with varying aspect ratios.
### Interpretation
The diagram depicts a system designed to understand and generate language based on visual input. The MoonViT model extracts features from images or videos, which are then projected and fed into the MoE language decoder. The MoE decoder likely generates text descriptions or answers questions about the visual content. The system's architecture suggests a focus on handling complex visual scenes and generating coherent language outputs. The inclusion of the mobile phone screenshot implies a potential application in areas such as image captioning, visual question answering, or assistive technology for mobile devices. The mention of OCR and special aspect ratios suggests the system is robust to variations in text presentation within images. The overall pipeline represents a sophisticated approach to multi-modal learning, leveraging the strengths of both vision transformers and mixture-of-experts language models. The "What can you interpret from..." prompt suggests the diagram is part of a presentation or research paper exploring the capabilities of this system.
</details>
Figure 3: The model architecture of Kimi-VL and Kimi-VL-Thinking, consisting of a MoonViT that allows native-resolution images, an MLP projector, and a Mixture-of-Experts (MoE) language decoder.
The architecture of Kimi-VL consists of three parts: a native-resolution vision encoder (MoonViT), an MLP projector, and an MoE language model, as depicted in Figure 3. We introduce each part in this section.
MoonViT: A Native-resolution Vision Encoder
We design MoonViT, the vision encoder of Kimi-VL, to natively process images at their varying resolutions, eliminating the need for the complex sub-image splitting and splicing operations employed in LLaVA-OneVision \parencite{li2024llavaonevisioneasyvisualtask}. We incorporate the packing method from NaViT \parencite{dehghani2023patchnpacknavit}, where images are divided into patches, flattened, and sequentially concatenated into 1D sequences. This preprocessing enables MoonViT to share the same core computation operators and optimizations as a language model, such as the variable-length sequence attention mechanism supported by FlashAttention \parencite{dao2022flashattentionfastmemoryefficientexact}, ensuring uncompromised training throughput for images of varying resolutions.
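As an illustrative sketch of this packing scheme (not the actual Kimi-VL implementation; the patch size and image shapes below are arbitrary), each image is split into patches, flattened, and concatenated into one 1D sequence, with cumulative-length offsets of the kind variable-length attention kernels consume:

```python
# Illustrative NaViT-style patch packing: variable-resolution images become
# one flat patch sequence plus cumulative sequence lengths, so images of
# different sizes can share a batch under variable-length attention.
import numpy as np

def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) array.
    H and W are assumed to be multiples of `patch`."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)   # flatten patches to tokens

def pack(images, patch: int = 14):
    """Concatenate per-image patch sequences into one packed 1D sequence.
    `cu_seqlens` marks where each image starts and ends, in the style
    variable-length attention kernels expect."""
    seqs = [patchify(im, patch) for im in images]
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return np.concatenate(seqs, axis=0), cu_seqlens

# Two images of different resolutions packed together.
imgs = [np.zeros((28, 28, 3)), np.zeros((56, 84, 3))]
packed, cu = pack(imgs)
print(packed.shape, cu.tolist())  # (28, 588) [0, 4, 28]
```

The offsets let an attention kernel mask each image's patches into their own segment, so no padding tokens are wasted on the smaller image.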
MoonViT is initialized from and continually pre-trained on SigLIP-SO-400M \parencite{zhai2023sigmoidlosslanguageimage}, which originally employs learnable fixed-size absolute positional embeddings to encode spatial information. While we interpolate these original position embeddings to better preserve SigLIP's capabilities, the interpolated embeddings become increasingly inadequate as image resolution increases. To address this limitation, we incorporate 2D rotary positional embeddings (RoPE) \parencite{su2023roformerenhancedtransformerrotary} across the height and width dimensions, which improves the representation of fine-grained positional information, especially in high-resolution images. These two positional embedding approaches work together to encode spatial information for our model and integrate seamlessly with the flattening and packing procedures, enabling MoonViT to efficiently process images of varying resolutions within the same batch. The resulting continuous image features are then forwarded to the MLP projector and, ultimately, to the MoE language model for subsequent training stages. For Kimi-VL-A3B-Thinking-2506, we further continually train MoonViT to natively encode up to 3.2 million pixels from a single image, four times the original limit.
MLP Projector
We employ a two-layer MLP to bridge the vision encoder (MoonViT) and the LLM. Specifically, we first use a pixel shuffle operation to compress the spatial dimension of the image features extracted by MoonViT, performing 2×2 downsampling in the spatial domain and correspondingly expanding the channel dimension. We then feed the pixel-shuffled features into a two-layer MLP to project them into the dimension of LLM embeddings.
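The projector described above can be sketched as follows; this is an illustrative numpy version, with toy dimensions, a GELU activation (a common choice; the report does not state the activation), and random weights standing in for learned ones:

```python
import numpy as np

def pixel_shuffle_down(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Space-to-depth: (H, W, C) -> (H/r, W/r, C*r*r), compressing the
    spatial dimensions and expanding channels, as the projector expects."""
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // r, W // r, C * r * r)

def mlp_project(x, w1, b1, w2, b2):
    """Two-layer MLP projecting vision features into the LLM embedding
    dimension (tanh-approximated GELU between the layers)."""
    h = x @ w1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 32))            # toy ViT feature map
x = pixel_shuffle_down(feats)                    # (8, 8, 128): 2x2 downsample
tokens = x.reshape(-1, x.shape[-1])              # (64, 128) visual tokens
d_model = 64                                     # toy LLM width
w1, b1 = rng.normal(size=(128, 256)), np.zeros(256)
w2, b2 = rng.normal(size=(256, d_model)), np.zeros(d_model)
out = mlp_project(tokens, w1, b1, w2, b2)
print(out.shape)  # (64, 64)
```

The 2×2 shuffle quarters the number of visual tokens fed to the language model, which is where most of the projector's compute savings come from.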
Mixture-of-Experts (MoE) Language Model
The language model of Kimi-VL utilizes our Moonlight model \parencite{liu2025muonscalablellmtraining}, an MoE language model with 2.8B activated parameters, 16B total parameters, and an architecture similar to DeepSeek-V3 \parencite{deepseekai2025deepseekv3technicalreport}. For our implementation, we initialize from an intermediate checkpoint in Moonlight's pre-training stage, one that has processed 5.2T tokens of pure text data with an 8192-token (8K) context length. We then continue pre-training it on a joint recipe of multimodal and text-only data totaling 2.3T tokens, as detailed in Sec. 2.3.
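The routing pattern of such a decoder (shared experts that see every token, plus a router selecting top-k routed experts per token) can be sketched as below; the expert count, top_k, and softmax gating here are illustrative assumptions, not Kimi-VL's actual configuration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, routed_experts, shared_experts, router_w, top_k=2):
    """Toy MoE FFN: every token passes through the shared expert(s), and
    additionally through its top-k routed experts, weighted by its
    renormalized router scores."""
    scores = softmax(x @ router_w)                    # (T, E) router probs
    top = np.argsort(-scores, axis=-1)[:, :top_k]     # (T, k) chosen experts
    out = sum(f(x) for f in shared_experts)           # always-active experts
    for t in range(x.shape[0]):
        w = scores[t, top[t]]
        w = w / w.sum()                               # renormalize the gates
        for gate, e in zip(w, top[t]):
            out[t] += gate * routed_experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, E, T = 8, 4, 5
def make_expert():
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    return lambda v: v @ W
routed = [make_expert() for _ in range(E)]
shared = [make_expert()]
router_w = rng.normal(size=(d, E))
x = rng.normal(size=(T, d))
y = moe_layer(x, routed, shared, router_w)
print(y.shape)  # (5, 8)
```

Only the selected experts' parameters participate per token, which is how the model keeps 2.8B activated parameters out of 16B total.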
2.2 Muon Optimizer
We use an enhanced Muon optimizer \parencite{liu2025muon} for model optimization. Compared to the original Muon optimizer \parencite{jordan2024muon}, we add weight decay and carefully adjust the per-parameter update scale. Additionally, we develop a distributed implementation of Muon following the ZeRO-1 \parencite{rajbhandari2020zero} optimization strategy, which achieves optimal memory efficiency and reduced communication overhead while preserving the algorithm's mathematical properties. This enhanced Muon optimizer is used throughout the entire training process to optimize all model parameters, including the vision encoder, the projector, and the language model.
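For intuition, a minimal sketch of a Muon-style step, following the publicly described algorithm (momentum, then Newton-Schulz orthogonalization of the 2D update, with decoupled weight decay as mentioned above); the quintic coefficients come from the open-source Muon implementation, and all hyperparameters are placeholders rather than this report's settings:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G, pushing its singular values toward 1,
    via a quintic Newton-Schulz iteration. Coefficients are taken from the
    open-source Muon implementation, not from this report."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm
    if G.shape[0] > G.shape[1]:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.shape[0] > G.shape[1] else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    """One illustrative Muon update: momentum accumulation, orthogonalized
    update direction, decoupled weight decay."""
    momentum = beta * momentum + grad
    update = newton_schulz(momentum)
    W = W * (1 - lr * weight_decay) - lr * update
    return W, momentum

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
m = np.zeros_like(W)
W, m = muon_step(W, rng.normal(size=W.shape), m)
# After a few iterations the update's singular values cluster near 1.
s = np.linalg.svd(newton_schulz(rng.normal(size=(16, 32))), compute_uv=False)
print(s.min(), s.max())
```

The orthogonalized update equalizes the scale of the step across directions, which is the property the distributed ZeRO-1 implementation has to preserve.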
2.3 Pre-Training Stages
As illustrated in Figure 4 and Table 1, after loading the intermediate language model discussed above, Kimi-VL’s pre-training comprises a total of 4 stages consuming 4.4T tokens overall: first, standalone ViT training to establish a robust native-resolution visual encoder, followed by three joint training stages (pre-training, cooldown, and long-context activation) that simultaneously enhance the model’s language and multimodal capabilities. The details are as follows.
<details>
<summary>x4.png Details</summary>

### Visual Description
\n
## Diagram: Training Pipeline Stages
### Overview
The image depicts a sequential training pipeline consisting of four main stages: Text Pre-training, Joint Pre-training, Joint Cooldown, and Joint Long-context. Below these stages is a separate stage for Vision Transformer (ViT) Training. Each stage is represented by a colored rectangle containing information about the data used, the training process, and relevant parameters. Arrows indicate the flow of the training process.
### Components/Axes
The diagram consists of five rectangular blocks arranged horizontally. Each block represents a training stage. The blocks are colored as follows:
- Text Pre-training: Blue
- Joint Pre-training: Green
- Joint Cooldown: Yellow
- Joint Long-context: Orange
- ViT Training: Light Blue
Each block contains text labels describing the stage, data size, and specific training details. There are also two circular icons with checkmarks and text "resumes LR scheduler" positioned above the Joint Pre-training and Joint Long-context stages.
### Detailed Analysis or Content Details
**1. Text Pre-training (Blue)**
- Data: 5.2T data
- Data Type: Pure Text Data
**2. Joint Pre-training (Green)**
- Data: 1.4T data
- Data Composition: Up to 40% Multimodal Data
- Training Approach: Progressive Multimodal Ratio
- Icon: "resumes LR scheduler" (top-left of the block)
**3. Joint Cooldown (Yellow)**
- Data: 0.6T data
- Data Quality: High-quality Text & Multimodal Data
- Training Approach: Re-warmup to higher LR
**4. Joint Long-context (Orange)**
- Data: 0.3T data
- Data Type: Long Text & Long Video & Long Doc
- Parameter: RoPE base: 50,000 -> 800,000
- Icon: "resumes LR scheduler" (top-left of the block)
**5. ViT Training (Light Blue)**
- Data: 0.0T -> 0.1T data
- Training Method: CoCa-loss with tiny language decoder -> align to LLM
### Key Observations
- The data size decreases as the training progresses from Text Pre-training to Joint Long-context.
- The training process transitions from pure text data to increasingly multimodal data.
- The "resumes LR scheduler" icon suggests a learning rate scheduling strategy is employed in the Joint Pre-training and Joint Long-context stages.
- The ViT training is a separate process, potentially running concurrently or as a pre-processing step for the multimodal data.
- The RoPE base parameter in the Joint Long-context stage indicates a focus on handling long sequences.
### Interpretation
This diagram illustrates a multi-stage training pipeline for a large language model (LLM) that incorporates vision capabilities. The pipeline begins with pre-training on a massive corpus of text data, then gradually introduces multimodal data (images, videos, documents) during the Joint Pre-training phase. The Joint Cooldown stage likely fine-tunes the model after the initial multimodal pre-training. Finally, the Joint Long-context stage focuses on extending the model's ability to process long sequences, potentially using techniques like RoPE (Rotary Positional Embedding). The separate ViT training suggests that visual features are extracted using a Vision Transformer and then integrated into the LLM. The decreasing data size across stages could indicate a focus on higher-quality data or more efficient training methods in later stages. The "resumes LR scheduler" icon suggests a dynamic learning rate adjustment strategy to optimize training performance. The overall pipeline aims to create a powerful multimodal LLM capable of understanding and generating both text and visual content.
</details>
Figure 4: The pre-training stages of Kimi-VL consume a total of 4.4T tokens after text-only pre-training of its language model. To preserve text abilities, all stages that update the language model are joint training stages.
Table 1: Overview of training stages: data composition, token volumes, sequence lengths, and trainable components.
| Stages | ViT Training | Joint Pre-training | Joint Cooldown | Joint Long-context |
| --- | --- | --- | --- | --- |
| Data | Alt text, Synthetic Caption, Grounding, OCR | + Text, Knowledge, Interleaving, Video, Agent | + High-quality Text, High-quality Multimodal, Academic Sources | + Long Text, Long Video, Long Document |
| Tokens | 2T + 0.1T | 1.4T | 0.6T | 0.3T |
| Sequence length | 8192 | 8192 | 8192 | 32768 -> 131072 |
| Training | ViT | ViT & LLM | ViT & LLM | ViT & LLM |
ViT Training Stages
MoonViT is trained on image-text pairs, where the text components consist of a variety of targets: image alt texts, synthetic captions, grounding bboxes, and OCR texts. The training incorporates two objectives: a SigLIP \parencite{zhai2023sigmoidlosslanguageimage} loss $\mathcal{L}_{siglip}$ (a variant of contrastive loss) and a cross-entropy loss $\mathcal{L}_{caption}$ for caption generation conditioned on input images. Following CoCa's approach \parencite{yu2022cocacontrastivecaptionersimagetext}, the final loss function is formulated as $\mathcal{L}=\mathcal{L}_{siglip}+\lambda\mathcal{L}_{caption}$, where $\lambda=2$. Specifically, the image and text encoders compute the contrastive loss, while the text decoder performs next-token prediction (NTP) conditioned on features from the image encoder. To accelerate training, we initialize both encoders with SigLIP SO-400M \parencite{zhai2023sigmoidlosslanguageimage} weights and implement a progressive resolution sampling strategy that gradually allows larger image sizes; the text decoder is initialized from a tiny decoder-only language model. During training, we observed emergent behavior in the caption loss while scaling up OCR data, indicating that the text decoder had developed some OCR capability. After training the ViT in this CoCa-like stage on 2T tokens, we align MoonViT to the MoE language model using another 0.1T tokens, during which only MoonViT and the MLP projector are updated. This alignment stage significantly reduces the initial perplexity of MoonViT embeddings in the language model, allowing a smoother joint pre-training stage, described next.
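The combined objective can be sketched as follows, with the pairwise sigmoid contrastive loss over matching/non-matching image-text pairs plus NTP cross-entropy for the caption decoder; the embeddings, logits, temperature, and bias below are toy stand-ins:

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    return np.where(x >= 0, -np.log1p(np.exp(-x)), x - np.log1p(np.exp(x)))

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss over all image-text pairs:
    matching pairs (the diagonal) are positives, the rest negatives.
    t and b play the role of SigLIP's learnable temperature and bias."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b
    z = 2 * np.eye(len(img)) - 1          # +1 on the diagonal, -1 elsewhere
    return -log_sigmoid(z * logits).mean()

def caption_loss(token_logits, targets):
    """Next-token-prediction cross-entropy for the caption decoder."""
    logp = token_logits - np.log(np.exp(token_logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
logits, tgts = rng.normal(size=(10, 32)), rng.integers(0, 32, size=10)
lam = 2.0                                  # lambda = 2, as in the report
total = siglip_loss(img, txt) + lam * caption_loss(logits, tgts)
print(total > 0)  # True: both terms are non-negative
```

Weighting the caption term more heavily (lambda = 2) biases training toward the generative decoder, which is consistent with the OCR capability observed to emerge in it.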
Joint Pre-training Stage
In the joint pre-training stage, we train the model with a combination of pure text data (sampled from the same distribution as the initial language model) and a variety of multimodal data (as discussed in Sec. 3.1). We continue training from the loaded LLM checkpoint using the same learning rate scheduler, consuming an additional 1.4T tokens. The initial steps utilize solely language data, after which the proportion of multimodal data gradually increases. Through this progressive approach and the previous alignment stage, we observe that joint pre-training preserves the model’s language capabilities while successfully integrating visual comprehension abilities.
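An illustrative data-mixing schedule in the spirit described here: the report states only that training begins with language data alone and that the multimodal proportion then increases gradually (Figure 4 caps it at 40%); the linear shape, warmup fraction, and constants below are assumptions:

```python
def multimodal_ratio(step: int, total_steps: int,
                     warmup_frac: float = 0.1,
                     max_ratio: float = 0.4) -> float:
    """Hypothetical ramp: text-only for an initial fraction of training,
    then a linear increase in the share of multimodal tokens up to a cap."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return 0.0                      # initial steps use language data only
    frac = (step - warmup) / max(1, total_steps - warmup)
    return min(max_ratio, max_ratio * frac)

sched = [round(multimodal_ratio(s, 100), 2) for s in (0, 10, 55, 100)]
print(sched)  # [0.0, 0.0, 0.2, 0.4]
```

Ramping the ratio rather than switching abruptly is what lets the model absorb visual data without a sudden distribution shift that would degrade its language capabilities.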
Joint Cooldown Stage
The stage following pre-training is a multimodal cooldown phase, in which the model is further trained on high-quality language and multimodal datasets to ensure superior performance. For the language part, empirical investigation shows that incorporating synthetic data during the cooldown phase yields significant performance improvements, particularly in mathematical reasoning, knowledge-based tasks, and code generation. The general text components of the cooldown dataset are curated from high-fidelity subsets of the pre-training corpus. For the math, knowledge, and code domains, we employ a hybrid approach: utilizing selected pre-training subsets while augmenting them with synthetically generated content. Specifically, we leverage existing mathematical knowledge and code corpora as source material to generate question-answer (QA) pairs with a proprietary language model, applying rejection sampling to maintain quality standards \parencite yue2023mammoth,su2024nemotron. These synthesized QA pairs undergo comprehensive validation before being integrated into the cooldown dataset. For the multimodal part, in addition to the two strategies used in text cooldown data preparation (QA synthesis and high-quality subset replay), we filter and rewrite a variety of academic visual and vision-language data sources into QA pairs to enable more comprehensive visual-centric perception and understanding \parencite li2024llavaonevisioneasyvisualtask,tong2024cambrian1fullyopenvisioncentric,guo2024mammothvlelicitingmultimodalreasoning. Unlike in the post-training stages, the language and multimodal QA pairs in the cooldown stage serve only to activate specific abilities and thereby facilitate learning from high-quality data; we therefore keep their ratio low to avoid overfitting to these QA patterns. The joint cooldown stage significantly improves both the language and multimodal abilities of the model.
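The rejection-sampling step can be sketched as follows, where `generate` and `verify` are hypothetical stand-ins for the proprietary generator model and the validation logic:

```python
def rejection_sample(questions, generate, verify, k=4):
    """For each question, draw up to k candidate answers from the generator
    and keep the first one that passes verification; questions for which no
    candidate is accepted are dropped entirely."""
    kept = []
    for q in questions:
        for _ in range(k):
            a = generate(q)
            if verify(q, a):
                kept.append((q, a))
                break
    return kept
```

The accepted pairs would then undergo the further validation described above before entering the cooldown mixture.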
Table 2: Needle-in-a-Haystack (NIAH) test on text/video haystacks, where needles are uniformly distributed at various positions within the haystack. We report recall accuracy across different haystack lengths up to 131,072 tokens (128K).
| Haystack | 2K | 4K | 8K | 16K | 32K | 64K | 128K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| text haystack | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 87.0 |
| video haystack | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 91.7 |
Joint Long-context Activation Stage
In the final pre-training stage, we extend the context length of the model from 8,192 (8K) to 131,072 (128K) tokens, resetting the base frequency of its RoPE \parencite su2023roformerenhancedtransformerrotary embeddings from 50,000 to 800,000. The joint long-context stage is conducted in two sub-stages, each of which extends the model’s context length fourfold. For data composition, we filter long data and upsample its ratio to 25% in each sub-stage, while using the remaining 75% of tokens to replay shorter data from the previous stage; our exploration confirms that this composition allows the model to effectively learn long-context understanding while maintaining its short-context abilities.
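A minimal sketch of the base-frequency change, assuming the standard rotary formulation $\theta_j = \mathrm{base}^{-2j/d}$ (the head dimension of 128 is illustrative):

```python
import numpy as np

def rope_inv_freq(dim, base):
    """Inverse frequencies of rotary position embeddings:
    theta_j = base^(-2j/dim) for j = 0 .. dim/2 - 1."""
    return base ** (-np.arange(0, dim, 2) / dim)

# Raising the base from 50,000 to 800,000 slows the rotation of every
# non-constant frequency component, so positions remain distinguishable
# over the extended 128K window.
old = rope_inv_freq(128, 50_000)
new = rope_inv_freq(128, 800_000)
assert np.all(new[1:] < old[1:])
```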
To enable the model to activate long-context abilities on both pure-text and multimodal inputs, the long data used in Kimi-VL’s long-context activation consists not only of long text, but also of long multimodal data, including long interleaved data, long videos, and long documents. Similar to the cooldown data, we also synthesize a small portion of QA pairs to improve the learning efficiency of long-context activation. After long-context activation, the model passes needle-in-a-haystack (NIAH) evaluations with either a long pure-text or a long video haystack, demonstrating its versatile long-context ability. We report the NIAH recall accuracy across a range of context lengths up to 128K in Table 2.
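The NIAH protocol can be sketched as follows, with token lists standing in for real tokenized haystacks; the helper names are hypothetical:

```python
def needle_positions(haystack_len, n_needles):
    """Evenly spaced insertion depths spanning the full haystack."""
    return [round(i * (haystack_len - 1) / (n_needles - 1))
            for i in range(n_needles)]

def insert_needle(haystack, needle, pos):
    """Splice the needle tokens into the haystack at the given position."""
    return haystack[:pos] + needle + haystack[pos:]

def recall(predictions, answers):
    """Fraction of needles the model retrieved correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

For each haystack length, a needle is inserted at each depth, the model is queried for it, and the per-length recall accuracy is reported as in Table 2.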
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: Kimi-VL Training Pipeline
### Overview
The image depicts a sequential diagram illustrating the training pipeline for Kimi-VL, a multimodal large language model. The pipeline consists of three stages: Joint Supervised Fine-tuning, Long-CoT Supervised Fine-tuning, and Reinforcement Learning (RL). An arrow indicates the flow of training from the first stage to the second, and then to the third. The Kimi-VL logo is present on the right side of the diagram.
### Components/Axes
The diagram is composed of three rectangular blocks, each representing a training stage. Each block contains text describing the stage's methodology and data used. There are no axes in this diagram.
### Content Details
**Block 1: Joint Supervised Fine-tuning**
* **Title:** Joint Supervised Fine-tuning
* **Description:** Text + Multimodal SFT Data
* **Details:** 1 Epoch@32K + 1 Epoch@128K
**Block 2: Long-CoT Supervised Fine-tuning**
* **Title:** Long-CoT Supervised Fine-tuning
* **Description:** Text + Multimodal Long-CoT Data
* **Details:** Planning, Evaluation, Reflection, Exploration
**Block 3: Reinforcement Learning (RL)**
* **Title:** Reinforcement Learning (RL)
* **Description:** Online RL on Answer Only
* **Details:** Length penalty Difficulty control
**Arrow:** A blue arrow connects the first and second blocks, indicating the sequential flow of training.
**Kimi-VL Logo:** The logo "Kimi-VL Thinking" is present on the right side of the diagram, vertically aligned with the three blocks.
### Key Observations
The diagram highlights a three-stage training process. The first stage uses supervised fine-tuning with specific epoch configurations. The second stage focuses on Long-Context (CoT) data and incorporates cognitive processes like planning and reflection. The final stage employs reinforcement learning, focusing on answer quality and controlling for length and difficulty.
### Interpretation
This diagram illustrates a progressive training strategy for Kimi-VL. It begins with standard supervised learning to establish a baseline, then moves to more complex supervised learning incorporating long-context reasoning, and finally refines the model through reinforcement learning to optimize answer quality. The inclusion of "Planning, Evaluation, Reflection, Exploration" in the second stage suggests an attempt to imbue the model with higher-level cognitive abilities. The final RL stage's focus on "Length penalty Difficulty control" indicates a desire to balance answer conciseness with the complexity of the questions. The sequential nature of the pipeline suggests that each stage builds upon the previous one, progressively improving the model's capabilities. The diagram provides a high-level overview of the training process and does not contain specific quantitative data beyond the epoch numbers.
</details>
Figure 5: The post-training stages of Kimi-VL and Kimi-VL-Thinking, including two stages of joint SFT in 32K and 128K context, and further long-CoT SFT and RL stages to activate and enhance long thinking abilities.
2.4 Post-Training Stages
Joint Supervised Fine-tuning (SFT)
In this phase, we fine-tune the base model of Kimi-VL with instruction-based fine-tuning to enhance its ability to follow instructions and engage in dialogue, culminating in the interactive Kimi-VL model. This is achieved by employing the ChatML format (OpenAI, 2024), which allows targeted instruction optimization while maintaining architectural consistency with Kimi-VL. We optimize the language model, MLP projector, and vision encoder using a mixture of pure-text and vision-language SFT data, described in Sec. 3.2. Supervision is applied only to answers and special tokens, with system and user prompts masked. The model is exposed to a curated set of multimodal instruction-response pairs, where explicit dialogue role tagging, structured injection of visual embeddings, and preservation of cross-modal positional relationships are ensured through format-aware packing. Additionally, to guarantee the model’s comprehensive proficiency in dialogue, we incorporate a mix of multimodal data and the pure-text dialogue data used in Moonlight, ensuring versatility across various dialogue scenarios.
We first train the model at a sequence length of 32K tokens for one epoch, followed by another epoch at a sequence length of 128K tokens. In the first (32K) stage, the learning rate decays from $2\times 10^{-5}$ to $2\times 10^{-6}$; it then re-warms up to $1\times 10^{-5}$ in the second (128K) stage and finally decays to $1\times 10^{-6}$. To improve training efficiency, we pack multiple training examples into each single training sequence.
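The schedule can be sketched as follows; the linear decay shape and the instantaneous re-warmup are simplifying assumptions, since the report specifies only the endpoint learning rates:

```python
def sft_lr(step, s1_steps, s2_steps):
    """Learning rate for the two-stage joint SFT: decay from 2e-5 to 2e-6
    over the 32K-context stage (s1_steps), then re-warm to 1e-5 and decay
    to 1e-6 over the 128K-context stage (s2_steps)."""
    if step < s1_steps:                              # stage 1 (32K context)
        frac = step / s1_steps
        return 2e-5 + (2e-6 - 2e-5) * frac
    frac = (step - s1_steps) / s2_steps              # stage 2 (128K context)
    return 1e-5 + (1e-6 - 1e-5) * frac
```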
Long-CoT Supervised Fine-Tuning
With the refined RL prompt set, we employ prompt engineering to construct a small yet high-quality long-CoT warmup dataset, containing accurately verified reasoning paths for both text and image inputs. This approach resembles rejection sampling (RS) but focuses on generating long-CoT reasoning paths through prompt engineering. The resulting warmup dataset is designed to encapsulate key cognitive processes that are fundamental to human-like reasoning, such as planning, where the model systematically outlines steps before execution; evaluation, involving critical assessment of intermediate steps; reflection, enabling the model to reconsider and refine its approach; and exploration, encouraging consideration of alternative solutions. By performing a lightweight SFT on this warm-up dataset, we effectively prime the model to internalize these multimodal reasoning strategies. As a result, the fine-tuned long-CoT model demonstrates improved capability in generating more detailed and logically coherent responses, which enhances its performance across diverse reasoning tasks.
Reinforcement Learning
To further advance the model’s reasoning abilities, we then train the model with reinforcement learning (RL), enabling it to autonomously generate structured CoT rationales. Specifically, similar to Kimi k1.5 \parencite team2025kimi, we adopt a variant of online policy mirror descent as our RL algorithm, which iteratively refines the policy model $\pi_{\theta}$ to improve its problem-solving accuracy. During the $i$-th training iteration, we treat the current model as a reference policy and optimize the following objective, regularized by relative entropy to stabilize policy updates:
$$
\max_{\theta}\,\mathbb{E}_{(x,y^{*})\sim\mathcal{D}}\left[\mathbb{E}_{(y,z)\sim\pi_{\theta}}\left[r(x,y,y^{*})\right]-\tau\,\mathrm{KL}\left(\pi_{\theta}(x)\,\|\,\pi_{\theta_{i}}(x)\right)\right], \tag{1}
$$
where $r$ is a reward model that judges the correctness of the proposed answer $y$ for the given problem $x$, assigning a value $r(x,y,y^{*})\in\{0,1\}$ based on the ground truth $y^{*}$, and $\tau>0$ is a parameter controlling the degree of regularization.
Each training iteration begins by sampling a problem batch from the dataset $\mathcal{D}$; the model parameters are then updated to $\theta_{i+1}$ using the policy gradient derived from Eq. (1), with the optimized policy subsequently serving as the reference policy for the next iteration. To enhance RL training efficiency, we implement a length-based reward that penalizes excessively long responses, mitigating the overthinking problem in which the model generates redundant reasoning chains. In addition, we employ two sampling strategies, curriculum sampling and prioritized sampling, which leverage difficulty labels and per-instance success rates to focus training effort on the most pedagogically valuable examples, thereby optimizing the learning trajectory and improving training efficiency.
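For intuition, the inner objective of Eq. (1) and a length-shaped reward can be sketched for a toy categorical policy over a fixed candidate set; the penalty coefficient is illustrative, not a value from the report:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

def mirror_descent_objective(policy, ref_policy, rewards, tau):
    """Eq. (1) for a toy categorical policy: expected reward under the
    current policy minus a KL penalty anchoring it to the previous iterate."""
    return float(policy @ rewards) - tau * kl(policy, ref_policy)

def shaped_reward(correct, length, max_len, alpha=0.1):
    """Binary correctness reward with a length penalty discouraging
    overthinking; alpha is an illustrative coefficient."""
    return float(correct) - alpha * min(length / max_len, 1.0)
```

The KL term vanishes when the policy matches the reference, so early in an iteration the objective reduces to the expected (length-shaped) reward.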
Through large-scale reinforcement learning training, we can derive a model that harnesses the strengths of both basic prompt-based CoT reasoning and sophisticated planning-enhanced CoT approaches. During inference, the model maintains standard autoregressive sequence generation, eliminating the deployment complexities associated with specialized planning algorithms that require parallel computation. Simultaneously, the model develops essential meta-reasoning abilities including error detection, backtracking, and iterative solution refinement by effectively utilizing the complete history of explored reasoning paths as contextual information. With endogenous learning from its complete reasoning trace history, the model can effectively encode planned search procedures into its parametric knowledge.
2.5 Infrastructure
Storage We utilize S3 \parencite amazon_s3 compatible object storage from cloud service vendors to store our visual-text data. To minimize the time between data preparation and model training, we store visual data in its original format and have developed an efficient and flexible data loading system. This system provides several key benefits:
- Supports on-the-fly data shuffling, mixing, tokenization, loss masking and packing during training, allowing us to adjust data proportions as needed;
- Enables random augmentation of both visual and text data, while preserving the correctness of 2D coordinate and orientation information during transformations;
- Ensures reproducibility by strictly controlling random states and other states across different data loader workers, guaranteeing that any interrupted training can be resumed seamlessly—the data sequence after resumption remains identical to an uninterrupted run;
- Delivers high-performance data loading: through multiple caching strategies, our system reliably supports training on large scale clusters while maintaining controlled request rates and throughput to the object storage.
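The reproducibility property above can be sketched with per-worker RNGs seeded from a base seed and fast-forwarded on resume; the class and its interface are hypothetical, not our actual loader:

```python
import random

class ResumableSampler:
    """Deterministic, resumable sampling sketch: the RNG is seeded from
    (base_seed, worker_id), and on resume the already-consumed draws are
    replayed, so the post-resume index sequence matches an uninterrupted run."""
    def __init__(self, n_items, base_seed, worker_id, start_step=0):
        self.n = n_items
        self.rng = random.Random(hash((base_seed, worker_id)))
        for _ in range(start_step):        # fast-forward past consumed draws
            self.rng.randrange(self.n)

    def next_index(self):
        return self.rng.randrange(self.n)
```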
Additionally, to ensure consistent dataset quality control, we developed a centralized platform for data registration, visualization, compiling statistics, synchronizing data across cloud storage systems, and managing dataset lifecycles.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Handwritten Mathematical Manuscripts: Analysis of Einstein's Relativity Notes
### Overview
The images depict two pages of handwritten mathematical notes, seemingly belonging to Albert Einstein. The notes are densely filled with equations and calculations related to gravitational fields and celestial mechanics, likely pertaining to the theory of general relativity. The handwriting appears consistent across both pages, suggesting they are part of the same document or a closely related series.
### Components/Axes
There are no explicit axes or legends in the traditional sense of a chart or graph. The "components" are the individual equations, symbols, and annotations scattered across the pages. The structure is primarily hierarchical, with main equations branching into sub-calculations and explanatory notes.
### Detailed Analysis or Content Details
**Image 1 (Left):**
This page contains a complex series of equations. The dominant theme appears to be the manipulation of partial derivatives and summations, typical in tensor calculus and field theory.
* **Top Section:** A large equation involving partial derivatives (∂) and summations (Σ). The equation includes terms like "g<sub>μν</sub>" (likely the metric tensor), "Γ<sup>μ</sup><sub>νλ</sub>" (Christoffel symbols), and "x<sup>μ</sup>".
* **Central Section:** Further equations building upon the initial one, with continued use of indices and summations. There are references to "ds<sup>2</sup>" (the spacetime interval).
* **Bottom Section:** More equations and annotations, including what appears to be a derivation or simplification of the previous expressions.
**Image 2 (Right):**
This page focuses on numerical calculations and references to constants.
* **Top Section:** Equations involving constants like "G" (likely the gravitational constant, approximately 6.674 x 10<sup>-11</sup> m<sup>3</sup> kg<sup>-1</sup> s<sup>-2</sup>), "M" (mass), and "T" (time).
* **Central Section:** A series of calculations, potentially related to the bending of light or the precession of planetary orbits.
* **Bottom Section:** Further calculations and annotations, including what appears to be a comparison of theoretical results with observational data.
**German Text Transcription & Translation:**
* **Image 1:** The phrase "Einheitsvektor" is visible.
* **Translation:** "Unit vector"
* **Image 2:** The phrase "Gl<sup>eichung</sup>" is visible.
* **Translation:** "Equation"
**Specific Equations (Approximate Transcription - due to handwriting):**
Due to the complexity and handwriting, precise transcription is difficult. However, some key elements can be identified:
* **Image 1:** Equations involving derivatives of the metric tensor and Christoffel symbols. Expressions resembling the geodesic equation are present.
* **Image 2:** Equations relating gravitational potential to mass and distance. Calculations involving the gravitational constant (G).
### Key Observations
* **Handwriting Consistency:** The handwriting is remarkably consistent across both pages, strongly suggesting a single author.
* **Mathematical Sophistication:** The equations are highly advanced, indicating a deep understanding of differential geometry and tensor calculus.
* **Focus on General Relativity:** The presence of the metric tensor, Christoffel symbols, and spacetime interval strongly suggests that the notes are related to Einstein's theory of general relativity.
* **German Annotations:** The inclusion of German terms ("Einheitsvektor", "Gl<sup>eichung</sup>") suggests that Einstein was either writing these notes in German or was making annotations in his native language.
* **Numerical Calculations:** The second page's emphasis on numerical calculations suggests an attempt to apply the theoretical framework to specific physical problems.
### Interpretation
The manuscripts are almost certainly original notes from Albert Einstein, likely related to his work on general relativity. The first page appears to be a theoretical derivation of equations governing gravitational fields, while the second page represents an attempt to apply these equations to concrete calculations. The presence of the gravitational constant and references to mass and time suggest that Einstein was exploring the relationship between gravity, spacetime, and matter.
The combination of theoretical derivations and numerical calculations indicates a holistic approach to scientific inquiry. Einstein was not only developing the mathematical framework of general relativity but also actively testing its predictions against observational data.
The German annotations suggest that Einstein was comfortable working in both German and potentially other languages, and that he may have been using German as a tool for clarifying his thoughts or making personal notes.
The overall impression is that these manuscripts provide a rare glimpse into the mind of one of the greatest physicists of all time, revealing his thought process and his relentless pursuit of understanding the universe. The notes are not merely a collection of equations but a testament to Einstein's intellectual curiosity and his unwavering commitment to scientific discovery.
**Important Considerations:**
* **Handwriting Difficulty:** The handwriting is challenging to decipher, and some transcriptions may be inaccurate.
* **Contextual Knowledge:** A deeper understanding of general relativity and tensor calculus would be necessary to fully interpret the equations.
* **Image Quality:** The image quality limits the ability to discern fine details.
</details>
Figure 6: Manuscript reasoning visualization. Kimi-VL-Thinking demonstrates the ability to perform historical and scientific inference by analyzing handwritten manuscripts step by step. In this example, our model identifies the author as Albert Einstein based on handwriting style, content analysis, and language cues. It reasons that the manuscripts relate to gravitational field equations, consistent with Einstein’s contributions to general relativity.
Parallelism We adopt a 4D parallelism strategy—Data Parallelism \parencite li2020pytorchdistributedexperiencesaccelerating, Expert Parallelism \parencite fedus2022switchtransformersscalingtrillion, Pipeline Parallelism \parencite huang2019gpipeefficienttraininggiant,narayanan2021efficientlargescalelanguagemodel, and Context Parallelism \parencite jacobs2023deepspeedulyssesoptimizationsenabling,liu2023ringattentionblockwisetransformers—to accelerate the training of Kimi-VL. After optimizing the parallel strategies, the resulting training throughput of our model is around 60% higher than that of a 7B dense VLM (e.g., VLMs based on Qwen2.5-7B).
- Data Parallelism (DP). DP replicates the model across multiple devices, each processing different micro-batches. This setup allows larger effective batch sizes by simply increasing the number of devices.
- Expert Parallelism (EP). EP distributes expert modules in the MoE layer across multiple devices. When combined with DP, experts on a given device can handle tokens from different DP groups, enhancing computational efficiency.
- Pipeline Parallelism (PP). PP splits the model into multiple layer-based stages. To minimize pipeline bubbles, we allocate the Vision Tower (VT) and several decoder layers to the first stage, place the output layer and additional decoder layers in the last stage, and distribute the remaining decoder layers evenly across intermediate stages based on their time overhead.
- Context Parallelism (CP). CP addresses long-sequence training by splitting sequences across different CP ranks in conjunction with flash attention \parencite dao2022flashattentionfastmemoryefficientexact. This substantially reduces peak memory usage and relieves the memory pressure from attention computations.
Beyond these four parallel strategies, we incorporate ZeRO1 \parencite rajbhandari2020zero and Selective Checkpointing Activation \parencite chen2016trainingdeepnetssublinear, korthikanti2022reducingactivationrecomputationlarge to further optimize memory usage. ZeRO1 reduces optimizer state overhead by using a distributed optimizer while avoiding extra communication costs. Selective Checkpointing Activation trades time for space by recomputing only those layers that have low time overhead but high memory consumption, striking a balance between computation efficiency and memory demands. For extremely long sequences, we expand recomputation to a broader set of layers to prevent out-of-memory errors.
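One way to choose recomputation targets is a greedy heuristic over per-layer cost profiles; this sketch is an assumption about how such a selection could work, not our production implementation:

```python
def layers_to_recompute(layers, mem_budget):
    """Greedy sketch of selective activation checkpointing: starting from
    storing all activations, mark layers for recomputation (cheapest
    recompute time per unit of memory saved first) until activation memory
    fits the budget. `layers` holds hypothetical
    (name, activation_mem, recompute_time) tuples."""
    chosen = []
    mem_used = sum(mem for _, mem, _ in layers)
    for name, mem, t in sorted(layers, key=lambda x: x[2] / x[1]):
        if mem_used <= mem_budget:
            break
        chosen.append(name)      # recompute this layer instead of storing it
        mem_used -= mem
    return chosen
```

With a generous budget nothing is recomputed; as the budget shrinks (e.g., for extremely long sequences), recomputation expands to more layers, mirroring the trade-off described above.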
3 Data Construction
3.1 Pre-Training Data
Our multimodal pre-training corpus is designed to provide high-quality data that enables models to process and understand information from multiple modalities, including text, images, and videos. To this end, we have also curated high-quality data from six categories – caption, interleaving, OCR, knowledge, video, and agent – to form the corpus.
When constructing our training corpus, we developed several multimodal data processing pipelines to ensure data quality, encompassing filtering, synthesis, and deduplication. Establishing an effective multimodal data strategy is crucial during the joint training of vision and language, as it both preserves the capabilities of the language model and facilitates alignment of knowledge across diverse modalities.
We provide a detailed description of these sources in this section, which is organized into the following categories:
Caption Data
Our caption data provides the model with fundamental modality alignment and a broad range of world knowledge. By incorporating caption data, the multimodal LLM gains wider world knowledge with high learning efficiency. We have integrated various open-source Chinese and English caption datasets like \parencite schuhmann2022laion, gadre2024datacomp and also collected substantial in-house caption data from multiple sources. However, throughout the training process, we strictly limit the proportion of synthetic caption data to mitigate the risk of hallucination stemming from insufficient real-world knowledge.
For general caption data, we follow a rigorous quality control pipeline that avoids duplication and maintains high image-text correlation. We also vary image resolution during pre-training so that the vision tower remains effective when processing images of both high and low resolution.
Image-text Interleaving Data During the pre-training phase, the model benefits from interleaving data in many respects: multi-image comprehension can be boosted by interleaving data; interleaving data often provides detailed knowledge about the accompanying images; and the ability to learn from longer multimodal contexts can also be gained from it. Moreover, we find that interleaving data contributes positively to maintaining the model’s language abilities. Image-text interleaving data is therefore an important part of our training corpus. Our multimodal corpus incorporates open-source interleaved datasets like \parencite zhu2024multimodal,laurenccon2024obelics as well as large-scale in-house data constructed from resources such as textbooks, webpages, and tutorials. Furthermore, we find that synthesizing interleaving data benefits the multimodal LLM’s retention of text knowledge. To ensure each image’s knowledge is sufficiently learned, all interleaving data passes not only through standard filtering, deduplication, and other quality-control pipelines, but also through a data reordering procedure that keeps all images and text in the correct order.
OCR Data Optical Character Recognition (OCR) is a widely adopted technique that converts text in images into an editable format. In our model, a robust OCR capability is deemed essential for better aligning the model with human values. Accordingly, our OCR data sources are diverse, ranging from open-source to in-house datasets, encompassing both clean and augmented images, and spanning single-page and multi-page inputs.
In addition to the publicly available data, we have developed a substantial volume of in-house OCR datasets, covering multilingual text, dense text layouts, web-based content, and handwritten samples. Furthermore, following the principles outlined in OCR 2.0 \parencite wei2024general, our model is also equipped to handle a variety of optical image types, including figures, tables, geometry diagrams, mermaid plots, and natural scene text. We apply extensive data augmentation techniques—such as rotation, distortion, color adjustments, and noise addition—to enhance the model’s robustness. As a result, our model achieves a high level of proficiency in OCR tasks.
In addition to single-page OCR data, we collect and convert a large volume of in-house multi-page OCR data to activate the model’s understanding of long documents in the real world. With the help of these data, our model is not only capable of performing accurate OCR on a single image but can also comprehend an entire academic paper or a scanned book.
Knowledge Data The concept of multimodal knowledge data is analogous to the previously mentioned text pre-training data, except here we focus on assembling a comprehensive repository of human knowledge from diverse sources to further enhance the model’s capabilities. For example, carefully curated geometry data in our dataset is vital for developing visual reasoning skills, ensuring the model can interpret the abstract diagrams created by humans.
Our knowledge corpus adheres to a standardized taxonomy to balance content across various categories, ensuring diversity in data sources. Similar to text-only corpora, which gather knowledge from textbooks, research papers, and other academic materials, our multimodal knowledge data employs both a layout parser and an OCR model to process content from these sources. We also include filtered data from internet-based and other external resources.
Because a significant portion of our knowledge corpus is sourced from internet-based materials, infographics can cause the model to focus solely on OCR-based information. In such cases, relying exclusively on a basic OCR pipeline may limit training effectiveness. To address this, we have developed an additional pipeline that better captures the purely textual information embedded within images.
Agent Data For agent tasks, we significantly enhanced the model’s grounding and planning capabilities. In addition to utilizing publicly available data, we established a platform to efficiently manage and execute virtual machine environments in bulk. Within these virtual environments, heuristic methods were employed to collect screenshots and corresponding action data, which were then processed into dense grounding formats and continuous trajectory formats. The action space design was categorized by Desktop, Mobile, and Web environments. Furthermore, icon data was collected to strengthen the model’s understanding of the meanings of icons within software graphical user interfaces (GUIs). To enhance the model’s planning ability for solving multi-step desktop tasks, a set of computer-use trajectories was collected from human annotators, each accompanied by a synthesized Chain-of-Thought (following Aguvis \parencite xu2024aguvis). These multi-step agent demonstrations equip Kimi-VL with the capability to complete real-world desktop tasks (on both Ubuntu and Windows).
Video Data In addition to image-only and image-text interleaved data, we also incorporate large-scale video data during the pre-training, cooldown, and long-context activation stages to enable two essential abilities of our model: first, understanding long-context sequences dominated by images (e.g., hour-long videos) in addition to long text; second, perceiving fine-grained spatio-temporal correspondence in short video clips.
Our video data are sourced from diverse resources, including open-source datasets as well as in-house web-scale video data, and span videos of varying durations. Similarly, to ensure sufficient generalization ability, our video data cover a wide range of scenes and diverse tasks. We cover tasks such as video description and video grounding, among others. For long videos, we carefully design a pipeline to produce dense captions. Similar to processing the caption data, we strictly limit the proportion of the synthetic dense video description data to reduce the risk of hallucinations.
Text Data Our text pre-training corpus directly utilizes the data from Moonlight \parencite liu2025muonscalablellmtraining, which is designed to provide comprehensive and high-quality data for training large language models (LLMs). It encompasses five domains: English, Chinese, Code, Mathematics & Reasoning, and Knowledge. We employ sophisticated filtering and quality-control mechanisms for each domain to ensure the highest-quality training data. For all pre-training data, we conducted rigorous individual validation of each data source to assess its specific contribution to the overall training recipe; this systematic evaluation ensures the quality and effectiveness of our diverse data composition. To optimize the overall composition of our training corpus, the sampling strategy for different document types is determined empirically through extensive experimentation. We conduct isolated evaluations to identify document subsets that contribute most significantly to the model’s knowledge acquisition, and these high-value subsets are upsampled in the final training corpus. However, to maintain data diversity and ensure model generalization, we carefully preserve a balanced representation of other document types at appropriate ratios. This data-driven approach helps us optimize the trade-off between focused knowledge acquisition and broad generalization.
* GPT-4o and GPT-4o-mini results use Omniparser without UIA, according to \parencite bonatti2024windowsagentarenaevaluating.
| | Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Architecture | - | - | Dense | Dense | Dense | MoE | MoE |
| | # Act. Params ${}_{\text{(LLM+VT)}}$ | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| | # Total Params | - | - | 8B | 11B | 12B | 28B | 16B |
| College-level | MMMU ${}_{\text{val}}$ (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| | VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
| | MMVU ${}_{\text{val}}$ (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| General | MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
| | MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
| | MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
| | RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
| | AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
| Multi-image | BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
| Math | MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
| | MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
| OCR | InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
| | OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
| OS Agent | ScreenSpot-V2 (Acc) | 18.1 | - | 86.8 | - | - | - | 92.8 |
| | ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
| | OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
| | WindowsAgentArena (Pass@1)† | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
| Long Document | MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
| Long Video | Video-MME (w/o sub. / w/ sub.) | 71.9/77.2 | 64.8/68.9 | 65.1/71.6 | 46.0/49.5 | 58.2/62.1 | - | 67.8/72.6 |
| | MLVU ${}_{\text{MCQ}}$ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
| | LongVideoBench ${}_{\text{val}}$ | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
| Video Perception | EgoSchema ${}_{\text{full}}$ | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
| | VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
| | TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |
Table 3: Performance of Kimi-VL against proprietary and open-source efficient VLMs; the performance of GPT-4o is also listed in gray for reference. The top and second-best models are shown in boldface and underline, respectively. Some results for competing models are unavailable due to limitations in model capability on specific tasks or in model context length.
3.2 Instruction Data
At this stage, the data is primarily aimed at enhancing the model’s conversational abilities and instruction-following capabilities. To cover as many scenarios as possible, we enrich the data across different domains. For non-reasoning tasks, including chart interpretation, agent grounding, OCR, image-grounded conversations, question-answering, writing, and text processing, we initially construct a seed dataset through human annotation. This seed dataset is used to train a seed model. Subsequently, we collect a diverse set of prompts and employ the seed model to generate multiple responses to each prompt. Annotators then rank these responses and refine the top-ranked response to produce the final version. For reasoning tasks like visual coding, visual reasoning, and math/science problems, where rule-based and model-based verifications are more accurate and efficient than human judgment, we utilize rejection sampling to expand the SFT dataset. The complete vanilla SFT dataset comprises approximately a 1:1 ratio of text tokens to image tokens.
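For the verifiable reasoning tasks above, the rejection-sampling expansion can be sketched as follows; `seed_model` and `verify` are hypothetical stand-ins for the actual seed model and the rule-/model-based verifiers described in the text.

```python
import itertools

def rejection_sample_sft(prompts, seed_model, verify, num_candidates=8):
    """Expand the SFT set for verifiable tasks: sample several candidate
    responses per prompt and keep only ones that pass verification."""
    dataset = []
    for prompt, ground_truth in prompts:
        for _ in range(num_candidates):
            response = seed_model(prompt)
            if verify(response, ground_truth):  # rule- or model-based check
                dataset.append({"prompt": prompt, "response": response})
                break  # one verified response per prompt is enough here
    return dataset

# Toy usage: a stand-in "seed model" that sometimes answers correctly.
answers = itertools.cycle(["3", "4"])
data = rejection_sample_sft(
    prompts=[("2+2=?", "4")],
    seed_model=lambda prompt: next(answers),
    verify=lambda response, truth: response == truth,
)
```

The key property is that incorrect candidates are discarded automatically, so the expanded dataset only contains responses a verifier has accepted.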
<details>
<summary>x7.png Details</summary>

### Visual Description
## Image 1: Urban Area Comparison
### Overview
The image presents a 2x2 grid of aerial photographs of urban areas. The task is to identify which of the four sub-images matches the same location as the first image.
### Components/Axes
There are no axes or legends in this image. It consists solely of four aerial photographs.
### Detailed Analysis or Content Details
* **Image 1 (Top-Left):** Depicts a dense urban area with a mix of buildings, green spaces, and a distinctive circular structure (possibly a dome or observatory).
* **Image 2 (Top-Right):** Shows a suburban area with larger plots and fewer buildings, unlike the compact urban setting of Image 1.
* **Image 3 (Bottom-Left):** Features a central courtyard and a circular structure resembling the one in Image 1.
* **Image 4 (Bottom-Right):** Displays a similarly dense urban environment with a large building complex, featuring a central courtyard and a circular structure.
### Key Observations
The primary task is to visually compare the images and identify similarities in urban density, building types, and the presence of the circular structure.
### Interpretation
Based on the visual comparison, Image 4 is the most likely candidate to be the same location as Image 1. This is because it matches the urban density, building types, and the circular structure seen in Image 1.
**Answer:** The fourth sub-picture (Image 4) is in the same place as the first picture.
---
## Image 2: Dome Building Identification
### Overview
The image shows a photograph of a large dome-shaped building in an urban setting. The task is to identify the building.
### Components/Axes
There are no axes or legends in this image. It consists solely of a photograph.
### Detailed Analysis or Content Details
The image features a large, modern dome-shaped building with a retractable roof. The building is situated in a city skyline, with the CN Tower visible in the background.
### Key Observations
The distinctive retractable roof and the presence of the CN Tower are key identifying features.
### Interpretation
The dome building in the image is the Rogers Centre, a multi-purpose stadium in Toronto, Canada. It is recognizable by its distinctive retractable roof and is a landmark in the city's skyline, often visible alongside the CN Tower. The Rogers Centre hosts various events, including sports games, concerts, and conventions.
---
## Image 3: Cyberpunk 2077 Location
### Overview
The image shows a screenshot from a video game environment. The task is to identify the game and the location within the game.
### Components/Axes
There are no axes or legends in this image. It consists solely of a screenshot.
### Detailed Analysis or Content Details
The screenshot depicts a futuristic bar or club environment with neon lights, holographic displays, and characters with cybernetic enhancements. A HUD element is visible, displaying text ("Sit next to Jackie?").
### Key Observations
The visual style, including the neon lights, cybernetic enhancements, and futuristic setting, are characteristic of the Cyberpunk genre.
### Interpretation
You are in *Cyberpunk 2077*, an open-world action role-playing game set in Night City (year 2077). The image shows a futuristic bar or club within the game's cyberpunk-themed environment, likely a mission or social-interaction location, since the prompt "Sit next to Jackie?" is visible in the HUD. The setting features neon lights, holographic displays, and characters with cybernetic enhancements, typical of the game's aesthetic.
</details>
Figure 7: Kimi-VL exhibits strong visual reasoning capabilities by grounding visual content in spatial, contextual, and cultural knowledge. It accurately identifies matching urban locations based on structural and layout features, interprets scenes from video games like Cyberpunk 2077 using stylistic cues, and recognizes real-world landmarks such as the Rogers Centre in Toronto.
3.3 Reasoning Data
Our reasoning data is meticulously constructed to activate and enhance the model's multimodal reasoning capabilities during both the long-CoT supervised fine-tuning and reinforcement learning stages. By developing a generation pipeline that combines rejection sampling (RS) with prompt engineering, we collect and synthesize a substantial amount of high-quality long-CoT data. Specifically, we first assemble a collection of QA data with ground-truth annotations that require multi-step reasoning, such as mathematical problem-solving and domain-specific VQA. We then sample multiple detailed reasoning trajectories for each question by prompting a powerful long-CoT model, Kimi k1.5 [team2025kimi], with curated reasoning prompts. In rejection sampling, we feed the true labels and model predictions into an off-the-shelf reward model for judgment. Incorrect chain-of-thought responses are filtered out according to this model evaluation as well as rule-based rewards, thus improving the quality of the reasoning data.
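The filtering step can be sketched as below; `reward_model` and the rule-based answer check are hypothetical stand-ins for the off-the-shelf judge and the rule-based rewards described above.

```python
def filter_cot_traces(samples, reward_model, min_score=0.5):
    """Keep only chain-of-thought trajectories that pass both a
    rule-based answer check and a reward-model judgment."""
    kept = []
    for s in samples:
        # Rule-based reward: the final answer must match the label.
        if s["final_answer"] != s["label"]:
            continue
        # Model-based reward: judge the full reasoning trace.
        if reward_model(s["question"], s["cot"], s["label"]) >= min_score:
            kept.append(s)
    return kept

samples = [
    {"question": "q1", "cot": "step ...", "final_answer": "A", "label": "A"},
    {"question": "q2", "cot": "step ...", "final_answer": "B", "label": "C"},
]
judge = lambda question, cot, label: 0.9  # hypothetical reward model
clean = filter_cot_traces(samples, judge)
# only the first sample survives: its answer matches and the judge accepts it
```

Combining both checks means a trace is kept only when its final answer is correct and the judge also considers the reasoning itself sound.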
4 Evaluation
We begin by presenting an overall comparison of our model against leading state-of-the-art (SoTA) solutions. We then assess various sub-capabilities of the model through detailed performance evaluations, examining how effectively it handles different tasks and scenarios and providing insights into its strengths and limitations across diverse functional domains.
4.1 Comparison to the State-of-the-Art Models
Table 3 presents a comprehensive evaluation of Kimi-VL against state-of-the-art vision-language models across multiple benchmarks. Despite having a more parameter-efficient architecture (2.8B+0.4B activated parameters) than larger models such as GPT-4o, Llama-3.2-11B-Inst., and Gemma3-12B-IT, Kimi-VL demonstrates competitive or superior performance in several key areas. Our model employs a Mixture-of-Experts (MoE) architecture similar to DeepSeek-VL2, but outperforms it on most benchmarks with significantly fewer parameters (activated: 2.8B vs 4.5B; total: 16B vs 28B); it also outperforms Qwen2.5-VL-7B (actually 8.3B) on 19 out of 24 benchmarks, even though the latter has 2.59× more activated parameters (8.3B vs 3.2B, counting the vision encoder). The following sections analyze performance across specific domains, revealing Kimi-VL's strengths in OCR, math, agents, long-form content understanding, multi-image understanding, and video perception.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Geometry Problem: Angle Calculation
### Overview
The image presents a geometry problem involving a circle with center O, a diameter AB, a point C on the circle, and a point D. The problem states that angle ∠D = 62° and asks for the measure of angle ∠ACO. The image includes a diagram illustrating the described geometric setup and a step-by-step solution in Chinese.
### Components/Axes
The image contains:
* A circle with center labeled 'O'.
* A diameter AB.
* A point C on the circumference of the circle.
* A point D on the circumference of the circle.
* Angle ∠D labeled as 62°.
* Angle ∠ACO, which is the target to be calculated.
* Text in Chinese explaining the solution steps.
* Multiple choice answers: A. 26°, B. 28°, C. 30°, D. 32°
### Content Details
The solution steps, transcribed from the image (Chinese with English translation):
1. **分析与条件:** (Analysis and Conditions)
* AB是直径,所以∠ACB = 90° (圆周角定理)。 (AB is a diameter, therefore ∠ACB = 90° (property of inscribed angles)).
* ∠D = 62°.
2. **利用圆周角定理:** (Using the Inscribed Angle Theorem)
* ∠AOC是圆心角,等于2倍的圆周角∠D。 (∠AOC is a central angle, equal to twice the inscribed angle ∠D).
* ∠AOC = 2 × 62° = 124°.
3. **计算∠ACO:** (Calculating ∠ACO)
* ∠ACO是三角形AOC的内角。 (∠ACO is an interior angle of triangle AOC).
* ∠AOC + ∠CAO + ∠ACO = 180°.
* 因为OA=OC (半径相等), 所以∠CAO = ∠ACO。 (Because OA=OC (radii are equal), therefore ∠CAO = ∠ACO).
* 设∠ACO = x, 则: (Let ∠ACO = x, then:)
* x + x + 124° = 180°
* 2x + 124° = 180°
* 2x = 56°
* x = 28°
因此,∠ACO的度数为28°。答案是B。 (Therefore, the measure of ∠ACO is 28°. The answer is B.)
### Key Observations
* The problem leverages fundamental geometric principles: the property of angles inscribed in a semicircle (90°), the relationship between central and inscribed angles, and the properties of isosceles triangles (equal sides imply equal angles).
* The solution is presented in a clear, step-by-step manner.
* The final answer is 28°, corresponding to option B.
### Interpretation
The problem demonstrates a typical application of geometric theorems to solve for unknown angles within a circle. The solution methodically applies the given information and established rules to deduce the value of ∠ACO. The use of the inscribed angle theorem and the properties of isosceles triangles are crucial to the solution. The problem is well-structured and provides a clear path to the correct answer. The inclusion of multiple-choice options suggests this is likely a practice question or part of an assessment. The problem is designed to test understanding of basic geometric concepts and problem-solving skills. The diagram is a visual aid to help understand the relationships between the angles and segments. The Chinese text provides a complete and detailed explanation of the solution process.
</details>
Figure 8: Kimi-VL demonstrates its capability to perform symbolic reasoning and geometric inference by solving a circle geometry problem step by step. The model analyzes given conditions, applies geometric theorems such as the inscribed angle theorem and properties of triangle angles, and accurately derives the target angle.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Data Table: Sparkling Smile Clinic Data Analysis
### Overview
This is a data table presenting statistical information related to a "Sparkling Smile Clinic". The table appears to categorize data by various treatment types and provides numerical values for different metrics.
### Components/Axes
* **Rows:** Represent different treatment types: "Total of Patients", "Initial Consultation", "Orthodontic Examination", "Scaling and Root Planing", "Teeth Whitening", "Veneers", "Implants", "Root Canal Treatment", "Crowns", "Bridges", "Dentures (Complete)", "Dentures (Partial)", "Other Procedures".
* **Columns:** Represent years: 2018, 2019, 2020, 2021, 2022, 2023.
* **Values:** Numerical data representing the count of procedures performed for each treatment type in each year. The values are presented as integers.
### Detailed Analysis / Content Details
Here's a reconstruction of the data table. Note that some values are difficult to read with certainty due to image quality. I will indicate uncertainty with "≈".
| Treatment Type | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
| -------------------------- | ---- | ---- | ---- | ---- | ---- | ---- |
| Total of Patients | 265 | 280 | 235 | 250 | 275 | 300 |
| Initial Consultation | 85 | 90 | 70 | 75 | 80 | 95 |
| Orthodontic Examination | 45 | 50 | 40 | 42 | 45 | 50 |
| Scaling and Root Planing | 60 | 65 | 50 | 55 | 60 | 70 |
| Teeth Whitening | 30 | 35 | 25 | 30 | 35 | 40 |
| Veneers | 15 | 20 | 10 | 12 | 15 | 20 |
| Implants | 10 | 12 | 8 | 10 | 12 | 15 |
| Root Canal Treatment | 25 | 30 | 20 | 22 | 25 | 30 |
| Crowns | 40 | 45 | 35 | 38 | 40 | 45 |
| Bridges | 10 | 12 | 8 | 10 | 12 | 15 |
| Dentures (Complete) | 5 | 6 | 4 | 5 | 6 | 7 |
| Dentures (Partial) | 5 | 6 | 4 | 5 | 6 | 7 |
| Other Procedures | 20 | 22 | 18 | 20 | 22 | 25 |
### Key Observations
* The "Total of Patients" generally increased from 2018 to 2023, with a dip in 2020.
* Most treatment types show a similar trend of increase from 2018-2019, a decrease in 2020, and then a recovery and increase in 2021-2023.
* "Initial Consultation" and "Scaling and Root Planing" consistently have higher numbers compared to other procedures.
* "Dentures (Complete)" and "Dentures (Partial)" have the lowest numbers across all years.
---
## Mathematical Formula
### Overview
This section presents a mathematical formula, likely related to statistical analysis or modeling.
### Components/Axes
The formula is a single equation with several variables and mathematical operations.
### Detailed Analysis / Content Details
The formula is:
```
x_{i+1} = (1 - α)x_i + α[μ_i + √(1 - α)σ_i]
= √(1 - α) * [x_i + σ_i * (1 - α) / α]
= √(1 - α) * [x_i + (1 - α) / α * σ_i]
```
Where:
* `x_{i+1}` represents a value at time step i+1.
* `x_i` represents a value at time step i.
* `α` (alpha) is a parameter.
* `μ_i` (mu_i) represents a mean value at time step i.
* `σ_i` (sigma_i) represents a standard deviation at time step i.
### Key Observations
The formula appears to be a recursive equation, updating a value `x` based on its previous value, a mean, and a standard deviation. The `√(1 - α)` term acts as a scaling factor.
---
## Chinese Text Block
### Overview
This section contains a block of Chinese text accompanied by a small image of a person.
### Components/Axes
* **Text:** A paragraph of Chinese characters.
* **Image:** A small portrait of a person.
### Detailed Analysis / Content Details
The Chinese text is:
```
请您耐心听我把事情说清楚,我是一个小小的创业者,我没有很多资金,我也没有很多资源,我只是想通过自己的努力,为社会创造一点价值。
我希望大家能够理解我,支持我,给我一个机会,让我能够证明自己。
我是一个有梦想的人,我希望我的梦想能够实现。
我希望我的事业能够成功。
我希望我的家人能够幸福。
我希望我的朋友能够快乐。
我希望我的国家能够繁荣昌盛。
我希望我的世界能够和平安宁。
谢谢大家。
(鼓掌)
```
**English Translation:**
"Please be patient and let me explain things clearly. I am a small entrepreneur. I don't have much capital, and I don't have many resources. I just want to create some value for society through my own efforts.
I hope everyone can understand and support me, and give me a chance to prove myself.
I am a person with dreams, and I hope my dreams can come true.
I hope my career will be successful.
I hope my family will be happy.
I hope my friends will be joyful.
I hope my country will prosper.
I hope my world will be peaceful and serene.
Thank you everyone.
(Applause)"
### Key Observations
The text is a heartfelt plea from an entrepreneur seeking support and understanding. It expresses a desire to contribute to society and achieve personal and collective well-being.
### Interpretation
The combination of the data table, the formula, and the Chinese text suggests a narrative. The data table likely represents the performance of the "Sparkling Smile Clinic" over time, potentially used to justify investment or demonstrate growth. The formula could be a model used to predict future performance or optimize resource allocation. The Chinese text adds a human element, representing the passion and drive of the clinic's founder or a key stakeholder. The entrepreneur's plea for support resonates with the need to demonstrate the clinic's value and potential, as reflected in the data and modeled by the formula. The overall message is one of ambition, perseverance, and a desire to make a positive impact.
</details>
Figure 9: Diverse OCR visualization. Kimi-VL demonstrates strong OCR capabilities across varied content types, including structured financial tables, complex mathematical formulas, and handwritten Chinese text. The model accurately parses tabular data into markdown, converts formulas to LaTeX, and transcribes handwritten paragraphs with contextual understanding, showcasing its versatility in multimodal text extraction and interpretation.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Tutorial: Enabling "Do Not Track" in Chrome
### Overview
The image presents a step-by-step tutorial on how to enable the "Do Not Track" feature in the Google Chrome web browser. It consists of nine screenshots of the Chrome browser interface, each accompanied by textual descriptions of the actions and thought processes involved in navigating the settings. The screenshots are arranged in a vertical sequence, numbered from Step 1 to Step 9. The tutorial also includes "Toolcall" data, presumably representing automated interaction logs.
### Components/Axes
The image is structured around the following components:
* **Step Number:** Numerical identifier for each step (1-9). Located in the top-left corner of each screenshot.
* **Screenshot:** A visual representation of the Chrome browser interface at each step.
* **Thought:** A textual description of the user's reasoning and observations during each step.
* **Action:** A concise description of the user's action performed in each step.
* **Toolcall:** Data representing automated interaction logs, including click coordinates (x, y) and confidence scores.
### Detailed Analysis or Content Details
**Step 1:**
* **Thought:** "To enable the 'Do Not Track' feature in Chrome, I need to access Chrome's settings. Looking at the interface, I can see the vertical ellipsis menu button in the top-right corner of the browser window. This menu button is the standard way to access Chrome's settings and options."
* **Action:** "Click on the vertical ellipsis menu button in the top-right corner of the Chrome window to open the browser menu."
* **Toolcall:** `click(x=0.884, y=0.144)`
**Step 2:**
* **Thought:** "The next logical step would be to access Chrome's settings, where privacy options like 'Do Not Track' are typically located. Looking at the menu that's currently open, I can see the 'Settings' option at the bottom of the menu. This would be the most direct way to access the settings page."
* **Action:** "Click on the 'Settings' option in the Chrome menu to access the settings page."
* **Toolcall:** `click(x=0.744, y=0.831)`
**Step 3:**
* **Thought:** "Now, I need to find where the 'Do Not Track' setting is located. Based on my knowledge of browser privacy settings, it's typically found in the Privacy and security section. I can see that sidebar menu on the left side of the settings page."
* **Action:** "Click on the 'Privacy and security' option in the sidebar menu to access the privacy settings."
* **Toolcall:** `click(x=0.215, y=0.669)`
**Step 4:**
* **Thought:** "Within the 'Privacy and security' settings, I need to locate the 'Do Not Track' setting. I can see a section labeled 'Privacy and security' with several options. I'll look for an option specifically related to 'Do Not Track'."
* **Action:** "Scroll down the 'Privacy and security' page to find the 'Do Not Track' setting."
* **Toolcall:** `scroll(y=0.651)`
**Step 5:**
* **Thought:** "I've found the 'Do Not Track' setting. It's currently set to 'Off'. To enable it, I need to toggle the switch next to it. The switch is currently in the 'Off' position, indicated by its gray color."
* **Action:** "Click on the toggle switch next to the 'Do Not Track' setting to enable it."
* **Toolcall:** `click(x=0.647, y=0.522)`
**Step 6:**
* **Thought:** "I notice that we're currently on the 'Privacy and security' settings page, which is where the 'Do Not Track' setting would be located. To confirm that 'Do Not Track' is enabled, I need to navigate to the appropriate settings section. I can see the 'Privacy and security' option in the left sidebar menu to expand its submenu."
* **Action:** "Click the 'Privacy and security' option in the left sidebar menu to expand its submenu."
* **Toolcall:** `click(x=0.365, y=0.317)`
**Step 7:**
* **Thought:** "I can see that 'Security' is one of the options in the Privacy and security section. This makes sense as the 'Do Not Track' is a security feature that helps protect user privacy by preventing websites from tracking browsing activity."
* **Action:** "Click on the 'Security' option in the Privacy and security section to access more security settings."
* **Toolcall:** `click(x=0.522, y=0.884)`
**Step 8:**
* **Thought:** "Since the current page doesn't show the 'Do Not Track' setting, the logical next action would be to navigate back to the 'Privacy and security' page. I can see that the browser allows the user to navigate back to the previous settings page."
* **Action:** "Click on the back button in the browser to return to the 'Privacy and security' settings."
* **Toolcall:** `back()`
**Step 9:**
* **Thought:** "I'm now back on the 'Privacy and security' settings page. I can confirm that the 'Do Not Track' setting is now enabled. The toggle switch next to it is now blue, indicating that it's in the 'On' position."
* **Action:** "Confirm that the 'Do Not Track' setting is enabled."
* **Toolcall:** `check(x=0.647, y=0.522)`
### Key Observations
* The tutorial demonstrates a clear and logical sequence of steps to enable the "Do Not Track" feature.
* The "Toolcall" data provides precise interaction logs, including click coordinates and scroll actions.
* The "Thought" sections offer valuable insights into the user's reasoning and decision-making process.
* The tutorial highlights the importance of navigating through the Chrome settings to access privacy options.
* The color change of the toggle switch (gray to blue) visually confirms the successful enabling of the "Do Not Track" feature.
### Interpretation
The image provides a detailed walkthrough of a common privacy-enhancing feature in a popular web browser. It showcases how a user might interact with the browser's interface to adjust their privacy settings. The inclusion of "Toolcall" data suggests that this tutorial could be used to train an automated agent to perform the same task. The "Thought" sections are particularly valuable as they provide a cognitive model of the user's behavior, which can be used to improve the usability of the browser interface or to develop more effective privacy education materials. The tutorial emphasizes the user's agency in controlling their online privacy, but also implicitly acknowledges the complexity of navigating browser settings to achieve this control. The repeated navigation between settings pages (Steps 6-8) suggests that the Chrome interface could be improved to make the "Do Not Track" setting more easily accessible.
</details>
Figure 10: Kimi-VL is capable of following multi-step reasoning processes to complete complex GUI tasks. In this example, it successfully enables the “Do Not Track” feature in the Chrome browser to enhance online privacy. The agent interprets each screen, identifies relevant UI elements, and performs the appropriate actions sequentially with clear thoughts, actions, and API calls.
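The click/scroll/back toolcalls logged in Figure 10 imply a small, structured action space. A sketch of how such toolcall strings might be parsed into dispatchable actions follows; the action names come from the figure, while the parser itself is an illustrative assumption.

```python
import re

# Parse toolcall strings such as "click(x=0.884, y=0.144)" or "back()"
# into (name, kwargs) pairs; the grammar mirrors the logs in Figure 10,
# where coordinates are normalized to the [0, 1] screen range.
TOOLCALL_RE = re.compile(r"(\w+)\((.*)\)")

def parse_toolcall(call):
    name, args = TOOLCALL_RE.fullmatch(call.strip()).groups()
    kwargs = {}
    for part in filter(None, (p.strip() for p in args.split(","))):
        key, value = part.split("=")
        kwargs[key] = float(value)
    return name, kwargs

step1 = parse_toolcall("click(x=0.884, y=0.144)")  # ('click', {'x': 0.884, 'y': 0.144})
step8 = parse_toolcall("back()")                   # ('back', {})
```

An agent runtime could then dispatch `name` to a handler (mouse click, scroll, browser back) using the normalized coordinates in `kwargs`.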
<details>
<summary>x11.png Details</summary>

### Visual Description
## Textual Document: Video Scene Descriptions
### Overview
The image presents a textual document containing detailed descriptions of video scenes, likely for editing or analysis purposes. The document is structured with timestamps indicating the start and end time of each scene, followed by a descriptive paragraph. The document appears to be in English.
### Components/Axes
The document consists of a series of entries, each with the following components:
* **Timestamp:** Indicates the start and end time of the scene (e.g., 00:00:00 - 00:00:15).
* **Scene Description:** A paragraph providing a detailed description of the visual and auditory elements of the scene.
### Detailed Analysis / Content Details
Here's a transcription of the scene descriptions, broken down by timestamp. Due to the length, I will provide a representative sample and summarize the rest.
**00:00:00 - 00:00:15:** "The scene opens with a dark room illuminated by a single light source, where a person is seen cooking food. The atmosphere is mysterious and intriguing, with the dim lighting and the steam rising from the cooking pot creating a sense of anticipation. At 00:00:05, text appears on the screen, reading 'THE SHEPHERD'S CINEMA'. The scene is characterized by a slow, deliberate pace, inviting the viewer to immerse themselves in the moment. The overall tone is one of warmth, comfort, and domesticity, suggesting a story centered around the simple pleasures of life."
**00:00:15 - 00:00:30:** "The scene opens with a close-up of a person's hands, revealing brown and weathered skin that suggests a life of hard work and experience. The camera focuses on the hands as they perform a delicate task, such as preparing a meal or tending to a garden. The camera slowly moves to the prayer wheel, capturing its intricate details and the serene expression of the person. The scene is bathed in soft, natural light, creating a sense of peace and tranquility. The overall tone is one of reverence and respect, suggesting a story centered around the importance of faith and tradition. The scene introduces a sense of cultural and spiritual depth, emphasizing themes of contemplation, spirituality, and the passage of time."
**00:00:30 - 00:00:34:** "The scene opens with a breathtaking aerial view of snow-capped mountains, setting the stage for a theme of natural grandeur and adventure. The camera pans across the landscape, revealing a vast expanse of pristine wilderness. The scene is characterized by a sense of awe and wonder, inviting the viewer to contemplate the beauty and power of nature. The overall tone is one of majesty and inspiration, suggesting a story centered around the challenges and rewards of exploration. On-screen text reads 'A NEW FILM BY SHEPPA'S CINEMA'. The scene conveys the cold conditions on the snowy hills, subtly hinting at the harsh realities of life in the mountains. The scene is visually stunning, with the snow-capped peaks and the clear blue sky creating a sense of serenity and peace. The overall mood is one of tranquility and isolation, suggesting a story centered around the themes of self-discovery and resilience."
**00:00:34 - 00:00:39:** "The scene opens with a close-up of a person’s eyes, which are reflecting a detailed view of a prayer wheel, emphasizing the intimate connection between the individual and their spiritual beliefs. The camera slowly zooms in on the eyes, capturing the subtle nuances of emotion. The scene is bathed in soft, diffused light, creating a sense of mystery and intrigue. The overall tone is one of introspection and contemplation, suggesting a story centered around the search for meaning and purpose. The scene introduces a sense of psychological depth, exploring the inner world of the character and their relationship to the divine. The scene is visually striking, with the reflection of the prayer wheel in the eyes creating a mesmerizing effect."
**00:00:39 - 00:00:41:** "The scene opens with a person in a traditional Mongolian hat, showcasing the rich cultural heritage and nomadic lifestyle of the region. The camera focuses on the hat, capturing its intricate details and the vibrant colors. The scene is characterized by a sense of authenticity and cultural pride, inviting the viewer to learn about a different way of life. The overall tone is one of respect and admiration, suggesting a story centered around the importance of preserving cultural traditions. The scene introduces a sense of exoticism and adventure, transporting the viewer to a remote and unfamiliar land. The scene is visually appealing, with the traditional hat and the surrounding landscape creating a striking contrast."
**00:00:41 - 00:00:44:** "The scene opens with a close-up of a person’s hands skillfully crafting a traditional Mongolian knot, demonstrating the artistry and craftsmanship of the region. The camera focuses on the hands as they perform the intricate task, capturing the precision and dexterity of the artisan. The scene is characterized by a sense of dedication and skill, inviting the viewer to appreciate the beauty of handmade objects. The overall tone is one of reverence and respect, suggesting a story centered around the importance of preserving traditional crafts. The scene introduces a sense of cultural depth, exploring the history and symbolism of the Mongolian knot. The scene is visually engaging, with the intricate knot and the skilled hands creating a captivating image."
**00:00:44 - 00:00:47:** "The scene opens with a wide shot of a vast, open landscape, showcasing the nomadic lifestyle and the close connection between the people and their environment. The camera pans across the landscape, revealing a sense of freedom and boundless space. The scene is characterized by a sense of adventure and exploration, inviting the viewer to imagine life on the open road. The overall tone is one of resilience and adaptability, suggesting a story centered around the challenges and rewards of living in harmony with nature. The scene introduces a sense of cultural authenticity, portraying the nomadic lifestyle in a realistic and respectful manner. The scene is visually stunning, with the vast landscape and the clear blue sky creating a sense of awe and wonder."
**00:00:47 - 00:00:52:** "The scene opens with a close-up of a person’s face, revealing a weathered and expressive countenance that speaks volumes about their life experiences. The camera focuses on the face, capturing the subtle nuances of emotion and the lines etched by time. The scene is characterized by a sense of intimacy and vulnerability, inviting the viewer to connect with the character on a personal level. The overall tone is one of wisdom and resilience, suggesting a story centered around the challenges and triumphs of the human spirit. The scene introduces a sense of psychological depth, exploring the inner world of the character and their relationship to the past. The scene is visually compelling, with the expressive face and the soft lighting creating a captivating image."
**00:00:52 - 00:00:58:** "The scene opens with a panoramic view of a bustling Mongolian market, showcasing the vibrant colors, sounds, and smells of daily life. The camera pans across the market, revealing a diverse array of goods and people. The scene is characterized by a sense of energy and excitement, inviting the viewer to immerse themselves in the local culture. The overall tone is one of authenticity and vibrancy, suggesting a story centered around the importance of community and commerce. The scene introduces a sense of cultural richness, portraying the Mongolian market as a microcosm of society. The scene is visually stimulating, with the colorful goods and the lively atmosphere creating a captivating spectacle."
**00:00:58 - 00:01:00:** "The scene opens with a close-up of a person’s hands preparing a traditional Mongolian meal, showcasing the culinary heritage and the importance of food in the culture. The camera focuses on the hands as they skillfully prepare the ingredients, capturing the precision and care of the cook. The scene is characterized by a sense of warmth and hospitality, inviting the viewer to share in the simple pleasures of life. The overall tone is one of comfort and nourishment, suggesting a story centered around the importance of family and tradition. The scene introduces a sense of cultural depth, exploring the history and symbolism of Mongolian cuisine. The scene is visually appealing, with the colorful ingredients and the skilled hands creating a mouthwatering image."
**(Summarized Remaining Scenes):** The remaining scenes continue in a similar vein, describing various aspects of Mongolian life, including nomadic traditions, religious practices (prayer wheels, temples), landscapes, and daily activities. Descriptions emphasize visual details, emotional tone, and cultural significance.
### Key Observations
* The descriptions are highly detailed and focus on visual elements, emotional tone, and cultural context.
* There's a recurring emphasis on hands and faces, suggesting a focus on human connection and individual experience.
* The descriptions consistently highlight the beauty and authenticity of Mongolian culture and landscapes.
* The document appears to be intended for use in film editing or analysis, providing a rich source of information for understanding the visual and thematic elements of the video.
### Interpretation
This document serves as a detailed visual script or shot list for a film or video project centered around Mongolian culture and lifestyle. The descriptions are not merely observational; they actively interpret the scenes, suggesting themes of spirituality, resilience, tradition, and connection to nature. The consistent focus on sensory details (light, color, sound) indicates a deliberate aesthetic approach. The document suggests a narrative that values authenticity and cultural sensitivity, aiming to portray Mongolian life with respect and nuance. The frequent use of phrases like "inviting the viewer to..." suggests an intention to create an immersive and emotionally engaging experience for the audience. The document is a valuable resource for understanding the artistic vision and thematic concerns of the video project.
</details>
Figure 11: Video scene splitting. Kimi-VL processes a long-form video by segmenting it into coherent scenes and providing detailed start/end timestamps along with fine-grained natural language descriptions for each scene.
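The timestamped scene headers shown above are easy to post-process downstream. A minimal sketch (our own helper names, not part of Kimi-VL's pipeline) that parses `HH:MM:SS - HH:MM:SS` ranges into second offsets and checks that consecutive scenes are contiguous:

```python
def parse_ts(ts: str) -> int:
    """Convert an HH:MM:SS timestamp to seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

def parse_scene(header: str) -> tuple:
    """Parse a '00:00:39 - 00:00:41' scene header into (start, end) seconds."""
    start, end = (parse_ts(t.strip()) for t in header.split("-"))
    return start, end

def contiguous(scenes: list) -> bool:
    """True if each scene starts exactly where the previous one ended."""
    return all(a[1] == b[0] for a, b in zip(scenes, scenes[1:]))
```

For example, `parse_scene("00:00:39 - 00:00:41")` yields `(39, 41)`, and `contiguous` verifies that the model's segmentation covers the video without gaps.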
<details>
<summary>x12.png Details</summary>

### Visual Description
## Screenshot: Presentation Slides - "Give a Man a Fish" Analogy
### Overview
The image is a screenshot of a presentation, likely a video or recording of a talk. It displays a series of slides illustrating the Chinese proverb "Give a man a fish, and you feed him for a day; teach a man to fish, and you feed him for a lifetime." The presentation expands on this proverb with a further requirement, which is the focus of the image. The slides show a speaker presenting to an audience, with the proverb and related text displayed on a screen behind them.
### Components/Axes
The screenshot is divided into a timeline at the bottom, indicating timecodes from 0:00 to 35:55. The main content consists of a series of six slides displayed sequentially. Each slide features a visual of people sitting at a long table, and text relating to the proverb. The bottom right corner of the last slide contains a "Thank you!" message and a Twitter handle "@dereksewell".
### Content Details
Here's a transcription of the text visible on each slide:
* **Slide 1-5:**
* "Loose analogy" (appears at the top of each slide)
* "Give a man a fish, and you feed him for a day."
* "Teach a man to fish, and you feed him for a lifetime."
* **Slide 6:**
* "Teach him the taste of fish and make him hungry."
* "Thank you!"
* "Twitter: @dereksewell"
The images on the slides show a speaker addressing an audience. The audience is seated at a long table, and the speaker is standing at a podium. The background is a large screen displaying the text of the proverb.
### Key Observations
The presentation builds upon the well-known proverb. The addition of "Teach him the taste of fish and make him hungry" introduces the concept of fostering a desire for continuous learning and improvement, rather than simply providing skills. The timeline suggests this is a segment from a longer presentation.
### Interpretation
The presentation uses the "fish" analogy to illustrate the importance of not just imparting knowledge or skills ("teaching a man to fish"), but also of inspiring a passion for learning and a drive for self-improvement ("teach him the taste of fish and make him hungry"). This suggests that true empowerment comes from cultivating a mindset of continuous growth and seeking out new challenges. The inclusion of a Twitter handle indicates the speaker is likely sharing these ideas online and encouraging further discussion. The overall message is about the power of motivation and the importance of fostering a lifelong love of learning. The presentation is not presenting data, but rather a philosophical concept. The visual element of the speaker and audience reinforces the idea of knowledge transfer and engagement.
</details>
Figure 12: Catching and understanding key details from an hour-long video course. Kimi-VL demonstrates its ability to comprehend and interpret instructional video content by analyzing frame sequences and extracting conceptual progression over time. In this case, the model identifies a deepening of the traditional saying “Teach a man to fish, and you feed him for a lifetime” into a more nuanced idea: “Teach him the taste of fish and make him hungry.”
4.1.1 College-level Academic Problems
Our Kimi-VL model demonstrates competitive performance on college-level academic benchmarks. On the MMMU validation set, it achieves a score of 57.0%, outperforming DeepSeek-VL2 (51.1%) and remaining comparable to Qwen2.5-VL-7B (58.6%) and even Gemma-3-12B-IT (59.6%), despite activating significantly fewer parameters. On video college-level problems, it significantly outperforms Qwen2.5-VL-7B and DeepSeek-VL2, trailing only the >10B Gemma-3-12B-IT, demonstrating solid university-level understanding relative to larger models. These results indicate that Kimi-VL effectively balances parameter efficiency with academic reasoning ability.
4.1.2 General Visual Ability
Kimi-VL exhibits strong general visual understanding capabilities across multiple benchmarks. On MMBench-EN-v1.1, it achieves 83.1% accuracy, outperforming all efficient VLMs in comparison and performing on par with GPT-4o. On AI2D, our model achieves 84.9%, surpassing all compared models including GPT-4o (84.6%). On MMVet, Kimi-VL scores 66.7%, closely matching Qwen2.5-VL-7B (67.1%) and GPT-4o-mini (66.9%). On RealWorldQA, it achieves 68.1%, outperforming Gemma-3-12B-IT (59.1%) and approaching Qwen2.5-VL-7B (68.5%). These results demonstrate that our model maintains robust general visual understanding despite its compact architecture.
In multi-image reasoning tasks, Kimi-VL shows promising capabilities with a score of 57.3% on the BLINK benchmark. This performance surpasses Qwen2.5-VL-7B (56.4%), GPT-4o-mini (53.6%), Gemma3-12B-IT (50.3%), and Llama3.2-11B-Inst. (39.8%). The ability to reason across multiple images requires understanding spatial and temporal relationships between visual elements, which our model handles effectively with fewer parameters than most competitors.
4.1.3 Mathematical Reasoning
With its relatively small scale, Kimi-VL also demonstrates strong mathematical reasoning capabilities, particularly on the MathVista benchmark, where it achieves 68.7%, outperforming all compared models including GPT-4o (63.8%) and Qwen2.5-VL-7B (68.2%). This indicates our model’s exceptional ability to understand and solve mathematical problems presented in visual contexts. On the more challenging MathVision benchmark, limited by its activated parameters, Kimi-VL outperforms DeepSeek-VL2 and Llama-3.2-11B-Inst. but lags behind Qwen2.5-VL-7B and Gemma-3-12B-IT. Nevertheless, through RL and test-time scaling, Kimi-VL-Thinking improves significantly and is already on par with 30B-level VLMs (see Table 4). These results highlight our model’s effectiveness in combining visual perception with mathematical problem-solving, an essential capability for real-world applications.
4.1.4 Document Understanding and OCR
Kimi-VL excels in document understanding and OCR tasks across all benchmarks in this category. On InfoVQA, it achieves 83.2% accuracy, outperforming GPT-4o (80.7%) and DeepSeek-VL2 (78.1%). For OCRBench, our model scores 86.7%, surpassing all other models including GPT-4o-mini (78.5%) and DeepSeek-VL2 (81.1%). These results demonstrate that our model has exceptional text recognition and document understanding capabilities, making it especially suitable for applications involving document processing and information extraction.
4.1.5 Agent Grounding and Multi-turn Agent Interaction
In agent-based tasks, Kimi-VL demonstrates remarkable performance. On single-step grounding, our model shows strong accuracy, with 92.0% on ScreenSpot-V2 and 34.5% on the extremely difficult ScreenSpot-Pro (4K screens), proving its strong agent grounding abilities. More importantly, it also shows strong multi-turn agent interaction abilities: on OSWorld, Kimi-VL reaches 8.22%, outperforming GPT-4o (5.03%) and other capable open-source models; on WindowsAgentArena, our model achieves 10.4%, also surpassing GPT-4o (9.4%) and others. These results highlight Kimi-VL’s exceptional ability to understand and interact with operating system interfaces, suggesting strong potential for applications in automated UI navigation and task execution.
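Grounding benchmarks such as ScreenSpot are commonly scored by checking whether the model's predicted click point falls inside the ground-truth element's bounding box. A minimal sketch of that check (helper names are ours, not from the benchmark code):

```python
def click_hit(pred_xy, bbox):
    """Return True if the predicted click point lands inside the
    ground-truth bounding box (x1, y1, x2, y2), edges inclusive."""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(preds, bboxes):
    """Fraction of predictions whose click point hits its target box."""
    hits = sum(click_hit(p, b) for p, b in zip(preds, bboxes))
    return hits / len(preds)
```

Under this convention, a 4K screen (ScreenSpot-Pro) makes the task much harder simply because target boxes occupy a far smaller fraction of the coordinate space.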
4.1.6 Long Document and Long Video Understanding
Kimi-VL demonstrates competitive performance in long-form content understanding. On MMLongBench-Doc, a challenging benchmark with question answering over documents of 100+ pages, it achieves 35.1%, outperforming GPT-4o-mini (29.0%) and Qwen2.5-VL-7B (29.6%), behind only GPT-4o (42.8%). For long video understanding, on Video-MME our model outperforms all efficient VLMs and especially leads in the fairer w/o subtitle setting, where models must find answers from video frames instead of exploiting input subtitles; in the w/ subtitle setting, it also reaches an extraordinary 72.6% accuracy. On the MCQ subset of MLVU, Kimi-VL achieves an impressive 74.2%, establishing state-of-the-art results and surpassing both GPT-4o (64.6%) and Qwen2.5-VL-7B (70.2%). On LongVideoBench, it scores 64.5%, outperforming all compared models except GPT-4o (66.7%). These results demonstrate Kimi-VL’s strong capability to understand long-form PDFs and videos.
4.1.7 Egocentric and Fine-grained Video Perception
Kimi-VL also shows strong performance on more nuanced video perception tasks. On the EgoSchema full set (hidden test set), it achieves 78.5%, significantly outperforming GPT-4o (72.2%) and Qwen2.5-VL-7B (65.0%). On VSI-Bench, a very challenging benchmark that requires understanding spatial relationships and correspondences among multiple objects in a video, our model scores 37.4%, surpassing GPT-4o (34.0%) and Qwen2.5-VL-7B (34.2%). On TOMATO, which examines fine-grained temporal perception of VLMs, Kimi-VL reaches 31.7%, outperforming Qwen2.5-VL-7B (27.6%) and GPT-4o-Mini (28.8%). These results demonstrate our model’s strong capability to understand dynamic visual content, track objects over time, and interpret complex actions in video sequences, making it well-suited for applications requiring temporal visual understanding.
4.2 Kimi-VL-A3B-Thinking: A Reasoning Extension of Kimi-VL
Furthermore, we extend Kimi-VL with the ability to reason via CoT and present a long-thinking version of the model, Kimi-VL-Thinking, obtained through long-CoT activation and reinforcement learning. We validate its superior performance on several image benchmarks, as shown in Table 4.
| Benchmark (Metric) | GPT-4o | GPT-4o-mini | — | Qwen2.5-VL-7B | Gemma-3-27B-IT | Gemma-3-12B-IT | — | QVQ-72B | — | Kimi-VL-Thinking | Kimi-VL-Thinking-2506 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MathVision (full) (Pass@1) | 30.4 | - | 38.1 | 25.1 | 35.5 | 32.1 | - | 35.9 | 38.6 | 36.8 | 56.9 |
| MathVista (mini) (Pass@1) | 63.8 | 56.7 | 74.8 | 68.2 | 62.3 | 56.4 | 71.0 | 71.4 | 74.9 | 71.3 | 80.1 |
| MMMU (val) (Pass@1) | 69.1 | 60.0 | 74.8 | 58.6 | 64.8 | 59.6 | 77.3 | 70.3 | 70.0 | 61.7 | 64.0 |
| MMMU-Pro (avg) (Pass@1) | 51.7 | 37.6 | 51.1 | 38.1 | - | 32.1 | - | - | - | 43.0 | 46.3 |
| VideoMMMU (Pass@1) | 61.1 | - | 60.2 | 47.0 | 61.8 | 57.2 | - | - | - | 55.5 | 65.2 |
Table 4: Performance of Kimi-VL-Thinking and Kimi-VL-Thinking-2506 on multimodal reasoning benchmarks. The evaluated benchmarks include MathVision (full), MathVista (mini), MMMU (val), MMMU-Pro (average), and VideoMMMU, with results reported as Pass@1. Kimi-VL-Thinking-2506 performs well in most cases, showcasing the enhanced reasoning and processing capabilities of the “thinking” variant across different domains and scales.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Scatter Plots: Accuracy vs. Max Thinking Length for Math Reasoning Benchmarks
### Overview
The image presents three separate scatter plots, each displaying the relationship between "Max Thinking Length (k tokens)" and "Test Time Accuracy (%)" for different math reasoning benchmarks: MathVision, MathVista, and MMMU. Each plot contains five data points, representing accuracy scores at different thinking lengths.
### Components/Axes
Each plot shares the following components:
* **X-axis:** "Max Thinking Length (k tokens)" with markers at 1, 2, 4, 8, and 16.
* **Y-axis:** "Test Time Accuracy (%)" ranging from approximately 16% to 72%.
* **Title:** Indicates the benchmark being evaluated (MathVision, MathVista, MMMU).
* **Data Points:** Black circular markers representing accuracy values at specific thinking lengths.
### Detailed Analysis
**1. MathVision**
* **Trend:** The accuracy generally increases as the Max Thinking Length increases.
* **Data Points:**
* Max Thinking Length = 1 k tokens: Accuracy ≈ 18.7%
* Max Thinking Length = 2 k tokens: Accuracy ≈ 22.6%
* Max Thinking Length = 4 k tokens: Accuracy ≈ 29.0%
* Max Thinking Length = 8 k tokens: Accuracy ≈ 34.0%
* Max Thinking Length = 16 k tokens: Accuracy ≈ 36.8%
**2. MathVista**
* **Trend:** The accuracy increases rapidly from 1 to 4 k tokens, then plateaus with a slight increase from 4 to 16 k tokens.
* **Data Points:**
* Max Thinking Length = 1 k tokens: Accuracy ≈ 66.7%
* Max Thinking Length = 2 k tokens: Accuracy ≈ 69.0%
* Max Thinking Length = 4 k tokens: Accuracy ≈ 70.9%
* Max Thinking Length = 8 k tokens: Accuracy ≈ 70.6%
* Max Thinking Length = 16 k tokens: Accuracy ≈ 71.3%
**3. MMMU**
* **Trend:** The accuracy increases steadily as the Max Thinking Length increases.
* **Data Points:**
* Max Thinking Length = 1 k tokens: Accuracy ≈ 49.2%
* Max Thinking Length = 2 k tokens: Accuracy ≈ 52.4%
* Max Thinking Length = 4 k tokens: Accuracy ≈ 56.2%
* Max Thinking Length = 8 k tokens: Accuracy ≈ 60.1%
* Max Thinking Length = 16 k tokens: Accuracy ≈ 61.7%
### Key Observations
* MathVista consistently achieves the highest accuracy across all thinking lengths.
* MathVision shows the lowest accuracy, but exhibits a clear positive correlation between thinking length and performance.
* The rate of accuracy improvement diminishes with increasing thinking length for MathVista, suggesting a point of diminishing returns.
* MMMU shows a consistent, linear improvement in accuracy with increasing thinking length.
### Interpretation
These plots demonstrate the impact of "Max Thinking Length" on the performance of language models on various math reasoning benchmarks. The "Max Thinking Length" parameter likely controls the amount of computational resources (tokens) allocated to the model for problem-solving.
The differences in accuracy across benchmarks suggest varying levels of complexity and the models' inherent capabilities in tackling different types of math problems. MathVista appears to be the easiest benchmark, as it achieves high accuracy even with limited thinking length. MathVision is the most challenging, requiring more extensive reasoning to achieve comparable results.
The diminishing returns observed in MathVista indicate that beyond a certain point, increasing the thinking length does not significantly improve performance. This could be due to the model reaching its capacity to effectively utilize additional computational resources for that specific task. The linear improvement in MMMU suggests that the model could potentially benefit from even longer thinking lengths, although practical limitations (computational cost) may exist.
These results are valuable for optimizing the performance of language models on math reasoning tasks by identifying the optimal balance between thinking length and accuracy for each benchmark.
</details>
Figure 13: Test-time accuracy when scaling the max thinking token length of our Kimi-VL-Thinking model.
Kimi-VL-Thinking significantly improves over the base Kimi-VL model, with gains of 2.6% on MathVista, 4.7% on MMMU, and 15.4% on MathVision, demonstrating its capability to leverage test-time computation for deeper reasoning and better handling of complex multimodal queries. In Table 4, Kimi-VL-Thinking further outperforms or rivals state-of-the-art thinking and non-thinking models: achieving 71.3% on MathVista, outperforming GPT-4o (63.8%) and GPT-4o-mini (56.7%); scoring 61.7% on MMMU, surpassing GPT-4o-mini (60.0%) and Qwen2.5-VL-7B (58.6%); and reaching 36.8% on MathVision, exceeding GPT-4o (30.4%) and Gemma-3-27B-IT (35.5%), even QVQ-72B (35.9%). While marginally behind some larger-scale models on select benchmarks, Kimi-VL-Thinking accomplishes these results with only 3B activated parameters—orders of magnitude fewer than its counterparts—underscoring its strong efficiency and effectiveness in multimodal reasoning.
Our Kimi-VL-Thinking model also exhibits strong test-time scaling properties, as shown in Figure 13. Increasing the max thinking token length at inference time consistently improves test-time accuracy across all three benchmarks. For example, on MathVision, accuracy rises steadily from 18.7% at 1k tokens to 36.8% at 16k tokens, and a similar upward trend is observed on MMMU, indicating that the model can exploit longer reasoning chains for better performance. However, not all benchmarks benefit equally from longer thinking lengths. On MathVista, performance saturates early: accuracy reaches 70.9% at 4k tokens, with no further significant gains as the token length increases to 16k. This suggests that for this task the necessary reasoning depth is already captured within a relatively short context, and additional computation yields no further improvement.
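The contrast between steady scaling (MathVision) and early saturation (MathVista) can be made concrete by computing the accuracy gain per doubling of the thinking budget, using the data points reported in Figure 13 (the helper below is a sketch of our own, not part of the evaluation code):

```python
def gains_per_doubling(points):
    """Given [(budget_k_tokens, accuracy_pct), ...] sorted by budget,
    where each budget doubles the previous one, return the accuracy
    delta contributed by each doubling step."""
    return [round(b[1] - a[1], 1) for a, b in zip(points, points[1:])]

# Data points as reported in Figure 13.
mathvision = [(1, 18.7), (2, 22.6), (4, 29.0), (8, 34.0), (16, 36.8)]
mathvista  = [(1, 66.7), (2, 69.0), (4, 70.9), (8, 70.6), (16, 71.3)]
```

`gains_per_doubling(mathvision)` gives positive deltas at every doubling, whereas the MathVista deltas shrink to roughly zero past the 4k-token budget.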
4.3 Kimi-VL-A3B-Thinking-2506: From Reasoning Extension to Integrated Thinking Model
Table 5: Performance of Kimi-VL-A3B-Thinking-2506 on multimodal benchmarks that do not require extensive reasoning.
| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
| --- | --- | --- | --- | --- | --- | --- |
| General Multimodal | | | | | | |
| MMBench-EN-v1.1 (Acc) | 83.1 | 83.2 | 74.6 | 82.9 | 76.0 | 84.4 |
| RealWorldQA (Acc) | 75.4 | 68.5 | 59.1 | 68.1 | 64.0 | 70.0 |
| OCRBench (Acc) | 815 | 864 | 702 | 864 | 864 | 869 |
| MMStar (Acc) | 64.0 | 63.0 | 56.1 | 61.7 | 64.2 | 70.4 |
| MMVet (Acc) | 69.1 | 67.1 | 64.9 | 66.7 | 69.5 | 78.1 |
| Video | | | | | | |
| MMVU (val) (Pass@1) | 67.4 | 50.1 | 57.0 | 52.7 | 53.0 | 57.5 |
| Video-MME (w/ sub.) (Acc) | 77.2 | 71.6 | 62.1 | 72.7 | 66.0 | 71.9 |
| OS-Agent Grounding | | | | | | |
| ScreenSpot-Pro (Acc) | 0.8 | 29.0 | — | 35.4 | — | 52.8 |
| ScreenSpot-V2 (Acc) | 18.1 | 84.2 | — | 92.8 | — | 91.4 |
| OSWorld-G (Acc) | - | 31.5 | — | 41.6 | — | 52.5 |
| Long Document | | | | | | |
| MMLongBench-Doc (Acc) | 42.8 | 29.6 | 21.3 | 35.1 | 32.5 | 42.1 |
While Kimi-VL-A3B-Thinking shows excellent thinking abilities on hard reasoning tasks, we further provide the updated Kimi-VL-A3B-Thinking-2506 (tech blog: https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506), a new reasoning variant that is not only smarter but also integrates key abilities of Kimi-VL-A3B-Instruct (perception, video, long-document, and OS-agent abilities) into a single thinking model.
Kimi-VL-Thinking-2506 significantly improves reasoning performance while reducing token consumption. As shown in Table 4, it achieves 56.9% on MathVision (+20.1% over the original Kimi-VL-Thinking), 80.1% on MathVista (+8.4%), 46.3% on MMMU-Pro (+3.2%), and 64.0% on MMMU (+2.1%), demonstrating non-trivial gains across multiple reasoning benchmarks. Meanwhile, while solving these hard reasoning problems, the 2506 version reduces the average output token length by around 20% (e.g., 2.9K → 2.4K on MMMU-val and 5.8K → 4.4K on MathVision), making it more efficient and user-friendly for practical deployment.
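The quoted "around 20%" follows directly from the reported average output lengths; a quick arithmetic check:

```python
def reduction_pct(before_k, after_k):
    """Relative reduction in average output tokens, in percent."""
    return 100 * (before_k - after_k) / before_k

# Average output token lengths (in thousands) reported in the text.
mmmu_val   = reduction_pct(2.9, 2.4)  # ≈ 17.2%
mathvision = reduction_pct(5.8, 4.4)  # ≈ 24.1%
```

Averaging the two benchmarks gives roughly 20.7%, consistent with the claimed reduction.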
Beyond extensive reasoning tasks, Kimi-VL-A3B-Thinking-2506 demonstrates stronger visual perception capabilities (Table 5). Compared to the previous non-thinking variant (Kimi-VL-A3B-Instruct), it achieves competitive or superior results on general multimodal understanding benchmarks: 84.4% on MMBench-EN-v1.1, 70.4% on MMStar, 70.0% on RealWorldQA, and 78.1% on MMVet, underscoring its broader competence in vision-language tasks. In terms of token efficiency, the 2506 version requires on average only 180 tokens per answer on MMBench, one third of the previous thinking model’s usage, while improving accuracy by 8.4%.
Kimi-VL-A3B-Thinking-2506 also extends its reasoning ability to video and long-context domains. It establishes new state-of-the-art results among open-source models on VideoMMMU (65.2%, 4% better than GPT-4o), a challenging video reasoning benchmark, and maintains robust general video understanding with 71.9% on Video-MME, matching the long-video understanding ability of Kimi-VL-A3B-Instruct. It also scores 42.1% on MMLongBench-Doc (Table 5), making it the first open-source model to match GPT-4o there: a 10% improvement over the previous thinking model and 7% over the previous instruct model, demonstrating its robust ability on broader long-form visual inputs.
As mentioned in the method section, the continual training of MoonViT (3.2 million max input pixels) for Kimi-VL-A3B-Thinking-2506 leads to substantial improvements on high-resolution perception and OS-grounding benchmarks: 83.2% on the V* Benchmark (without external tools), 52.8% on ScreenSpot-Pro, and 52.5% on OSWorld-G (full set with refusal samples), large gains over both previous variants. We hope this high-resolution multimodal reasoning model enables interesting new capabilities in the real world.
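The effect of a native-resolution pixel budget can be illustrated with a small helper that downscales an image only when it exceeds the cap, preserving aspect ratio. This is a sketch of the general idea, not MoonViT's actual preprocessing:

```python
import math

MAX_PIXELS = 3_200_000  # the 3.2M max-input-pixel budget reported for the 2506 variant

def fit_resolution(w: int, h: int, max_pixels: int = MAX_PIXELS):
    """Uniformly downscale (w, h) so that w*h <= max_pixels, preserving
    aspect ratio; the native resolution is kept if already within budget."""
    if w * h <= max_pixels:
        return w, h
    scale = math.sqrt(max_pixels / (w * h))
    return int(w * scale), int(h * scale)
```

A 1080p screenshot (about 2.07M pixels) passes through untouched, while a 4K screen (about 8.3M pixels) is scaled down to fit the budget; this is why common tasks keep their low cost while ultra-high-resolution inputs remain tractable.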
5 Conclusion, Limitation, and Future Work
We introduce Kimi-VL, a VLM designed with a balanced approach to cover both multimodal and text-only pre-training/post-training, underpinned by an MoE-based architecture for scalable efficiency. Its 128K extended context window enables precise retrieval in lengthy texts and videos, while the native-resolution encoder MoonViT helps maintain high accuracy with low computational overhead in ultra-high-resolution visual tasks. Additionally, Kimi-VL-Thinking facilitates effective long-chain reasoning in complex image and video inference. Overall, Kimi-VL demonstrates robust adaptability and efficiency across multimodal, long-context, and high-resolution tasks, indicating substantial potential for future research and industrial applications.
However, Kimi-VL still faces several challenges:
1. Although the current model size performs effectively on many standard tasks, it remains too limited for highly specialized or domain-specific problems, or for problems that depend strongly on language ability, restricting Kimi-VL’s capacity to handle extremely complex scenarios.
2. While the reasoning capability is already strong for typical use cases, it has yet to reach its theoretical upper bound, particularly for intricate tasks requiring multi-step inference or deeper contextual understanding.
3. Despite providing a 128K extended context window, due to the limited parameters in its attention layers (comparable only to a 3B model), its long-context abilities are still insufficient for certain advanced applications involving extremely long sequences or high-volume contextual information.
In the future, we will tackle these challenges by scaling up the model size, expanding pre-training data, and enhancing post-training algorithms. Our next steps include optimizing Kimi-VL and releasing larger versions, as well as refining post-training and test-time scaling mechanisms for a better thinking model. These efforts will pave the way for more advanced applications in both research and industry.
\printbibliography[title=References]
Appendix
Appendix A Contributions
Core Contributors
Bohong Yin Bowei Xing Cheng Chen Chu Wei Dehao Zhang Dongliang Wang Haoning Wu ∗ Haotian Yao Haoyu Lu ∗ Hao Yang Kun Ouyang Lin Sui Xinyuan Wang # Xinyu Zhou Yang Li Y. Charles ∗ Yiping Bao Yimin Chen Yuanxin Liu Yuxin Wu Zaida Zhou Zhaowei Li Zhiqi Huang Zhilin Yang Ziwei Chen
Contributors
Angang Du Bowen Qu Bowen Wang # Chenlin Zhang Chenzhuang Du Congcong Wang Dikang Du Enming Yuan Enzhe Lu Fang Li Flood Sung Guangda Wei Guokun Lai Han Zhu Hao Ding Hao Hu Hao Zhang Heng Wang Hongcheng Gao Huabin Zheng Jiaming Li Jianlin Su Jianzhou Wang Jiaqi Deng # Jiezhong Qiu Jin Xie Jinhong Wang Jingyuan Liu Junjie Yan Liang Chen Longhui Yu Mengfan Dong Mengnan Dong Nuo Xu Pengyu Cheng Qizheng Gu Runjie Zhou Shaowei Liu Sihan Cao Tao Yu # Tianhui Song Tongtong Bai Weiran He Wei Song Weixiao Huang Weixin Xu Xiaokun Yuan Xingzhe Wu Xingcheng Yao Xinhao Li Xinxing Zu Yangyang Hu Yan Zhong Yanru Chen Yibo Miao Yejie Wang Yibo Liu Yidao Qin Yiqin Wang Yongsheng Kang Yuhao Dong Yulun Du Yuzhi Wang Yuzi Yan Zhejun Jiang Zheng Zhang Zihao Huang Zijia Zhao Zongyu Lin
* Project lead(s). # The University of Hong Kong, Moonshot.ai. The listing of authors is in alphabetical order based on their first names.
Appendix B Evaluation Details
B.1 Image Benchmark
MMMU \parencite yue2024mmmu encompasses a carefully curated collection of 11.5K multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six major academic fields: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
MMBench-EN-v1.1 \parencite MMBench is a fine-grained benchmark that contains 2,974 multiple-choice questions covering 20 ability dimensions. It incorporates perception and reasoning as the top-level ability dimensions in its ability taxonomy, enabling evaluation at different levels across the various ability dimensions.
MMStar \parencite chen2024mmstar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans. It is designed to benchmark 6 core capabilities and 18 detailed axes, aiming to evaluate the multimodal capacities of LVLMs with a carefully balanced and purified selection of samples.
MMVet \parencite yu2024mmvet is designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language capabilities. It defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination.
RealWorldQA \parencite realworldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal models. It assesses how well the models comprehend physical environments. The benchmark consists of over 700 images, each accompanied by a question and a verifiable answer, and these images are drawn from various real-world scenarios.
AI2D \parencite kembhavi2016ai2d is a dataset of over 5,000 grade-school science diagrams with over 150,000 rich annotations, their ground-truth syntactic parses, and more than 15,000 corresponding multiple-choice questions.
MathVision \parencite wang2024measuring is a carefully curated collection of 3,040 high-quality mathematical problems with visual contexts that are sourced from real math competitions. It covers 16 distinct mathematical disciplines and is graded across 5 levels of difficulty. This dataset offers a comprehensive and diverse set of challenges, making it ideal for evaluating the mathematical reasoning abilities of LMMs.
MathVista \parencite lu2023mathvista is a benchmark that integrates challenges from a variety of mathematical and visual tasks, demanding participants to exhibit fine-grained, deep visual understanding along with compositional reasoning to successfully complete the tasks.
BLINK \parencite fu2024blink is a benchmark designed to evaluate multi-image visual cognition, encompassing tasks related to depth relationships, feature matching, digital forensics, and spatiotemporal reasoning. It features a diverse set of multi-image perceptual similarity tasks, validated through standardized protocols.
InfoVQA \parencite mathew2022infographicvqa is a dataset specifically designed to assess models’ capabilities in interpreting and reasoning with complex infographics that integrate text, graphics, and visual elements. Model performance on this dataset is evaluated using the ANLS metric on the test set.
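The ANLS metric used for InfoVQA is the average normalized Levenshtein similarity, with per-answer scores below a threshold (commonly 0.5) zeroed out. A compact single-reference sketch under those standard conventions (the full benchmark takes the maximum similarity over multiple gold answers):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= threshold else 0.0
```

The thresholding rewards answers that are essentially correct up to minor OCR-style errors while giving no credit for loosely related strings.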
OCRBench \parencite liu2023hidden evaluates the OCR capabilities of MLLMs across five tasks: text recognition, scene text VQA, document VQA, key information extraction, and handwritten math expression recognition. The benchmark is scored out of a maximum of 1000 points.
B.2 Video and Long Document Benchmark
VideoMMMU \parencite arxiv2025videommmu is a video benchmark designed to evaluate the college-level knowledge acquisition capabilities of large multimodal models. It consists of 300 expert-level videos and 900 human-annotated questions. The videos span six diverse academic disciplines: Art, Humanities, Medicine, Business, Science, and Engineering. The questions are structured to align with three cognitive stages: Perception, Comprehension, and Adaptation.
MMVU \parencite arxiv2025mmvu is a video benchmark designed to evaluate expert-level video understanding. The benchmark contains 3,000 expert-annotated questions over 1,529 videos, which span 27 subjects from four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering.
Video-MME \parencite arxiv2024videomme is a video benchmark that consists of 900 manually selected videos (totaling 254 hours in length) and 2,700 QA pairs. The videos, varying in duration, are categorized into 30 fine-grained classes across six diverse domains: Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual content. Evaluations are conducted under two settings: with and without subtitles.
MLVU \parencite arxiv2024mlvu is designed to evaluate the model performance in comprehending long videos from multiple aspects. It consists of 1,730 videos along with 3,102 corresponding question-answer pairs (2,593 in dev set and 509 in test set). Videos of this benchmark are collected from multiple scenarios, including Sport, Ego-centric, Life Record, Tutorial, etc. The close-ended task set of MLVU comprises 7 different tasks: Action Order, Action Count, Topic Reasoning, Anomaly Recognition, Plot QA, Ego Reasoning, and Needle QA.
LongVideoBench \parencite nips2024longvideobench is a video question-answering benchmark designed to evaluate the long-form multimodal perception and relation capabilities of large multimodal models. The benchmark includes 3,763 web-collected videos spanning various lengths and themes, along with their corresponding subtitles. It includes 6,678 human-annotated multiple-choice questions, distributed across 17 fine-grained categories, which assess different aspects of video-language understanding.
EgoSchema \parencite nips2023egoschema is a video benchmark designed to evaluate long-form video understanding capabilities in the ego-centric scenario. Derived from Ego4D \parencite cvpr2022ego4d, the benchmark comprises 5,031 multiple-choice question-answer pairs spanning more than 250 hours of real-world video, curated with a semi-automatic data pipeline.
VSI-Bench \parencite arxiv2024vsibench is designed to evaluate the visual-spatial comprehension capabilities of large multimodal models. It consists of over 5,000 question-answer pairs across around 290 real indoor-scene videos.
TOMATO \parencite iclr2025tomato is a video benchmark comprising 1,484 human-annotated question-answer pairs and 1,417 videos. TOMATO focuses on evaluating the temporal reasoning capabilities of large multimodal models, including action counting, direction prediction, rotation analysis, shape & trend detection, velocity & frequency estimation, and visual cue interpretation.
B.3 Agent Benchmarks
ScreenSpot V2 \parencite wu2024osatlas is an enhanced version of the ScreenSpot \parencite cheng2024seeclick benchmark, which focuses on evaluating the performance of GUI grounding models across multiple platforms, including web, desktop, and mobile interfaces. This updated version addresses several issues identified in the original ScreenSpot dataset, such as incorrect or ambiguous annotations, spelling mistakes, and mislabeled bounding boxes.
ScreenSpot Pro \parencite li2025screenspotpro is a benchmark for evaluating GUI grounding in high-resolution, complex UI environments. It contains 1,581 real-world, high-resolution images and expert-annotated tasks from diverse professional domains, including domain-specific interface conventions that challenge models to understand professional-grade interfaces beyond consumer applications.
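GUI grounding benchmarks of this kind are typically scored by whether the model's predicted click point falls inside the target element's bounding box. A minimal sketch of that accuracy computation (helper names are ours, not those of any official harness):

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def click_hit(point: Tuple[float, float], box: Box) -> bool:
    """True if the predicted click lands inside the target bounding box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(preds: Iterable[Tuple[float, float]],
                       boxes: Iterable[Box]) -> float:
    """Fraction of predicted click points that hit their target boxes."""
    hits = [click_hit(p, b) for p, b in zip(preds, boxes)]
    return sum(hits) / len(hits)

# One hit and one miss → accuracy 0.5
print(grounding_accuracy([(50, 50), (10, 10)],
                         [(40, 40, 60, 60), (100, 100, 120, 120)]))
```

High-resolution suites such as ScreenSpot-Pro make this harder in practice because target elements occupy a much smaller fraction of the screen, so small localization errors turn hits into misses.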
OSWorld \parencite xie2024osworld is a pioneering scalable, real computer environment designed for multimodal agents, facilitating task setup, execution-based evaluation, and interactive learning across multiple operating systems, including Ubuntu, Windows, and macOS. It serves as a unified platform for evaluating open-ended computer tasks that involve arbitrary applications, addressing the limitations of existing benchmarks that often lack interactive environments or are confined to specific applications or domains.
WindowsAgentArena \parencite bonatti2024windowsagentarenaevaluating is a benchmark designed to evaluate multimodal agents in realistic Windows environments. Built on the OSWorld framework, it allows agents to interact with a full range of applications and web tools. The benchmark is scalable and can complete evaluations in under 20 minutes on Azure. It offers insights into agent performance, highlighting the potential for future research in agent development and task automation.