# Kimi-VL Technical Report
**Authors**: Kimi Team
## Abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains.
Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks.
Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities (84.4 on MMBench, 83.2 on V* and 52.8 on ScreenSpot-Pro). With only around 3B activated parameters, it sets a new standard for efficient yet capable multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Scatter Plot: AI Model Performance vs. Model Size
### Overview
This image is a scatter plot comparing the performance of various multimodal AI models on a mathematical vision benchmark against their computational scale (activated parameters). The plot reveals a general trend where larger models tend to achieve higher scores, but with significant outliers demonstrating high efficiency.
### Components/Axes
* **X-Axis:** Labeled **"Activated Parameters (B)"**. The scale is logarithmic, with major tick marks at **3, 10, 30, and 70** billion parameters.
* **Y-Axis:** Labeled **"MathVision Pass@1"**. The scale is linear, ranging from **20 to 65**, with major grid lines at intervals of 15 (20, 35, 50, 65).
* **Data Series & Legend:** The plot contains multiple data series, each represented by a distinct color and marker shape. The legend is embedded directly as labels next to each data point.
* **Dark Blue Star:** `Kimi-VL-A3B-Thinking-2506`
* **Light Blue Star:** `Kimi-VL-A3B-Thinking`
* **Purple Circles (connected by dashed line):** `Gemma-3-4B-IT`, `Gemma-3-12B-IT`, `Gemma-3-27B-IT`
* **Gray Circles (connected by dashed line):** `Qwen-2.5-VL-3B`, `Qwen-2.5-VL-7B`, `Qwen-2.5-VL-32B`, `Qwen-2.5-VL-72B`
* **Blue Circle:** `DeepSeek-VL2-A4.5B`
* **Red Circle:** `Llama-3.2-11B-Inst.`
* **Green Crosses:** `QVQ-72B-Preview`, `QVQ-Max-Preview`
### Detailed Analysis
**Data Points (Approximate Coordinates: Activated Parameters (B), MathVision Pass@1):**
1. **Kimi-VL-A3B-Thinking-2506 (Dark Blue Star):** Positioned at the top-left. Coordinates: **(~3B, ~60)**. This is the highest-performing model on the chart.
2. **Kimi-VL-A3B-Thinking (Light Blue Star):** Positioned below the first star. Coordinates: **(~3B, ~37)**.
3. **Gemma-3 Series (Purple Circles, upward trend):**
* `Gemma-3-4B-IT`: **(~4B, ~25)**
* `Gemma-3-12B-IT`: **(~12B, ~32)**
* `Gemma-3-27B-IT`: **(~27B, ~35)**
* *Trend:* Performance increases with model size, but the rate of improvement slows.
4. **Qwen-2.5-VL Series (Gray Circles, upward then plateauing trend):**
* `Qwen-2.5-VL-3B`: **(~3B, ~21)**
* `Qwen-2.5-VL-7B`: **(~7B, ~25)**
* `Qwen-2.5-VL-32B`: **(~32B, ~38)**
* `Qwen-2.5-VL-72B`: **(~72B, ~38)**
* *Trend:* Strong improvement from 3B to 32B, then a plateau between 32B and 72B.
5. **DeepSeek-VL2-A4.5B (Blue Circle):** Coordinates: **(~4.5B, ~18)**. Positioned below the Gemma-3-4B-IT point.
6. **Llama-3.2-11B-Inst. (Red Circle):** Coordinates: **(~11B, ~15)**. This is the lowest-performing model on the chart for its size.
7. **QVQ Series (Green Crosses, high-parameter region):**
* `QVQ-72B-Preview`: **(~72B, ~36)**. Positioned slightly below the Qwen-2.5-VL-72B point.
* `QVQ-Max-Preview`: **(~120B?, ~49)**. The rightmost point, with an estimated parameter count beyond the 70B tick.
### Key Observations
1. **Efficiency Outliers:** The `Kimi-VL-A3B-Thinking-2506` model is a dramatic outlier, achieving the highest score (~60) with one of the smallest parameter counts (~3B). This indicates exceptional parameter efficiency for this specific task.
2. **Performance Plateau:** The `Qwen-2.5-VL` series shows a clear performance plateau, where increasing parameters from 32B to 72B yields no improvement in the MathVision Pass@1 score.
3. **Size-Performance Disconnect:** Larger models do not guarantee better performance. `Llama-3.2-11B-Inst.` (~11B) underperforms both smaller models (e.g., `Qwen-2.5-VL-3B`) and similarly sized models (e.g., `Gemma-3-12B-IT`).
4. **General Trend:** Excluding the major outliers, there is a loose positive correlation between activated parameters and benchmark score, as seen in the Gemma-3 and the initial segment of the Qwen-2.5-VL series.
### Interpretation
This chart visualizes the trade-off and variance in **efficiency versus scale** for multimodal AI models on a mathematical reasoning task.
* **The "Kimi" models** suggest that architectural innovations or training techniques (implied by the "-Thinking" suffix) can lead to breakthroughs in efficiency, achieving state-of-the-art results with a fraction of the parameters used by competitors.
* The **plateau in the Qwen series** indicates diminishing returns for simply scaling a particular model architecture on this benchmark. It suggests that beyond a certain point (~32B for this model family), other factors like data quality, training methodology, or architectural limits become the primary bottleneck.
* The **underperformance of Llama-3.2-11B-Inst.** highlights that not all models of a certain size are created equal; their training data, objective alignment, and architecture critically determine their capability on specialized tasks like visual math.
* The **QVQ-Max-Preview** point shows that very large scale can still push performance higher, but it requires a massive increase in parameters to achieve a score that is still below the much smaller "Kimi" model.
**In summary, the data argues that for specialized reasoning tasks, intelligent model design and training can be far more impactful than brute-force scaling. The chart serves as a benchmark for evaluating not just raw performance, but the efficiency and effectiveness of different AI development approaches.**
</details>
Figure 1: Comparison between Kimi-VL-Thinking-2506 and frontier open-source VLMs, including short-thinking VLMs (e.g. Gemma-3 series, Qwen2.5-VL series) and long-thinking VLMs (QVQ-72B/Max-Preview), on MathVision benchmark. Our model achieves strong multimodal reasoning with just 2.8B LLM activated parameters.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Grouped Bar Chart: AI Model Performance Across Visual-Language Benchmarks
### Overview
This image is a composite grouped bar chart comparing the performance of seven different AI models across six major benchmark categories, each containing one or two specific tests. The chart is organized into six distinct sections, each representing a benchmark category. The primary purpose is to visually compare the scores of the models, with a particular emphasis on the model "Kimi-VL-A3B," which is highlighted in blue.
### Components/Axes
**Legend (Top Center):**
The legend is positioned at the top of the entire chart, spanning horizontally. It maps model names to specific colors:
* **Kimi-VL-A3B**: Bright Blue
* **Qwen2.5-VL-7B**: Dark Gray
* **DeepSeek-VL2**: Light Gray
* **GPT-4o**: Very Dark Gray (almost black)
* **GPT-4o-mini**: Very Light Gray
* **Llama-3.2-11B-Inst.**: Light Brown/Tan
* **Gemma-3-12B-IT**: Beige/Light Tan
**Chart Sections & Axes:**
The chart is segmented into six regions, each with its own title and y-axis scale. The x-axis within each section lists the specific benchmark names.
1. **GENERAL (Top Left):**
* **Benchmarks:** MMMU (val), MMBench-EN-v1.1
* **Y-Axis:** Linear scale from 40 to 60 for MMMU (val); from 60 to 90 for MMBench-EN-v1.1.
2. **OCR (Top Center):**
* **Benchmark:** InfoVQA
* **Y-Axis:** Linear scale from 30 to 90.
3. **MULTI-IMAGE (Top Right):**
* **Benchmark:** BLINK
* **Y-Axis:** Linear scale from 38 to 62.
4. **LONG VIDEO (Bottom Left):**
* **Benchmarks:** LongVideoBench, Video-MME (w/o sub)
* **Y-Axis:** Linear scale from 40 to 72 for LongVideoBench; from 40 to 72 for Video-MME (w/o sub).
5. **LONG DOC (Bottom Center):**
* **Benchmark:** MMLongBench-Doc
* **Y-Axis:** Linear scale from 8 to 40.
6. **AGENT (Bottom Right):**
* **Benchmarks:** ScreenSpot-Pro, OSWorld (Pass@1)
* **Y-Axis:** Linear scale from 0 to 40 for ScreenSpot-Pro; from 0 to 10 for OSWorld (Pass@1).
### Detailed Analysis
**1. GENERAL Benchmarks:**
* **MMMU (val):**
* **Trend:** Kimi-VL-A3B and Qwen2.5-VL-7B lead, followed by GPT-4o-mini, then DeepSeek-VL2, with Llama-3.2-11B-Inst. and Gemma-3-12B-IT trailing.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 57
* Qwen2.5-VL-7B (Dark Gray): 58.6
* DeepSeek-VL2 (Light Gray): 51.1
* GPT-4o-mini (Very Light Gray): 60
* Llama-3.2-11B-Inst. (Light Brown): 48
* Gemma-3-12B-IT (Beige): 59.6
* **MMBench-EN-v1.1:**
* **Trend:** Kimi-VL-A3B leads, followed closely by Qwen2.5-VL-7B and DeepSeek-VL2. GPT-4o-mini and Gemma-3-12B-IT are in the next tier, with Llama-3.2-11B-Inst. significantly lower.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 83.1
* Qwen2.5-VL-7B (Dark Gray): 82.6
* DeepSeek-VL2 (Light Gray): 79.6
* GPT-4o-mini (Very Light Gray): 77.1
* Llama-3.2-11B-Inst. (Light Brown): 65.8
* Gemma-3-12B-IT (Beige): 74.6
**2. OCR Benchmark:**
* **InfoVQA:**
* **Trend:** Kimi-VL-A3B and Qwen2.5-VL-7B are the top performers, with DeepSeek-VL2 close behind. GPT-4o-mini is in the middle tier, while Llama-3.2-11B-Inst. and Gemma-3-12B-IT score notably lower.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 83.2
* Qwen2.5-VL-7B (Dark Gray): 82.6
* DeepSeek-VL2 (Light Gray): 78.1
* GPT-4o-mini (Very Light Gray): 57.9
* Llama-3.2-11B-Inst. (Light Brown): 34.6
* Gemma-3-12B-IT (Beige): 43.8
**3. MULTI-IMAGE Benchmark:**
* **BLINK:**
* **Trend:** Kimi-VL-A3B leads, followed by Qwen2.5-VL-7B and DeepSeek-VL2. Gemma-3-12B-IT is in the next tier, with Llama-3.2-11B-Inst. scoring the lowest.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 57.3
* Qwen2.5-VL-7B (Dark Gray): 56.4
* DeepSeek-VL2 (Light Gray): 53.6
* Llama-3.2-11B-Inst. (Light Brown): 39.8
* Gemma-3-12B-IT (Beige): 50.3
**4. LONG VIDEO Benchmarks:**
* **LongVideoBench:**
* **Trend:** Kimi-VL-A3B leads, followed by DeepSeek-VL2 and Qwen2.5-VL-7B. Gemma-3-12B-IT is next, with Llama-3.2-11B-Inst. scoring the lowest.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 64.5
* Qwen2.5-VL-7B (Dark Gray): 56
* DeepSeek-VL2 (Light Gray): 58.2
* Llama-3.2-11B-Inst. (Light Brown): 45.5
* Gemma-3-12B-IT (Beige): 51.5
* **Video-MME (w/o sub):**
* **Trend:** Kimi-VL-A3B leads, followed closely by Qwen2.5-VL-7B and DeepSeek-VL2. Gemma-3-12B-IT is in the next tier, with Llama-3.2-11B-Inst. scoring the lowest.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 67.8
* Qwen2.5-VL-7B (Dark Gray): 65.1
* DeepSeek-VL2 (Light Gray): 64.8
* Llama-3.2-11B-Inst. (Light Brown): 46
* Gemma-3-12B-IT (Beige): 58.2
**5. LONG DOC Benchmark:**
* **MMLongBench-Doc:**
* **Trend:** Kimi-VL-A3B leads significantly. Qwen2.5-VL-7B and DeepSeek-VL2 are in the next tier, followed by Gemma-3-12B-IT and Llama-3.2-11B-Inst.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 35.1
* Qwen2.5-VL-7B (Dark Gray): 29.6
* DeepSeek-VL2 (Light Gray): 29
* Llama-3.2-11B-Inst. (Light Brown): 13.8
* Gemma-3-12B-IT (Beige): 21.3
**6. AGENT Benchmarks:**
* **ScreenSpot-Pro:**
* **Trend:** Kimi-VL-A3B leads, followed by Qwen2.5-VL-7B. GPT-4o-mini scores very low.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 34.5
* Qwen2.5-VL-7B (Dark Gray): 29
* GPT-4o-mini (Very Light Gray): 0.8
* **OSWorld (Pass@1):**
* **Trend:** Kimi-VL-A3B leads, followed by GPT-4o and Qwen2.5-VL-7B.
* **Data Points (Approximate):**
* Kimi-VL-A3B (Blue): 8.2
* Qwen2.5-VL-7B (Dark Gray): 2.5
* GPT-4o (Very Dark Gray): 5
### Key Observations
1. **Consistent Leader:** The Kimi-VL-A3B model (blue bars) achieves the highest or near-highest score in every single benchmark presented.
2. **Strong Competitors:** Qwen2.5-VL-7B (dark gray) and DeepSeek-VL2 (light gray) are consistently in the top tier, often swapping second and third place.
3. **Variable Performance of Other Models:** GPT-4o-mini, Llama-3.2-11B-Inst., and Gemma-3-12B-IT show more variable performance. They are competitive in some benchmarks (e.g., Gemma-3-12B-IT in MMMU val) but fall significantly behind in others (e.g., Llama-3.2-11B-Inst. in InfoVQA and MMLongBench-Doc).
4. **Missing Data:** The GPT-4o model (very dark gray) only appears in the OSWorld (Pass@1) benchmark, suggesting it was not evaluated on the other tasks shown here.
5. **Task-Specific Gaps:** The performance gap between the leading models and the lower-performing ones is most pronounced in the OCR (InfoVQA) and LONG DOC (MMLongBench-Doc) benchmarks.
### Interpretation
This chart serves as a comparative performance report for visual-language AI models. The data strongly suggests that **Kimi-VL-A3B is a state-of-the-art model across a wide spectrum of visual-language tasks**, excelling in general understanding, OCR, multi-image reasoning, long video comprehension, long document processing, and agent-based interaction.
The consistent high ranking of Qwen2.5-VL-7B and DeepSeek-VL2 indicates they are also top-tier models, forming a leading group with Kimi-VL-A3B. The variability in the performance of models like Llama-3.2-11B-Inst. highlights that model capabilities are highly task-dependent; a model strong in one area (e.g., general benchmarks) may be weak in another (e.g., OCR or long-document understanding).
The chart is likely intended for a technical audience (researchers, engineers) to quickly assess model strengths and inform decisions about which model to use for specific applications. The emphasis on Kimi-VL-A3B, through its distinctive color and consistent top placement, suggests the chart may be part of a promotional or technical report highlighting its capabilities. The absence of GPT-4o from most benchmarks is a notable data gap, limiting a full comparison with that specific model.
</details>
Figure 2: Highlights of Kimi-VL performance across a wide range of benchmarks: general (MMMU, MMBench), OCR (InfoVQA), multi-image (BLINK), long video (LongVideoBench, Video-MME), long document (MMLongBench-Doc), and agent (ScreenSpot-Pro, OSWorld). Detailed results are presented in Table 3.
## 1 Introduction
With the rapid advancement of artificial intelligence, human expectations for AI assistants have transcended traditional language-only interactions, increasingly aligning with the inherently multimodal nature of our world. To better understand and interact with these expectations, new generations of natively multimodal models, such as GPT-4o openai2024gpt4ocard and Google Gemini geminiteam2024gemini15unlockingmultimodal, have emerged with the capability to seamlessly perceive and interpret visual inputs alongside language processing. Most recently, advanced multimodal models, pioneered by OpenAI o1 series o12024 and Kimi k1.5 team2025kimi, have further pushed these boundaries by incorporating deeper and longer reasoning on multimodal inputs, thereby tackling more complex problems in the multimodal domain.
Nevertheless, development in large VLMs in the open-source community has significantly lagged behind their language-only counterparts, particularly in aspects of scalability, computational efficiency, and advanced reasoning capabilities. While language-only model DeepSeek R1 deepseekai2025deepseekr1incentivizingreasoningcapability has already leveraged the efficient and more scalable mixture-of-experts (MoE) architecture and facilitated sophisticated long chain-of-thought (CoT) reasoning, most recent open-source VLMs, e.g. Qwen2.5-VL bai2025qwen25vltechnicalreport and Gemma-3 gemmateam2025gemma3technicalreport, continue to rely on dense architectures and do not support long-CoT reasoning. Early explorations into MoE-based vision-language models, such as DeepSeek-VL2 wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels and Aria li2024ariaopenmultimodalnative, exhibit limitations in other crucial dimensions. Architecturally, both models still adopt relatively traditional fixed-size vision encoders, hindering their adaptability to diverse visual inputs. From a capability perspective, DeepSeek-VL2 supports only a limited context length (4K), while Aria falls short in fine-grained visual tasks. Additionally, neither of them supports long-thinking abilities. Consequently, there remains a pressing need for an open-source VLM that effectively integrates structural innovation, stable capabilities, and enhanced reasoning through long-thinking.
In light of this, we present Kimi-VL, a vision-language model for the open-source community. Structurally, Kimi-VL consists of our Moonlight liu2025muonscalablellmtraining MoE language model with only 2.8B activated (16B total) parameters, paired with a 400M native-resolution MoonViT vision encoder. In terms of capability, as illustrated in Figure 2, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.). Specifically, it features the following exciting abilities:
1) Kimi-VL is smart: it has comparable text ability against efficient pure-text LLMs; without long thinking, Kimi-VL is already competitive in multimodal reasoning and multi-turn agent benchmarks, e.g., MMMU, MathVista, OSWorld.
2) Kimi-VL processes long inputs: it effectively tackles long-context understanding across diverse multimodal inputs within its 128K context window, far ahead of similar-scale competitors on long video benchmarks and MMLongBench-Doc.
3) Kimi-VL perceives clearly: it shows all-round competitive ability against existing efficient dense and MoE VLMs across various vision-language scenarios: visual perception, visual world knowledge, OCR, high-resolution OS screenshots, etc.
Furthermore, with long-CoT activation and reinforcement learning (RL), we introduce the long-thinking version of Kimi-VL, Kimi-VL-Thinking, which substantially improves performance on more complex multimodal reasoning scenarios. Despite its small scale, Kimi-VL-Thinking offers compelling performance on hard reasoning benchmarks (e.g., MMMU, MathVision, MathVista), outperforming many larger state-of-the-art VLMs. We further release an improved version of the thinking model, Kimi-VL-Thinking-2506, which performs even better on these reasoning benchmarks while retaining or improving common visual perception and understanding abilities, e.g., high-resolution perception (V*), OS grounding, and video and long-document understanding.
## 2 Approach
### 2.1 Model Architecture
<details>
<summary>x3.png Details</summary>

### Visual Description
## Technical Diagram: Multimodal AI System Architecture
### Overview
This image is a technical diagram illustrating the architecture of a multimodal AI system. It depicts how various input types (text, images, video, UI screenshots, OCR text) are processed through a central vision transformer ("MoonViT") and then fed into a Mixture-of-Experts (MoE) language decoder. The diagram emphasizes native-resolution processing and the integration of diverse data modalities.
### Components/Axes
The diagram is organized into three primary regions:
1. **Top Region (Blue Background): Mixture-of-Experts (MoE) Language Decoder**
* **Main Label:** "Mixture-of-Experts (MoE) Language Decoder"
* **Sub-components:**
* "MoE FFN" (Feed-Forward Network)
* "Attention Layer"
* A detailed breakout box showing the MoE routing mechanism:
* "Router"
* "Non-shared Experts" (represented by a row of outlined squares)
* "Shared Experts" (represented by a row of filled squares)
* The notation "× N" indicates this block is repeated N times.
* **Input/Output:** A sequence of colored squares (representing tokens) flows into and out of this decoder block.
2. **Central Region: Core Processing Module**
* **Primary Module:** "MoonViT" with the subtitle "(Native-resolution)". This is the central vision transformer.
* **Bridge Component:** "MLP Projector" (Multi-Layer Perceptron), positioned between the MoE decoder and MoonViT, likely for feature projection.
3. **Bottom Region: Input Modalities**
Five distinct input types are shown, all feeding into the MoonViT module via colored arrows:
* **Left (Light Blue Arrow):** "SMALL IMAGE"
* Contains a mathematical notation: `2a - b`
* Dimension labels: "50px" (width), "20px" (height).
* **Bottom Left (Red Arrow):** "LONG VIDEO"
* Depicted as a stack of video frames.
* Dimension labels: "480px" (width), "270px" (height).
* Text visible on the video frames: "CULTURAL CROSSINGS: A JOURNEY OF DISCOVERY".
* **Bottom Center (Green Arrow):** "FINE-GRAINED"
* A high-resolution photograph of a terraced tea plantation.
* Dimension labels: "1113px" (width), "1008px" (height).
* A small white bounding box is drawn on the image, highlighting a specific detail.
* **Bottom Right (Gray Arrow):** "OCR (SPECIAL ASPECT RATIO)"
* A handwritten text snippet: "fastest? That is the exciting competition going on".
* Dimension label: "58px" (height).
* **Right (Orange Arrow):** "UI SCREENSHOT"
* A screenshot of a smartphone home screen (iOS-style).
* Dimension labels: "800px" (width), "1731px" (height).
* Visible UI elements include app icons (FaceTime, Calendar, Photos, Camera, Mail, Clock, Maps, Weather, Notes, etc.), widgets (calendar, weather), and status bar icons.
### Detailed Analysis
* **Data Flow:** The flow is bottom-up and then top-down. Raw multimodal inputs (image, video, UI, text) are first processed by the MoonViT vision encoder. The encoded visual features are then passed through the MLP Projector to the MoE Language Decoder, which generates the final textual output (represented by the token sequence at the very top).
* **Key Architectural Features:**
* **Native-Resolution Processing:** The "MoonViT" label explicitly states it operates on native-resolution inputs, avoiding resizing or padding that could distort information, especially critical for the "FINE-GRAINED" image and "UI SCREENSHOT".
* **Mixture-of-Experts (MoE):** The language decoder uses an MoE architecture. The "Router" dynamically directs input tokens to a subset of "Non-shared Experts" while also utilizing "Shared Experts". This design aims for computational efficiency and model specialization.
* **Diverse Input Handling:** The system is designed to handle a wide range of aspect ratios and content types, from small mathematical images (50x20px) to tall UI screenshots (800x1731px) and long video sequences.
### Key Observations
1. **Input Diversity:** The diagram explicitly showcases five fundamentally different input types, highlighting the system's multimodal capability.
2. **Resolution Emphasis:** Pixel dimensions are provided for every input, underscoring the importance of resolution and aspect ratio in the system's design.
3. **Specialized OCR Input:** The "OCR (SPECIAL ASPECT RATIO)" input suggests the system has a dedicated pathway or training for recognizing text in unusual layouts or handwritten forms.
4. **Visual Detail Focus:** The "FINE-GRAINED" input with its bounding box implies the system can process and reason about specific regions within a high-resolution image.
5. **MoE Complexity:** The detailed breakout of the MoE block indicates that the language generation component is a significant and complex part of the architecture.
### Interpretation
This diagram represents a sophisticated, unified multimodal AI architecture. The core innovation appears to be the **MoonViT** module, which acts as a universal visual encoder capable of ingesting images, video frames, and screenshots at their native resolutions. This preserves critical spatial and textual details that would be lost with standard resizing.
The encoded visual information is then translated (via the MLP Projector) into a format that the powerful **MoE Language Decoder** can understand. The MoE decoder, with its router and mix of shared/specialized experts, is designed to efficiently generate coherent and contextually appropriate language based on the complex visual input.
The system's purpose is likely **visual question answering, document understanding, or detailed image/video captioning**. It can take a complex scene (like a UI screenshot or a detailed landscape) and answer questions about it, describe it, or extract information from it (as hinted by the OCR input and the "What can you interpret from..." text fragment near the top). The inclusion of "LONG VIDEO" suggests it may also handle temporal reasoning across frames.
**Notable Anomaly/Challenge:** The vast difference in input dimensions (from 20px height to 1731px height) presents a significant technical challenge for consistent feature extraction, which the "native-resolution" claim of MoonViT aims to address. The architecture suggests a move away from traditional, rigid vision encoders towards more flexible, resolution-agnostic models.
</details>
Figure 3: The model architecture of Kimi-VL and Kimi-VL-Thinking, consisting of a MoonViT that allows native-resolution images, an MLP projector, and a Mixture-of-Experts (MoE) language decoder.
The architecture of Kimi-VL consists of three parts: a native-resolution vision encoder (MoonViT), an MLP projector, and an MoE language model, as depicted in Figure 3. We introduce each part in this section.
**MoonViT: A Native-resolution Vision Encoder**
We design MoonViT, the vision encoder of Kimi-VL, to natively process images at their varying resolutions, eliminating the need for complex sub-image splitting and splicing operations, as employed in LLaVA-OneVision li2024llavaonevisioneasyvisualtask. We incorporate the packing method from NaViT dehghani2023patchnpacknavit, where images are divided into patches, flattened, and sequentially concatenated into 1D sequences. These preprocessing operations enable MoonViT to share the same core computation operators and optimization as a language model, such as the variable-length sequence attention mechanism supported by FlashAttention dao2022flashattentionfastmemoryefficientexact, ensuring non-compromised training throughput for images of varying resolutions.
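The patchify-and-pack scheme described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the patch size and the `pack_images` helper are assumptions (the report does not specify them), and the cumulative-sequence-length array mirrors what variable-length FlashAttention kernels consume as attention boundaries.

```python
import numpy as np

PATCH = 14  # hypothetical patch size; the report does not state MoonViT's value

def pack_images(images):
    """NaViT-style packing sketch: patchify each image at its native
    resolution, flatten to a 1D patch sequence, and concatenate all
    sequences, recording cumulative lengths so a variable-length
    attention kernel can keep images from attending to each other."""
    seqs, cu_seqlens = [], [0]
    for img in images:  # img: (H, W, C), H and W divisible by PATCH
        h, w, c = img.shape
        gh, gw = h // PATCH, w // PATCH
        # (gh, PATCH, gw, PATCH, C) -> (gh*gw, PATCH*PATCH*C)
        patches = (img.reshape(gh, PATCH, gw, PATCH, c)
                      .transpose(0, 2, 1, 3, 4)
                      .reshape(gh * gw, PATCH * PATCH * c))
        seqs.append(patches)
        cu_seqlens.append(cu_seqlens[-1] + patches.shape[0])
    return np.concatenate(seqs, axis=0), np.asarray(cu_seqlens)

# Two images of different resolutions share one packed 1D sequence:
# 28x56 -> 2x4 = 8 patches, 42x28 -> 3x2 = 6 patches.
packed, cu = pack_images([np.zeros((28, 56, 3)), np.zeros((42, 28, 3))])
print(packed.shape)  # 14 packed patches, each a flattened 14*14*3 vector
```

Because every image becomes a contiguous span inside one flat sequence, no resizing or padding is needed and throughput is independent of the mix of resolutions in the batch.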
MoonViT is initialized from and continually pre-trained on SigLIP-SO-400M zhai2023sigmoidlosslanguageimage, which originally employs learnable fixed-size absolute positional embeddings to encode spatial information. While we interpolate these original position embeddings to better preserve SigLIP’s capabilities, the interpolated embeddings become increasingly inadequate as image resolution grows. To address this limitation, we incorporate 2D rotary positional embedding (RoPE) su2023roformerenhancedtransformerrotary across the height and width dimensions, which improves the representation of fine-grained positional information, especially in high-resolution images. These two positional embedding approaches work together to encode spatial information for our model and seamlessly integrate with the flattening and packing procedures. This integration enables MoonViT to efficiently process images of varying resolutions within the same batch. The resulting continuous image features are then forwarded to the MLP projector and, ultimately, to the MoE language model for subsequent training stages. In Kimi-VL-A3B-Thinking-2506, we further continually train MoonViT to faithfully encode up to 3.2 million pixels from a single image, four times its original limit.
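A common way to realize 2D RoPE, and a plausible reading of the scheme above, is to rotate one half of each feature vector by the patch's row index and the other half by its column index. The sketch below illustrates this idea only; the dimension split, frequency layout, and base are assumptions, not MoonViT's actual configuration.

```python
import numpy as np

def rope_2d(x, rows, cols, base=10000.0):
    """Illustrative 2D RoPE: the first half of each patch feature is
    rotated according to the patch's row position, the second half
    according to its column position, so the grid location is encoded
    without any fixed-size learned table."""
    half = x.shape[-1] // 2

    def rotate(v, pos, dim):
        freqs = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,) frequencies
        ang = pos[:, None] * freqs[None, :]            # (n, dim/2) angles
        cos, sin = np.cos(ang), np.sin(ang)
        v1, v2 = v[:, 0::2], v[:, 1::2]                # paired channels
        out = np.empty_like(v)
        out[:, 0::2] = v1 * cos - v2 * sin             # 2D rotation per pair
        out[:, 1::2] = v1 * sin + v2 * cos
        return out

    return np.concatenate(
        [rotate(x[:, :half], rows, half), rotate(x[:, half:], cols, half)],
        axis=-1)

# A 2x3 patch grid with 8-dim features per patch.
rows, cols = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
x = np.random.randn(6, 8)
y = rope_2d(x, rows.ravel(), cols.ravel())
```

Because each pairwise rotation is norm-preserving, this injects position purely into the phase of query/key features, which is why it composes cleanly with the flatten-and-pack pipeline at any resolution.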
**MLP Projector**
We employ a two-layer MLP to bridge the vision encoder (MoonViT) and the LLM. Specifically, we first use a pixel shuffle operation to compress the spatial dimension of the image features extracted by MoonViT, performing 2×2 downsampling in the spatial domain and correspondingly expanding the channel dimension. We then feed the pixel-shuffled features into a two-layer MLP to project them into the dimension of LLM embeddings.
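The projector described above amounts to the following shape manipulation, sketched here in NumPy under stated assumptions: the activation function and all dimensions are illustrative (the report does not specify them), and weights are random stand-ins for learned parameters.

```python
import numpy as np

def pixel_shuffle_project(feats, w1, b1, w2, b2):
    """Sketch of the projector: a 2x2 pixel-shuffle downsample that
    stacks each 2x2 block of ViT features into the channel dimension
    (4x wider), followed by a two-layer MLP into the LLM embedding
    dimension. Activation choice (ReLU here) is an assumption."""
    h, w, c = feats.shape
    x = (feats.reshape(h // 2, 2, w // 2, 2, c)
              .transpose(0, 2, 1, 3, 4)
              .reshape(h // 2, w // 2, 4 * c))   # spatial 2x2 -> channels
    x = np.maximum(x @ w1 + b1, 0.0)             # MLP layer 1 + nonlinearity
    return x @ w2 + b2                           # MLP layer 2 -> LLM dim

# Toy sizes: a 4x6 grid of 16-dim ViT features -> 2x3 grid of 32-dim embeddings.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 16))
w1, b1 = rng.standard_normal((64, 64)), np.zeros(64)
w2, b2 = rng.standard_normal((64, 32)), np.zeros(32)
out = pixel_shuffle_project(feats, w1, b1, w2, b2)
print(out.shape)  # (2, 3, 32)
```

The shuffle quarters the number of visual tokens handed to the language model, trading channel width for sequence length before projection.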
**Mixture-of-Experts (MoE) Language Model**
The language model of Kimi-VL utilizes our Moonlight model liu2025muonscalablellmtraining, an MoE language model with 2.8B activated parameters, 16B total parameters, and an architecture similar to DeepSeek-V3 deepseekai2025deepseekv3technicalreport. For our implementation, we initialize from an intermediate checkpoint in Moonlight’s pre-training stage—one that has processed 5.2T tokens of pure text data and activated an 8192-token (8K) context length. We then continue pre-training it using a joint recipe of multimodal and text-only data totaling 2.3T tokens, as detailed in Sec. 2.3.
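The shared/routed expert layout shown in Figure 3 can be sketched as a toy forward pass. This is a schematic of the general DeepSeek-V3-style pattern, not the actual layer: expert counts, the softmax-over-top-k gating, and the linear "experts" are all simplifications for illustration.

```python
import numpy as np

def moe_ffn(x, shared, routed, router_w, top_k=2):
    """Toy MoE FFN: every token always passes through the shared
    experts, and the router additionally sends it to a renormalized
    top-k subset of the routed (non-shared) experts."""
    out = sum(e(x) for e in shared)              # shared experts: always active
    logits = x @ router_w                        # (tokens, n_routed) router scores
    for t in range(x.shape[0]):
        idx = np.argsort(logits[t])[-top_k:]     # top-k experts for token t
        w = np.exp(logits[t, idx]); w /= w.sum() # renormalized gate weights
        out[t] += sum(wi * routed[i](x[t]) for wi, i in zip(w, idx))
    return out

rng = np.random.default_rng(0)
d, n_routed = 8, 4
make = lambda: (lambda v, W=rng.standard_normal((d, d)): v @ W)  # stand-in expert
shared = [make()]
routed = [make() for _ in range(n_routed)]
x = rng.standard_normal((3, d))
y = moe_ffn(x, shared, routed, rng.standard_normal((d, n_routed)))
print(y.shape)  # (3, 8)
```

The efficiency argument in the text follows directly: only `top_k` of the routed experts run per token, so activated parameters (2.8B) stay far below total parameters (16B).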
### 2.2 Muon Optimizer
We use an enhanced Muon optimizer liu2025muon for model optimization. Compared to the original Muon optimizer jordan2024muon, we add weight decay and carefully adjust the per-parameter update scale. Additionally, we develop a distributed implementation of Muon following the ZeRO-1 rajbhandari2020zero optimization strategy, which achieves optimal memory efficiency and reduced communication overhead while preserving the algorithm’s mathematical properties. This enhanced Muon optimizer is used throughout the entire training process to optimize all model parameters, including the vision encoder, the projector, and the language model.
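A single update of this enhanced Muon can be sketched as below. The Newton-Schulz coefficients follow the public reference Muon implementation (jordan2024muon); the decoupled weight decay mirrors the enhancement described here, while the specific `0.2 * sqrt(max(shape))` update-scale rule and all hyperparameter values are assumptions for illustration, not the report's exact settings.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Odd-polynomial Newton-Schulz iteration that approximately
    orthogonalizes a matrix (coefficients from the reference Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)           # normalize spectral scale
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                                  # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, M, lr=0.02, momentum=0.95, wd=0.01):
    """One sketched enhanced-Muon update: momentum accumulation,
    orthogonalized direction, a per-matrix scale (assumed heuristic),
    and the decoupled weight decay added in the enhanced variant."""
    M = momentum * M + grad
    O = newton_schulz(M)
    scale = 0.2 * np.sqrt(max(W.shape))          # assumed update-scale rule
    W = W * (1 - lr * wd) - lr * scale * O
    return W, M
```

The key property is that the applied update is (approximately) orthogonal, so every singular direction of the momentum contributes at comparable magnitude regardless of the raw gradient's conditioning.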
### 2.3 Pre-Training Stages
As illustrated in Figure 4 and Table 1, after loading the intermediate language model discussed above, Kimi-VL’s pre-training comprises a total of 4 stages consuming 4.4T tokens overall: first, standalone ViT training to establish a robust native-resolution visual encoder, followed by three joint training stages (pre-training, cooldown, and long-context activation) that simultaneously enhance the model’s language and multimodal capabilities. The details are as follows.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Multimodal AI Model Training Pipeline
### Overview
The image is a horizontal flowchart illustrating a multi-stage training pipeline for a multimodal AI model. The process flows from left to right, beginning with two parallel initial training phases that converge into a joint training sequence. The diagram uses color-coded blocks (shades of blue) and annotated green arrows to depict stages, data volumes, and key procedural notes.
### Components/Axes
The diagram consists of four primary rectangular blocks arranged horizontally, with one additional block stacked vertically on the far left. Green curved arrows with text annotations connect specific stages.
**Block 1 (Top-Left):**
* **Title:** Text Pre-training
* **Data Volume:** 5.2T data
* **Description:** Pure Text Data
**Block 2 (Bottom-Left, stacked below Block 1):**
* **Title:** ViT Training
* **Data Volume:** 2.0T -> 0.1T data
* **Description:** CoCa-loss with tiny language decoder -> align to LLM
**Block 3 (Center-Left):**
* **Title:** Joint Pre-training
* **Data Volume:** 1.4T data
* **Description:** Up to 40% Multimodal Data / Progressive Multimodal Ratio
**Block 4 (Center-Right):**
* **Title:** Joint Cooldown
* **Data Volume:** 0.6T data
* **Description:** High-quality Text & Multimodal Data / Re-warmup to higher LR
**Block 5 (Far-Right):**
* **Title:** Joint Long-context
* **Data Volume:** 0.3T data
* **Description:** Long Text & Long Video & Long Doc / RoPE base: 50,000 -> 800,000
**Connecting Elements:**
* **Arrow 1:** A green, curved arrow originates from the top-right corner of the "Text Pre-training" block and points to the top-left corner of the "Joint Pre-training" block. The text above the arrow reads: `resumes LR scheduler`.
* **Arrow 2:** A green, curved arrow originates from the top-right corner of the "Joint Pre-training" block and points to the top-left corner of the "Joint Cooldown" block. The text above the arrow reads: `resumes LR scheduler`.
### Detailed Analysis
The pipeline describes a sequential training regimen with distinct phases characterized by data type, volume, and learning rate (LR) schedule.
1. **Initial Parallel Phase:**
* **Text Pre-training:** This is the largest single data phase, using 5.2 trillion (`5.2T`) tokens of pure text data.
* **ViT Training:** This phase shows a data reduction, starting with 2.0 trillion (`2.0T`) tokens and ending with 0.1 trillion (`0.1T`) tokens. It uses a CoCa-loss function with a tiny language decoder, with the explicit goal to "align to LLM."
2. **Joint Training Sequence:** The outputs of the initial phases feed into a joint training sequence.
* **Joint Pre-training:** Uses 1.4 trillion (`1.4T`) data tokens. The multimodal data ratio is not fixed; it increases progressively up to a maximum of 40%.
* **Joint Cooldown:** Uses a smaller, curated dataset of 0.6 trillion (`0.6T`) tokens described as "High-quality Text & Multimodal Data." A key procedural step is a "Re-warmup to higher LR," indicating a deliberate adjustment of the learning rate schedule.
* **Joint Long-context:** The final phase uses the smallest dataset of 0.3 trillion (`0.3T`) tokens. It focuses on extending the model's context window for "Long Text & Long Video & Long Doc." A technical specification notes the RoPE (Rotary Positional Embedding) base is increased from 50,000 to 800,000.
### Key Observations
* **Data Volume Trend:** The total data volume decreases significantly across the joint training phases (1.4T -> 0.6T -> 0.3T), suggesting a shift from broad pre-training to specialized fine-tuning.
* **Learning Rate (LR) Management:** The LR scheduler is explicitly "resumed" when transitioning from the initial text pre-training to joint pre-training, and again from joint pre-training to joint cooldown. The cooldown phase itself involves a "re-warmup to a higher LR," indicating active and nuanced management of this hyperparameter.
* **Specialization of Phases:** Each joint phase has a clear, distinct purpose: general multimodal integration (Pre-training), quality refinement (Cooldown), and context window extension (Long-context).
* **Architectural Alignment:** The ViT (Vision Transformer) training phase has the explicit goal of aligning its output to the LLM (Large Language Model), which is a critical step for effective multimodal fusion.
### Interpretation
This diagram outlines a sophisticated, staged approach to building a capable multimodal AI. The process begins by separately establishing strong unimodal foundations in text (LLM) and vision (ViT). The critical "alignment" step in ViT training ensures the vision encoder's output is compatible with the language model's representation space.
The subsequent joint phases represent a deliberate curriculum. The model first learns to process mixed text and image/video data (Joint Pre-training). It then refines this ability on a smaller, higher-quality dataset while adjusting the learning rate to escape potential local minima (Joint Cooldown). Finally, it specializes in handling very long sequences of text and visual data, which is essential for understanding documents, videos, and complex narratives (Joint Long-context). The progressive increase in multimodal data ratio and the final extension of the RoPE base are technical strategies to efficiently build a model that is not just multimodal, but also capable of deep, long-form reasoning across modalities. The decreasing data volumes across joint stages imply a focus on precision and specialization over raw scale as the model matures.
</details>
Figure 4: The pre-training stages of Kimi-VL consume a total of 4.4T tokens after text-only pre-training of its language model. To preserve text abilities, all stages that update the language model are joint training stages.
Table 1: Overview of training stages: data composition, token volumes, sequence lengths, and trainable components.
| Stages | ViT Training | Joint Pre-training | Joint Cooldown | Joint Long-context |
| --- | --- | --- | --- | --- |
| Data | Alt text, Synthetic caption, Grounding, OCR | + Text, Knowledge, Interleaving, Video, Agent | + High-quality text, High-quality multimodal, Academic sources | + Long text, Long video, Long document |
| Tokens | 2T + 0.1T | 1.4T | 0.6T | 0.3T |
| Sequence length | 8192 | 8192 | 8192 | 32768 -> 131072 |
| Trainable | ViT | ViT & LLM | ViT & LLM | ViT & LLM |
ViT Training Stages
MoonViT is trained on image-text pairs, where the text components consist of a variety of targets: image alt texts, synthetic captions, grounding bboxes, and OCR texts. The training incorporates two objectives: a SigLIP zhai2023sigmoidlosslanguageimage loss $\mathcal{L}_{siglip}$ (a variant of contrastive loss) and a cross-entropy loss $\mathcal{L}_{caption}$ for caption generation conditioned on input images. Following CoCa’s approach yu2022cocacontrastivecaptionersimagetext, the final loss function is formulated as $\mathcal{L}=\mathcal{L}_{siglip}+\lambda\mathcal{L}_{caption}$ , where $\lambda=2$ . Specifically, the image and text encoders compute the contrastive loss, while the text decoder performs next-token prediction (NTP) conditioned on features from the image encoder. To accelerate training, we initialized both encoders with SigLIP SO-400M zhai2023sigmoidlosslanguageimage weights and implemented a progressive resolution sampling strategy that gradually admits larger image resolutions; the text decoder is initialized from a tiny decoder-only language model. During training, we observed an emergent drop in the caption loss while scaling up OCR data, indicating that the text decoder had developed some OCR capabilities. After training the ViT in this CoCa-like stage with 2T tokens, we align MoonViT to the MoE language model using another 0.1T tokens, during which only MoonViT and the MLP projector are updated. This alignment stage significantly reduces the initial perplexity of MoonViT embeddings in the language model, enabling a smoother joint pre-training stage, described below.
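The combined objective above can be sketched as follows. This is a minimal illustrative implementation, not the actual training code: the temperature/bias values, the pairwise loops, and the pre-computed token log-probabilities are all simplifying assumptions; in practice both losses operate on batched encoder/decoder outputs.

```python
# Sketch of the ViT-training objective: L = L_siglip + lambda * L_caption, lambda = 2.
# Temperature t and bias b are illustrative defaults, not Kimi-VL's actual values.
import math

def siglip_loss(sim, t=10.0, b=-10.0):
    """Sigmoid contrastive loss over an NxN image-text similarity matrix.
    Diagonal entries are positive pairs (label +1), off-diagonal negative (-1)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            x = z * (t * sim[i][j] + b)
            # -log sigmoid(x), computed in a numerically stable form
            total += math.log1p(math.exp(-x)) if x > 0 else -x + math.log1p(math.exp(x))
    return total / n

def caption_loss(token_logprobs):
    """Next-token-prediction loss: mean negative log-likelihood of caption tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def coca_style_loss(sim, token_logprobs, lam=2.0):
    """Total loss following the CoCa-style weighting with lambda = 2."""
    return siglip_loss(sim) + lam * caption_loss(token_logprobs)
```

Only the weighting (contrastive term plus twice the captioning term) is taken from the text; everything else is scaffolding to make the formula concrete.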
Joint Pre-training Stage
In the joint pre-training stage, we train the model with a combination of pure text data (sampled from the same distribution as the initial language model) and a variety of multimodal data (as discussed in Sec. 3.1). We continue training from the loaded LLM checkpoint using the same learning rate scheduler, consuming an additional 1.4T tokens. The initial steps utilize solely language data, after which the proportion of multimodal data gradually increases. Through this progressive approach and the previous alignment stage, we observe that joint pre-training preserves the model’s language capabilities while successfully integrating visual comprehension abilities.
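A schedule of this shape can be sketched as below. The report states only that the initial steps are text-only and that the multimodal ratio then increases progressively (Figure 4 caps it at 40%); the linear ramp and the text-only fraction here are assumptions for illustration.

```python
def multimodal_ratio(step, total_steps, text_only_frac=0.1, max_ratio=0.4):
    """Hypothetical fraction of multimodal tokens at a given training step.
    Text-only warm-up, then a linear ramp to the 40% cap (ramp shape is assumed)."""
    start = text_only_frac * total_steps
    if step < start:
        return 0.0  # initial steps use solely language data
    progress = (step - start) / max(total_steps - start, 1)
    return min(max_ratio, max_ratio * progress)
```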
Joint Cooldown Stage
The stage following pre-training is a multimodal cooldown phase, in which the model is further trained on high-quality language and multimodal datasets to ensure superior performance. For the language part, our empirical investigation shows that incorporating synthetic data during the cooldown phase yields significant performance improvements, particularly in mathematical reasoning, knowledge-based tasks, and code generation. The general text components of the cooldown dataset are curated from high-fidelity subsets of the pre-training corpus. For the math, knowledge, and code domains, we employ a hybrid approach: utilizing selected pre-training subsets while augmenting them with synthetically generated content. Specifically, we leverage existing mathematical knowledge and code corpora as source material to generate question-answer (QA) pairs through a proprietary language model, applying rejection sampling to maintain quality standards yue2023mammoth,su2024nemotron. These synthesized QA pairs undergo comprehensive validation before being integrated into the cooldown dataset. For the multimodal part, in addition to the two strategies used in preparing the text cooldown data, i.e., question-answer synthesis and high-quality subset replay, we filter and rewrite a variety of academic visual and vision-language data sources into QA pairs to enable more comprehensive visual-centric perception and understanding li2024llavaonevisioneasyvisualtask,tong2024cambrian1fullyopenvisioncentric,guo2024mammothvlelicitingmultimodalreasoning. Unlike in the post-training stages, these language and multimodal QA pairs are included in the cooldown stage only to activate specific abilities and thereby facilitate learning from high-quality data; we therefore keep their ratio low to avoid overfitting to these QA patterns. The joint cooldown stage significantly improves both the language and multimodal abilities of the model.
Table 2: Needle-in-a-Haystack (NIAH) test on text/video haystacks, where needles are uniformly distributed at various positions within the haystack. We report recall accuracy across different haystack lengths up to 131,072 tokens (128K).
| Haystack (context length increases →) | | | | | | | up to 128K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 87.0 |
| Video | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 91.7 |
Joint Long-context Activation Stage
In the final pre-training stage, we extend the context length of the model from 8,192 (8K) to 131,072 (128K) tokens, increasing the base frequency of its RoPE su2023roformerenhancedtransformerrotary embeddings from 50,000 to 800,000. The joint long-context stage is conducted in two sub-stages, each of which quadruples the model’s context length (8K -> 32K -> 128K). For data composition, we filter long data and upsample its ratio to 25% in each sub-stage, while using the remaining 75% of tokens to replay the shorter data from the previous stage; our exploration confirms that this composition allows the model to effectively learn long-context understanding while maintaining short-context ability.
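Raising the RoPE base slows the rotation of the higher embedding dimensions, which is what lets positional encoding remain discriminative over much longer sequences. A quick sketch of the standard inverse-frequency computation (the head dimension of 128 is an illustrative choice, not a stated Kimi-VL parameter):

```python
def rope_inv_freq(base, dim):
    """Per-dimension-pair inverse frequencies of rotary embeddings: base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# With the base increased from 50,000 to 800,000, every non-trivial frequency
# shrinks, so positions rotate more slowly and more positions stay distinguishable.
f_before = rope_inv_freq(50_000, 128)
f_after = rope_inv_freq(800_000, 128)
```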
To activate long-context abilities on both pure-text and multimodal inputs, the long data used in Kimi-VL’s long-context activation consists not only of long text but also of long multimodal data, including long interleaved data, long videos, and long documents. As with the cooldown data, we also synthesize a small portion of QA pairs to improve the learning efficiency of long-context activation. After long-context activation, the model passes needle-in-a-haystack (NIAH) evaluations with either a long pure-text or a long video haystack, demonstrating its versatile long-context ability. We report the NIAH recall accuracy across context lengths up to 128K in Table 2.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: Kimi-VL Training Pipeline
### Overview
The image is a horizontal flowchart illustrating a three-stage training pipeline for an AI model named "Kimi-VL," culminating in a variant called "Kimi-VL-Thinking." The diagram uses a left-to-right flow with blue rectangular boxes representing distinct training phases, connected by arrows indicating the sequence and data flow.
### Components/Axes
The diagram consists of three primary blue boxes arranged horizontally, connected by arrows. Text is embedded within each box and along the connecting arrows.
1. **Leftmost Box (Stage 1):**
* **Position:** Far left.
* **Primary Label:** "Joint Supervised Fine-tuning"
* **Subtext (Line 1):** "Text + Multimodal SFT Data"
* **Subtext (Line 2):** "1 Epoch@32K + 1 Epoch@128K"
2. **First Connecting Arrow:**
* **Position:** Between the first and second boxes.
* **Label (Vertical Text):** "Kimi-VL"
3. **Middle Box (Stage 2):**
* **Position:** Center.
* **Primary Label:** "Long-CoT Supervised Fine-tuning"
* **Subtext (Line 1):** "Text + Multimodal Long-CoT Data"
* **Subtext (Line 2):** "Planning, Evaluation, Reflection, Exploration"
4. **Second Connecting Arrow:**
* **Position:** Between the second and third boxes.
* **Label:** No text on this arrow.
5. **Rightmost Box (Stage 3):**
* **Position:** Far right.
* **Primary Label:** "Reinforcement Learning (RL)"
* **Subtext (Line 1):** "Online RL on Answer Only"
* **Subtext (Line 2):** "Length penalty, Difficulty control"
6. **Final Output Arrow:**
* **Position:** Extending from the right side of the third box.
* **Label (Vertical Text):** "Kimi-VL-Thinking"
### Detailed Analysis
The diagram outlines a sequential, multi-stage training methodology:
* **Stage 1 - Joint Supervised Fine-tuning:** This initial phase uses a combined dataset of text and multimodal data for supervised fine-tuning (SFT). The training schedule is specified as one epoch on a 32K context length followed by one epoch on a 128K context length.
* **Stage 2 - Long-CoT Supervised Fine-tuning:** The model from Stage 1 ("Kimi-VL") undergoes a second supervised fine-tuning phase. This stage uses "Long-CoT" (Long Chain-of-Thought) data, which includes both text and multimodal examples. The training focuses on instilling reasoning capabilities described as "Planning, Evaluation, Reflection, Exploration."
* **Stage 3 - Reinforcement Learning (RL):** The model from Stage 2 is further refined using reinforcement learning. Key details are that it is "Online RL" (likely meaning updates are performed during interaction) and is applied "on Answer Only," suggesting the reward model or policy update focuses on the final answer quality. Training is guided by a "Length penalty" and "Difficulty control" mechanism.
* **Final Output:** The result of this three-stage pipeline is a model designated "Kimi-VL-Thinking."
### Key Observations
* The pipeline shows a clear progression from general supervised learning to specialized reasoning-focused training, and finally to optimization via reinforcement learning.
* The transition from "SFT Data" to "Long-CoT Data" indicates a deliberate shift in training data composition to foster complex reasoning.
* The RL stage's constraints ("Length penalty, Difficulty control") suggest an effort to balance answer quality with efficiency and to manage the complexity of training examples.
* The naming convention implies that "Kimi-VL-Thinking" is an enhanced version of the base "Kimi-VL" model, specifically endowed with advanced reasoning ("Thinking") capabilities through this pipeline.
### Interpretation
This diagram represents a sophisticated, contemporary approach to training large multimodal language models. The pipeline is designed to systematically build capabilities:
1. **Foundation:** Stage 1 establishes a broad base of knowledge and alignment using standard supervised fine-tuning on diverse data.
2. **Reasoning Specialization:** Stage 2 explicitly targets the development of chain-of-thought reasoning, a critical skill for complex problem-solving. The listed components (Planning, Evaluation, etc.) are hallmarks of advanced cognitive processes.
3. **Refinement and Optimization:** Stage 3 uses reinforcement learning to fine-tune the model's outputs based on reward signals, likely improving accuracy, helpfulness, and adherence to desired formats. The "online" aspect and specific penalties point to a dynamic and controlled training environment.
The overall flow suggests that achieving a model capable of sophisticated "Thinking" requires more than just feeding it data; it requires a structured curriculum that first teaches it *what* to know, then *how* to reason, and finally *how to optimize* its reasoning for specific goals. The explicit mention of multimodal data at each supervised stage underscores that this reasoning capability is intended to operate across text and visual inputs.
</details>
Figure 5: The post-training stages of Kimi-VL and Kimi-VL-Thinking, including two stages of joint SFT in 32K and 128K context, and further long-CoT SFT and RL stages to activate and enhance long thinking abilities.
### 2.4 Post-Training Stages
Joint Supervised Fine-tuning (SFT)
In this phase, we fine-tune the base model of Kimi-VL with instruction-based fine-tuning to enhance its ability to follow instructions and engage in dialogue, culminating in the interactive Kimi-VL model. This is achieved by employing the ChatML format (OpenAI, 2024), which allows for targeted instruction optimization while maintaining architectural consistency with Kimi-VL. We optimize the language model, MLP projector, and vision encoder using a mixture of pure-text and vision-language SFT data, described in Sec. 3.2. Supervision is applied only to answers and special tokens; system and user prompts are masked. The model is exposed to a curated set of multimodal instruction-response pairs, where explicit dialogue role tagging, structured injection of visual embeddings, and preservation of cross-modal positional relationships are ensured through format-aware packing. Additionally, to guarantee the model’s comprehensive proficiency in dialogue, we incorporate a mix of multimodal data and the pure-text dialogue data used in Moonlight, ensuring versatility across various dialogue scenarios.
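Answer-only supervision is typically implemented by masking prompt tokens out of the loss. A minimal sketch (the token ids and segment layout are illustrative, not Kimi-VL's actual ChatML tokenization; `-100` is the conventional ignored label in common cross-entropy implementations):

```python
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy losses

def build_labels(segments):
    """Build (input_ids, labels) from (token_ids, is_answer) segments.
    System/user prompt tokens receive IGNORE_INDEX so only answer tokens
    (and their special tokens) contribute to the SFT loss."""
    input_ids, labels = [], []
    for ids, is_answer in segments:
        input_ids.extend(ids)
        labels.extend(ids if is_answer else [IGNORE_INDEX] * len(ids))
    return input_ids, labels
```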
We first train the model at a sequence length of 32K tokens for one epoch, followed by another epoch at a sequence length of 128K tokens. In the first (32K) stage, the learning rate decays from $2\times 10^{-5}$ to $2\times 10^{-6}$ ; it then re-warms up to $1\times 10^{-5}$ in the second (128K) stage and finally decays to $1\times 10^{-6}$ . To improve training efficiency, we pack multiple training examples into each training sequence.
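The two-stage schedule can be written down explicitly. The endpoint values are taken from the text; the linear interpolation and the step counts are assumptions, since the report does not state the decay shape:

```python
def sft_lr(step, s1, warmup, s2):
    """Two-stage SFT learning rate: decay 2e-5 -> 2e-6 over the 32K stage
    (s1 steps), re-warmup to 1e-5 (warmup steps), then decay to 1e-6 over
    the remaining 128K-stage steps (s2). Linear segments are an assumption."""
    if step < s1:
        return 2e-5 + (2e-6 - 2e-5) * step / s1
    step -= s1
    if step < warmup:
        return 2e-6 + (1e-5 - 2e-6) * step / warmup
    step -= warmup
    return 1e-5 + (1e-6 - 1e-5) * min(step / s2, 1.0)
```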
Long-CoT Supervised Fine-Tuning
With the refined RL prompt set, we employ prompt engineering to construct a small yet high-quality long-CoT warmup dataset, containing accurately verified reasoning paths for both text and image inputs. This approach resembles rejection sampling (RS) but focuses on generating long-CoT reasoning paths through prompt engineering. The resulting warmup dataset is designed to encapsulate key cognitive processes that are fundamental to human-like reasoning, such as planning, where the model systematically outlines steps before execution; evaluation, involving critical assessment of intermediate steps; reflection, enabling the model to reconsider and refine its approach; and exploration, encouraging consideration of alternative solutions. By performing a lightweight SFT on this warm-up dataset, we effectively prime the model to internalize these multimodal reasoning strategies. As a result, the fine-tuned long-CoT model demonstrates improved capability in generating more detailed and logically coherent responses, which enhances its performance across diverse reasoning tasks.
Reinforcement Learning
To further advance the model’s reasoning abilities, we then train it with reinforcement learning (RL), enabling it to autonomously generate structured CoT rationales. Specifically, similarly to Kimi k1.5 team2025kimi, we adopt a variant of online policy mirror descent as our RL algorithm, which iteratively refines the policy model $\pi_{\theta}$ to improve its problem-solving accuracy. During the $i$ -th training iteration, we treat the current model as the reference policy and optimize the following objective, regularized by relative entropy to stabilize policy updates:
$$
\max_{\theta}\;\mathbb{E}_{(x,y^{*})\sim\mathcal{D}}\left[\mathbb{E}_{(y,z)\sim\pi_{\theta}}\left[r(x,y,y^{*})\right]-\tau\,\mathrm{KL}\!\left(\pi_{\theta}(x)\,\|\,\pi_{\theta_{i}}(x)\right)\right], \tag{1}
$$
where $r$ is a reward model that judges the correctness of the proposed answer $y$ for the given problem $x$ , assigning a value $r(x,y,y^{*})\in\{0,1\}$ based on the ground truth $y^{*}$ , and $\tau>0$ is a parameter controlling the degree of regularization.
Each training iteration begins by sampling a problem batch from the dataset $\mathcal{D}$ ; the model parameters are then updated to $\theta_{i+1}$ using the policy gradient derived from (1), with the optimized policy model subsequently serving as the reference policy for the next iteration. To enhance RL training efficiency, we implement a length-based reward that penalizes excessively long responses, mitigating the overthinking problem in which the model generates redundant reasoning chains. In addition, we employ two sampling strategies, curriculum sampling and prioritized sampling, which leverage difficulty labels and per-instance success rates to focus training effort on the most pedagogically valuable examples, thereby optimizing the learning trajectory and improving training efficiency.
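A length-shaped reward of the kind described can be sketched as follows. This is an illustrative shaping function only; the actual Kimi k1.5-style formulation and its weighting are not specified here, so the relative-length normalization and the penalty weight `w` are assumptions:

```python
def shaped_reward(correct, length, batch_min, batch_max, w=0.5):
    """Illustrative length-penalized reward: correct answers start from reward 1
    and lose up to w depending on how long they are relative to the batch;
    incorrect answers receive 0 regardless of length."""
    if batch_max == batch_min:
        rel = 0.0  # all responses equally long: no penalty signal
    else:
        rel = (length - batch_min) / (batch_max - batch_min)
    return (1.0 - w * rel) if correct else 0.0
```

The intended effect matches the text: among equally correct responses, shorter reasoning chains are preferred, discouraging overthinking.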
Through large-scale reinforcement learning training, we derive a model that harnesses the strengths of both basic prompt-based CoT reasoning and sophisticated planning-enhanced CoT approaches. During inference, the model maintains standard autoregressive sequence generation, eliminating the deployment complexities associated with specialized planning algorithms that require parallel computation. Simultaneously, the model develops essential meta-reasoning abilities, including error detection, backtracking, and iterative solution refinement, by effectively utilizing the complete history of explored reasoning paths as contextual information. By learning endogenously from its complete reasoning-trace history, the model can effectively encode planned search procedures into its parametric knowledge.
### 2.5 Infrastructure
Storage We utilize S3-compatible object storage amazon_s3 from cloud service vendors to store our visual-text data. To minimize the time between data preparation and model training, we store visual data in its original format and have developed an efficient and flexible data loading system. This system provides several key benefits:
- Supports on-the-fly data shuffling, mixing, tokenization, loss masking and packing during training, allowing us to adjust data proportions as needed;
- Enables random augmentation of both visual and text data, while preserving the correctness of 2D coordinate and orientation information during transformations;
- Ensures reproducibility by strictly controlling random states and other states across different data loader workers, guaranteeing that any interrupted training can be resumed seamlessly—the data sequence after resumption remains identical to an uninterrupted run;
- Delivers high-performance data loading: through multiple caching strategies, our system reliably supports training on large scale clusters while maintaining controlled request rates and throughput to the object storage.
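The reproducibility guarantee above (identical data order after a resume) hinges on deterministic, seed-controlled random state. A minimal sketch of the idea, not the actual Kimi-VL loader:

```python
import random

def resumable_stream(items, seed, start_index):
    """Deterministic shuffled stream that can resume mid-epoch: re-seeding with
    the same seed reproduces the same order, so restarting at start_index yields
    exactly the sequence an uninterrupted run would have produced."""
    order = list(items)
    random.Random(seed).shuffle(order)  # seeded RNG => identical order every run
    return order[start_index:]
```

In a production loader the same principle extends to per-worker RNG states and augmentation seeds, all checkpointed alongside the model.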
Additionally, to ensure consistent dataset quality control, we developed a centralized platform for data registration, visualization, compiling statistics, synchronizing data across cloud storage systems, and managing dataset lifecycles.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Screenshot: AI Assistant Interface Analyzing Handwritten Manuscripts
### Overview
The image is a screenshot of a graphical user interface (GUI) for an AI assistant. The interface is divided into two main vertical panels. The left panel, titled "Instruction," contains a user query and two embedded images of handwritten mathematical manuscripts. The right panel, titled "Response," displays the AI's detailed, step-by-step analysis and final conclusion regarding the authorship and content of those manuscripts. The overall aesthetic is clean and functional, with a light gray background and distinct panel borders.
### Components/Axes
The interface has no chart axes. Its primary components are:
1. **Left Panel ("Instruction"):**
* **Header:** Contains a pencil icon and the bold text "Instruction".
* **User Query Text:** "Please infer step by step who this manuscript belongs to and what it records".
* **Embedded Images:** Two rectangular images showing pages of aged, yellowish paper filled with handwritten mathematical equations and notes.
2. **Right Panel ("Response"):**
* **Header:** Contains a stylized "H" icon and the bold text "Response".
* **Main Content Area:** A large text block containing the AI's analysis. This is subdivided into:
* A `<Think>` section (enclosed in tags) detailing the internal reasoning process.
* A final, formatted answer section with bold headings ("Key Observations:", "Conclusion:", "Final Answer:").
### Detailed Analysis / Content Details
**A. Left Panel - Manuscript Images:**
* **Image 1 (Top):** Shows a page of handwritten notes. The equations involve partial derivatives (denoted by `∂`), summations (Σ), and variables like `g`, `M`, `T`. The term "Einheitsvektor" (German for "unit vector") is visible. The handwriting is cursive and somewhat dense.
* **Image 2 (Bottom):** Shows another page with more numerical calculations and algebraic manipulations. It includes integrals, references to constants labeled "K", "M", "T", and what appears to be a small graph or diagram in the lower right corner. The handwriting style is consistent with the first image.
**B. Right Panel - AI Response Text (Transcription of Key Sections):**
* **User Query:** "Please infer step by step who this manuscript belongs to and what it records"
* **AI's `<Think>` Section (Summary of Reasoning):**
* The handwriting suggests a mathematician or physicist from the 20th century.
* The content involves gravitational fields, celestial mechanics, tensor calculus, and field theory.
* German terms like "Einheitsvektor" and "Gln" (likely for "Gleichung"/equation) point to a German-speaking author.
* The style and content are consistent with Albert Einstein's work on general relativity and gravitational field equations.
* The two images likely represent different sections of the same research manuscript.
* **AI's Final Answer (Formatted Output):**
* **Assertion:** "The manuscripts in the images belong to **Albert Einstein**... The content... records his work on **gravitational field equations** and related mathematical derivations..."
* **Key Observations:**
1. **Handwriting Style:** Consistent with Einstein's known style.
2. **Content Analysis:** Equations involve partial derivatives, summations, Greek letters (γ, g), resembling Einstein's formulation of gravitational field equations. The second image shows detailed numerical derivations.
3. **German Terms:** Presence of "Einheitsvektor" and "Gln" suggests a German-speaking author.
* **Conclusion:** "These manuscripts are part of Einstein's research materials, documenting his mathematical and theoretical work on gravitational fields and field equations... a cornerstone of modern physics."
* **Final Answer:** "The manuscripts belong to Albert Einstein and record his work on gravitational field equations and mathematical derivations in general relativity."
### Key Observations
1. **Structured AI Reasoning:** The response explicitly shows a chain-of-thought process (`<Think>` tags) before delivering the final answer, demonstrating a step-by-step analytical approach.
2. **Multimodal Analysis:** The AI successfully integrates visual analysis (handwriting style, paper age) with textual and mathematical content analysis (equations, German terms) to form its conclusion.
3. **Specific Attribution:** The analysis does not merely suggest a field of study but makes a definitive attribution to a specific historical figure (Albert Einstein) based on correlating multiple lines of evidence.
4. **Content Focus:** The extracted information centers entirely on the *metadata* of the manuscripts (author, subject) rather than a full transcription of the complex mathematical equations themselves, which are described generically.
### Interpretation
This screenshot captures a meta-demonstration of an AI's capability to perform expert-level document analysis. The "data" here is not numerical but forensic and historical.
* **What it demonstrates:** The AI acts as a digital historian and physicist's assistant. It synthesizes paleographic clues (handwriting), linguistic analysis (German technical terms), and domain-specific knowledge (theoretical physics, general relativity) to authenticate and contextualize primary source documents.
* **Relationship between elements:** The user's open-ended query ("infer step by step") directly triggers the AI's structured, evidence-based reasoning process displayed in the response. The two manuscript images serve as the primary evidence, and the AI's text is the analytical report derived from that evidence.
* **Notable pattern:** The AI's conclusion is built on a convergence of independent indicators: the *language* (German), the *scientific content* (gravitational field equations), and the *physical artifact's style* (handwriting). This mirrors how a human expert would approach the problem, lending credibility to the output.
* **Underlying purpose:** The image showcases the AI's utility in academic and research contexts, specifically for digitizing, interpreting, and attributing historical scientific documents, potentially accelerating scholarship in the history of science.
</details>
Figure 6: Manuscript reasoning visualization. Kimi-VL-Thinking demonstrates the ability to perform historical and scientific inference by analyzing handwritten manuscripts step by step. In this example, our model identifies the author as Albert Einstein based on handwriting style, content analysis, and language cues. It reasons that the manuscripts relate to gravitational field equations, consistent with Einstein’s contributions to general relativity.
Parallelism We adopt a 4D parallelism strategy—Data Parallelism li2020pytorchdistributedexperiencesaccelerating, Expert Parallelism fedus2022switchtransformersscalingtrillion, Pipeline Parallelism huang2019gpipeefficienttraininggiant,narayanan2021efficientlargescalelanguagemodel, and Context Parallelism jacobs2023deepspeedulyssesoptimizationsenabling,liu2023ringattentionblockwisetransformers—to accelerate the training of Kimi-VL. After optimizing the parallel strategies, the resulting training throughput of our model is around 60% higher than that of a 7B dense VLM (e.g., VLMs based on Qwen2.5-7B).
- Data Parallelism (DP). DP replicates the model across multiple devices, each processing different micro-batches. This setup allows larger effective batch sizes by simply increasing the number of devices.
- Expert Parallelism (EP). EP distributes expert modules in the MoE layer across multiple devices. When combined with DP, experts on a given device can handle tokens from different DP groups, enhancing computational efficiency.
- Pipeline Parallelism (PP). PP splits the model into multiple layer-based stages. To minimize pipeline bubbles, we allocate the Vision Tower (VT) and several decoder layers to the first stage, place the output layer and additional decoder layers in the last stage, and distribute the remaining decoder layers evenly across intermediate stages based on their time overhead.
- Context Parallelism (CP). CP addresses long-sequence training by splitting sequences across different CP ranks in conjunction with flash attention dao2022flashattentionfastmemoryefficientexact. This substantially reduces peak memory usage and relieves the memory pressure from attention computations.
Beyond these four parallel strategies, we incorporate ZeRO1 rajbhandari2020zero and Selective Checkpointing Activation chen2016trainingdeepnetssublinear, korthikanti2022reducingactivationrecomputationlarge to further optimize memory usage. ZeRO1 reduces optimizer state overhead by using a distributed optimizer while avoiding extra communication costs. Selective Checkpointing Activation trades time for space by recomputing only those layers that have low time overhead but high memory consumption, striking a balance between computation efficiency and memory demands. For extremely long sequences, we expand recomputation to a broader set of layers to prevent out-of-memory errors.
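The selective-recomputation trade-off can be illustrated with a small greedy policy: recompute the layers that free the most activation memory per unit of recompute time, until peak memory fits the budget. The layer names, costs, and budget below are hypothetical:

```python
def select_recompute_layers(layers, mem_budget):
    """Pick layers to recompute during the backward pass, preferring
    high activation memory saved per unit of recompute time, until
    total activation memory fits `mem_budget`.
    `layers` maps name -> (recompute_time, activation_mem); all
    numbers are hypothetical and only illustrate the trade-off."""
    total_mem = sum(m for _, m in layers.values())
    chosen = []
    # Sort by memory saved per unit of recompute time, descending.
    for name, (t, m) in sorted(
        layers.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True
    ):
        if total_mem <= mem_budget:
            break
        chosen.append(name)   # recompute this layer instead of storing it
        total_mem -= m
    return chosen
```

For extremely long sequences, lowering `mem_budget` naturally expands the recomputed set, mirroring the behavior described above.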
## 3 Data Construction
### 3.1 Pre-Training Data
Our multimodal pre-training corpus is designed to provide high-quality data that enables models to process and understand information from multiple modalities, including text, images, and videos. To this end, we curated high-quality data from six categories – caption, interleaving, OCR, knowledge, video, and agent – to form the corpus.
When constructing our training corpus, we developed several multimodal data processing pipelines to ensure data quality, encompassing filtering, synthesis, and deduplication. Establishing an effective multimodal data strategy is crucial during the joint training of vision and language, as it both preserves the capabilities of the language model and facilitates alignment of knowledge across diverse modalities.
We provide a detailed description of these sources in this section, which is organized into the following categories:
Caption Data
Our caption data provides the model with fundamental modality alignment and a broad range of world knowledge. By incorporating caption data, the multimodal LLM gains wider world knowledge with high learning efficiency. We have integrated various open-source Chinese and English caption datasets like schuhmann2022laion, gadre2024datacomp and also collected substantial in-house caption data from multiple sources. However, throughout the training process, we strictly limit the proportion of synthetic caption data to mitigate the risk of hallucination stemming from insufficient real-world knowledge.
For general caption data, we follow a rigorous quality-control pipeline that avoids duplication and maintains high image-text correlation. We also vary image resolution during pre-training to ensure that the vision tower remains effective when processing both high- and low-resolution images.
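A minimal sketch of such a caption quality-control step, assuming precomputed perceptual hashes and image-text similarity scores (the field names and threshold are our own placeholders, not the report's pipeline):

```python
def filter_captions(samples, sim_threshold=0.28):
    """Drop duplicate images and weakly correlated image-text pairs.
    Each sample is a dict with two assumed precomputed fields:
    'image_hash' (e.g. a perceptual hash) and 'clip_sim' (an
    image-text similarity score). Illustrative only."""
    seen, kept = set(), []
    for s in samples:
        if s["image_hash"] in seen:
            continue              # near-duplicate image: skip
        if s["clip_sim"] < sim_threshold:
            continue              # weak image-text correlation: skip
        seen.add(s["image_hash"])
        kept.append(s)
    return kept
```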
Image-text Interleaving Data During the pre-training phase, the model benefits from interleaving data in several ways: it boosts multi-image comprehension, provides detailed knowledge grounded in the accompanying images, and extends the model’s multimodal in-context learning ability. We also find that interleaving data contributes positively to maintaining the model’s language abilities. Image-text interleaving data is therefore an important part of our training corpus. Our multimodal corpus draws on open-source interleaved datasets such as zhu2024multimodal,laurenccon2024obelics, and we also constructed large-scale in-house data from resources such as textbooks, webpages, and tutorials. Furthermore, we find that synthesizing interleaving data helps the multimodal LLM retain text knowledge. To ensure that each image’s knowledge is sufficiently learned, all interleaving data passes not only through standard filtering, deduplication, and other quality-control pipelines, but also through a data reordering procedure that keeps all images and text in the correct order.
OCR Data Optical Character Recognition (OCR) is a widely adopted technique that converts text in images into an editable format. In our model, a robust OCR capability is deemed essential for aligning the model with users’ needs. Accordingly, our OCR data sources are diverse, ranging from open-source to in-house datasets, encompassing both clean and augmented images, and spanning single-page and multi-page inputs.
In addition to the publicly available data, we have developed a substantial volume of in-house OCR datasets, covering multilingual text, dense text layouts, web-based content, and handwritten samples. Furthermore, following the principles outlined in OCR 2.0 wei2024general, our model is also equipped to handle a variety of optical image types, including figures, tables, geometry diagrams, mermaid plots, and natural scene text. We apply extensive data augmentation techniques—such as rotation, distortion, color adjustments, and noise addition—to enhance the model’s robustness. As a result, our model achieves a high level of proficiency in OCR tasks.
In addition to single-page OCR data, we collect and convert a large volume of in-house multi-page OCR data to activate the model’s understanding of long documents in the real world. With the help of these data, our model not only performs accurate OCR on a single image but can also comprehend an entire academic paper or a scanned book.
Knowledge Data The concept of multimodal knowledge data is analogous to the previously mentioned text pre-training data, except here we focus on assembling a comprehensive repository of human knowledge from diverse sources to further enhance the model’s capabilities. For example, carefully curated geometry data in our dataset is vital for developing visual reasoning skills, ensuring the model can interpret the abstract diagrams created by humans.
Our knowledge corpus adheres to a standardized taxonomy to balance content across various categories, ensuring diversity in data sources. Similar to text-only corpora, which gather knowledge from textbooks, research papers, and other academic materials, our multimodal knowledge data pipeline employs both a layout parser and an OCR model to process content from these sources. We also include filtered data from internet-based and other external resources.
Because a significant portion of our knowledge corpus is sourced from internet-based materials, infographics can cause the model to focus solely on OCR-based information. In such cases, relying exclusively on a basic OCR pipeline may limit training effectiveness. To address this, we have developed an additional pipeline that better captures the purely textual information embedded within images.
Agent Data For agent tasks, we significantly enhance the model’s grounding and planning capabilities. In addition to utilizing publicly available data, we established a platform to efficiently manage and execute virtual machine environments in bulk. Within these virtual environments, we employed heuristic methods to collect screenshots and corresponding action data, which we then processed into dense grounding formats and continuous trajectory formats. We categorized the design of the action space according to Desktop, Mobile, and Web environments. Furthermore, we collected icon data to strengthen the model’s understanding of the meanings of icons within software graphical user interfaces (GUIs). To enhance the model’s planning ability for solving multi-step desktop tasks, we collected a set of computer-use trajectories from human annotators, each accompanied by a synthesized Chain-of-Thought (Aguvis xu2024aguvis). These multi-step agent demonstrations equip Kimi-VL with the capability to complete real-world desktop tasks (on both Ubuntu and Windows).
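A schematic of an environment-specific action space along these lines might look as follows; the field names, action kinds, and validation rule are illustrative assumptions rather than the report’s actual format:

```python
from dataclasses import dataclass

@dataclass
class Action:
    """One step in an agent trajectory (hypothetical schema)."""
    env: str                    # "desktop" | "mobile" | "web"
    kind: str                   # e.g. "click", "type", "swipe"
    target: tuple               # normalized (x, y) grounding coordinate
    text: str = ""              # payload for "type" actions

# Per-environment action vocabularies (illustrative, not exhaustive).
ALLOWED = {
    "desktop": {"click", "double_click", "type", "scroll", "hotkey"},
    "mobile": {"tap", "swipe", "type"},
    "web": {"click", "type", "scroll", "select"},
}

def validate(action: Action) -> bool:
    """Check that the action kind is legal for its environment and
    that the grounding coordinate lies inside the unit square."""
    ok_kind = action.kind in ALLOWED.get(action.env, set())
    ok_xy = all(0.0 <= v <= 1.0 for v in action.target)
    return ok_kind and ok_xy
```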
Video Data In addition to image-only and image-text interleaved data, we also incorporate large-scale video data during pre-training, cooldown, and long-context activation stages to enable two directions of essential abilities of our model: first, to understand a long-context sequence dominated by images (e.g. hour-long videos) in addition to long text; second, to perceive fine-grained spatio-temporal correspondence in short video clips.
Our video data are sourced from diverse resources, including open-source datasets as well as in-house web-scale video data, and span videos of varying durations. Similarly, to ensure sufficient generalization ability, our video data cover a wide range of scenes and diverse tasks. We cover tasks such as video description and video grounding, among others. For long videos, we carefully design a pipeline to produce dense captions. Similar to processing the caption data, we strictly limit the proportion of the synthetic dense video description data to reduce the risk of hallucinations.
Text Data Our text pre-training corpus directly utilizes the data in Moonlight [liu2025muonscalablellmtraining], which is designed to provide comprehensive, high-quality data for training large language models (LLMs). It encompasses five domains: English, Chinese, Code, Mathematics & Reasoning, and Knowledge. We employ sophisticated filtering and quality-control mechanisms for each domain to ensure the highest-quality training data. For all pre-training data, we conducted rigorous individual validation of each data source to assess its specific contribution to the overall training recipe; this systematic evaluation ensures the quality and effectiveness of our diverse data composition. To optimize the overall composition of our training corpus, we determine the sampling strategy for different document types empirically through extensive experimentation. We conduct isolated evaluations to identify document subsets that contribute most significantly to the model’s knowledge acquisition, and upsample these high-value subsets in the final training corpus. However, to maintain data diversity and ensure model generalization, we carefully preserve a balanced representation of other document types at appropriate ratios. This data-driven approach helps us optimize the trade-off between focused knowledge acquisition and broad generalization.
†GPT-4o and GPT-4o-mini results on WindowsAgentArena use Omniparser without UIA, according to [bonatti2024windowsagentarenaevaluating].
| | Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Architecture | | - | - | Dense | Dense | Dense | MoE | MoE |
| # Act. Params ${}_{\text{(LLM+VT)}}$ | | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| # Total Params | | - | - | 8B | 11B | 12B | 28B | 16B |
| College-level | MMMU ${}_{\text{val}}$ (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| | VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
| | MMVU ${}_{\text{val}}$ (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| General | MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
| | MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
| | MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
| | RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
| | AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
| Multi-image | BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
| Math | MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
| | MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
| OCR | InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
| | OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
| OS Agent | ScreenSpot-V2 (Acc) | 18.1 | - | 86.8 | - | - | - | 92.8 |
| | ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
| | OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
| | WindowsAgentArena (Pass@1)† | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
| Long Document | MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
| Long Video | Video-MME (w/o sub. / w/ sub.) | 71.9/77.2 | 64.8/68.9 | 65.1/71.6 | 46.0/49.5 | 58.2/62.1 | - | 67.8/72.6 |
| | MLVU ${}_{\text{MCQ}}$ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
| | LongVideoBench ${}_{\text{val}}$ | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
| Video Perception | EgoSchema ${}_{\text{full}}$ | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
| | VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
| | TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |
Table 3: Performance of Kimi-VL against proprietary and open-source efficient VLMs; GPT-4o performance is listed in gray for reference. The best and second-best models are shown in boldface and underline, respectively. Some results of competing models are unavailable due to limitations in model capability on specific tasks or in model context length.
### 3.2 Instruction Data
At this stage, the data is primarily aimed at enhancing the model’s conversational abilities and instruction-following capabilities. To cover as many scenarios as possible, we enrich the data across different domains. For non-reasoning tasks, including chart interpretation, agent grounding, OCR, image-grounded conversations, question-answering, writing, and text processing, we initially construct a seed dataset through human annotation. This seed dataset is used to train a seed model. Subsequently, we collect a diverse set of prompts and employ the seed model to generate multiple responses to each prompt. Annotators then rank these responses and refine the top-ranked response to produce the final version. For reasoning tasks like visual coding, visual reasoning, and math/science problems, where rule-based and model-based verifications are more accurate and efficient than human judgment, we utilize rejection sampling to expand the SFT dataset. The complete vanilla SFT dataset comprises approximately a 1:1 ratio of text tokens to image tokens.
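The rank-and-refine step for non-reasoning data can be sketched as a best-of-n selection, where a scoring function stands in for annotator ranking or a reward model (purely illustrative; the real top-ranked response is then refined by annotators):

```python
def pick_best_response(prompt, responses, reward_fn):
    """Return the top-ranked response among seed-model samples.
    `reward_fn(prompt, response)` is a placeholder for human ranking
    or a reward model; this is a sketch, not the actual pipeline."""
    return max(responses, key=lambda r: reward_fn(prompt, r))
```

For reasoning tasks, the same loop would instead keep every response that passes a rule-based or model-based verifier (rejection sampling) rather than only the top one.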
<details>
<summary>x7.png Details</summary>

### Visual Description
## Composite Image: Instruction-Response Pairs for Visual Question Answering
### Overview
The image is a composite graphic displaying three distinct examples of a visual question-answering (VQA) system. Each example is presented in a separate, horizontally-aligned panel. Each panel is divided into two main sections: a left-hand "Instruction" box containing a user query and an associated image, and a right-hand "Response" box containing an AI-generated answer. The overall layout is clean and structured, with a light gray background and blue borders defining the instruction boxes.
### Components/Axes
The image is segmented into three primary rows or panels. Each panel contains:
1. **Instruction Box (Left):** A light blue rectangle with a pencil icon in the top-left corner. It contains the text "Instruction" in bold, followed by a user's question and an embedded image.
2. **Response Box (Right):** A white rectangle with a black "H" logo in the top-left corner. It contains the text "Response" in bold, followed by a paragraph of AI-generated text.
3. **Embedded Images:** Each instruction box contains a specific image relevant to the question.
### Detailed Analysis
**Panel 1 (Top):**
* **Instruction Text:** "Which of the four subgraphs in the figure is taken at the same place as subgraph 1?"
* **Embedded Image:** An aerial photograph of a dense urban area. Key features include a mix of historical and modern buildings, green spaces (parks/trees), and a distinctive large, circular structure with a white roof (possibly a dome or observatory) located in the lower-left quadrant of the image.
* **Response Text:** The response analyzes four hypothetical sub-images (labeled 1-4). It describes Image 1 (the provided aerial photo) and compares it to descriptions of Images 2, 3, and 4.
* Image 4 is described as having a similar dense urban environment, a large building complex with a central courtyard, and a circular structure resembling the one in Image 1.
* Image 3 is noted to have a church spire and a more spread-out layout.
* Image 2 is described as suburban with larger plots.
* **Conclusion in Response:** "Conclusion: Image 4 is the most likely candidate as it matches the urban density, building types, and the circular structure seen in image 1."
* **Final Answer in Response:** "Answer: The fourth sub-picture (image 4) is in the same place as the first picture."
**Panel 2 (Middle):**
* **Instruction Text:** "What is the dome building in the picture?"
* **Embedded Image:** A wide-angle cityscape photograph taken at dusk or dawn. The sky is filled with dramatic, colorful clouds. The most prominent features are a very tall, slender communications and observation tower (the CN Tower) on the right side and a large, white, domed stadium (the Rogers Centre) to its left. A multi-lane highway with streaks of car lights runs through the foreground. Several modern high-rise buildings are visible on the left.
* **Response Text:** "The dome building in the image is the Rogers Centre, a multi-purpose stadium in Toronto, Canada. It is recognizable by its distinctive retractable roof and is a landmark in the city's skyline, often visible alongside the CN Tower (the tall spire in the background). The Rogers Centre hosts various events, including sports games, concerts, and conventions."
**Panel 3 (Bottom):**
* **Instruction Text:** "Where am I ?"
* **Embedded Image:** A first-person perspective screenshot from a video game. The scene is a dimly lit, futuristic bar or club interior. The aesthetic is "cyberpunk," characterized by neon green and blue lighting, holographic displays, and industrial architecture. A bar counter with stools is in the foreground. Several non-player characters (NPCs) with cybernetic augmentations are present. A heads-up display (HUD) element is visible in the top-right corner, showing a mini-map and the text "Sit next to Jackie". The game's title, "CYBERPUNK 2077," is faintly visible as a watermark in the upper-left area of the scene.
* **Response Text:** "You are in Cyberpunk 2077, a open-world action role-playing game set in Night City (year 2077). The image shows a futuristic bar or club within the game's cyberpunk-themed environment, likely a mission or social interaction location (e.g., "Sit next to Jackie" is visible in the HUD). The setting features neon lights, holographic displays, and characters with cybernetic enhancements, typical of the game's aesthetic."
### Key Observations
1. **Structured Format:** The image demonstrates a consistent UI pattern for presenting VQA tasks: a clear separation between the user's multimodal input (text + image) and the AI's textual output.
2. **Task Diversity:** The three examples showcase different types of visual understanding:
* **Panel 1:** Spatial reasoning and comparative analysis across multiple images (though only one is shown).
* **Panel 2:** Landmark identification and factual knowledge retrieval based on a real-world photograph.
* **Panel 3:** Contextual scene understanding and environment identification from a synthetic (video game) screenshot, including reading in-game HUD text.
3. **Response Detail:** The AI responses are not just simple answers but include explanatory reasoning, descriptive details about the visual content, and contextual information (e.g., the function of the Rogers Centre, the setting of Cyberpunk 2077).
4. **Visual Elements in Text:** The responses directly reference and describe specific visual elements from the images (e.g., "circular structure," "retractable roof," "Sit next to Jackie" in the HUD), demonstrating the system's ability to ground its language in the provided visual data.
### Interpretation
This composite image serves as a demonstration or showcase of a multimodal AI system's capabilities. It illustrates how the system can process diverse visual inputs—aerial photography, real-world cityscapes, and synthetic game environments—and generate coherent, informative, and contextually appropriate textual responses to natural language questions.
The progression from Panel 1 to Panel 3 shows an increase in the complexity of the required understanding:
* Panel 1 requires **comparative visual analysis**.
* Panel 2 requires **specific entity recognition** linked to a knowledge base.
* Panel 3 requires **holistic scene interpretation** within a specific fictional context, including parsing embedded text (the HUD).
The consistent "Instruction/Response" framing suggests this is likely a figure from a technical paper, product documentation, or a marketing piece designed to highlight the AI's proficiency in visual question answering across various domains. The inclusion of the "H" logo in the response boxes indicates this is a branded output from a specific AI model or platform. The image effectively communicates that the AI doesn't just "see" images but can reason about them, retrieve relevant knowledge, and communicate its findings in a human-readable format.
</details>
Figure 7: Kimi-VL exhibits strong visual reasoning capabilities by grounding visual content in spatial, contextual, and cultural knowledge. It accurately identifies matching urban locations based on structural and layout features, interprets scenes from video games like Cyberpunk 2077 using stylistic cues, and recognizes real-world landmarks such as the Rogers Centre in Toronto.
### 3.3 Reasoning Data
Our reasoning data is meticulously constructed to activate and enhance the model’s multimodal reasoning capabilities during both the long-CoT supervised fine-tuning and reinforcement learning stages. By developing a generation pipeline based on rejection sampling (RS) and prompt engineering, we collect and synthesize a large amount of high-quality long-CoT data. Specifically, we first assemble a collection of QA data with ground-truth annotations that require multi-step reasoning, such as mathematical problem-solving and domain-specific VQA. We then sample multiple detailed reasoning trajectories for each question by leveraging a powerful long-CoT model, Kimi k1.5 team2025kimi, with curated reasoning prompts. In rejection sampling, we feed the true labels and model predictions into an off-the-shelf reward model for judgment. Incorrect chain-of-thought responses are filtered out according to this model evaluation as well as rule-based rewards, thus improving the reasoning data quality.
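A minimal sketch of this RS-style filtering, with a placeholder `judge` in place of the off-the-shelf reward model and optional rule-based checks (all names here are our own, for illustration):

```python
def rejection_sample(question, answer, trajectories, judge, rule_checks=()):
    """Keep only chain-of-thought trajectories that the judge accepts
    against the ground-truth answer and that pass every rule-based
    check. `judge` and `rule_checks` are placeholders for the reward
    model and rule-based rewards described in the text."""
    kept = []
    for traj in trajectories:
        if not judge(question, answer, traj):
            continue                      # wrong final answer: reject
        if any(not check(traj) for check in rule_checks):
            continue                      # fails a rule-based reward: reject
        kept.append(traj)
    return kept
```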
## 4 Evaluation
We begin by presenting our comprehensive model and conducting a comparative analysis with leading state-of-the-art (SoTA) solutions. Following this introduction, we proceed to assess various sub-capabilities of the model through detailed performance evaluations. This part examines how effectively the model handles different tasks and scenarios, providing insights into its strengths and limitations across diverse functional domains.
### 4.1 Comparison to the State-of-the-Art Models
Table 3 presents a comprehensive evaluation of Kimi-VL against state-of-the-art vision-language models across multiple benchmarks. Despite having a more parameter-efficient architecture (2.8B+0.4B activated parameters) than larger models such as GPT-4o, Llama-3.2-11B-Inst., and Gemma3-12B-IT, Kimi-VL demonstrates competitive or superior performance in several key areas. Our model employs a Mixture-of-Experts (MoE) architecture similar to DeepSeek-VL2 but outperforms it on most benchmarks with significantly fewer parameters (activated: 2.8B vs. 4.5B; total: 16B vs. 28B); it also outperforms Qwen2.5-VL-7B (actually 8.3B) on 19 out of 24 benchmarks, even though the latter has 2.59 $\times$ more activated parameters. The following sections analyze performance across specific domains, revealing Kimi-VL’s strengths in OCR, math, agent tasks, long-form content understanding, multi-image understanding, and video perception.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Geometry Problem and Solution: Circle Angle Calculation
### Overview
The image displays a two-part educational interface. The top section presents a geometry problem with a diagram and a multiple-choice question in Chinese. The bottom section provides a detailed, step-by-step solution to the problem, also in Chinese. The content is presented within a clean, digital learning platform layout.
### Components/Axes
The image is divided into two main rectangular containers:
1. **Top Container (Instruction):**
* **Header:** Contains a pencil icon and the English word "Instruction".
* **Sub-header:** Contains the Chinese text "回答问题" (Answer the question).
* **Diagram:** A circle with center `O`. A horizontal line segment `AB` passes through `O`, making it the diameter. Points `C` and `D` are on the circle's circumference. Lines connect `A` to `C`, `C` to `B`, `C` to `D`, and `D` to `B`. The angle at point `D` (∠CDB) is labeled.
* **Problem Statement (Chinese):** "如图,AB为⊙O的直径,点D、C在⊙O上,∠D=62°,则∠ACO的度数为() A. 26° B. 28° C. 30° D. 32°"
* **English Translation of Problem:** "As shown in the figure, AB is the diameter of circle O, points D and C are on circle O, ∠D = 62°. Then the degree measure of ∠ACO is (). A. 26° B. 28° C. 30° D. 32°"
2. **Bottom Container (Response):**
* **Header:** Contains an "H+" icon and the English word "Response".
* **Solution Text:** A complete, logically structured solution written in Chinese, using mathematical notation and equations.
### Detailed Analysis / Content Details
**Problem Statement (Transcribed from Image):**
* **Given:** AB is the diameter of circle O. Points C and D lie on the circumference of circle O. The measure of angle D (∠CDB) is 62°.
* **To Find:** The measure of angle ACO (∠ACO).
* **Options:** A. 26°, B. 28°, C. 30°, D. 32°.
**Solution Steps (Transcribed and Translated from Image):**
The solution proceeds in three main steps:
1. **Analyze Known Conditions:**
* "AB是直径, 所以∠ACB = 90° (圆周角定理)。" -> "AB is the diameter, so ∠ACB = 90° (Inscribed Angle Theorem)."
* "∠D = 62°。" -> "∠D = 62°."
2. **Apply the Inscribed Angle Theorem:**
* "∠AOC是圆心角, 等于2倍的圆周角∠D。" -> "∠AOC is a central angle, equal to 2 times the inscribed angle ∠D."
* "∠AOC = 2 × 62° = 124°。" -> "∠AOC = 2 × 62° = 124°."
3. **Calculate ∠ACO:**
* "∠ACO是三角形AOC的内角。" -> "∠ACO is an interior angle of triangle AOC."
* "∠ACO + ∠CAO + ∠AOC = 180°。" -> "∠ACO + ∠CAO + ∠AOC = 180°."
* "因为OA = OC (半径相等), 所以∠CAO = ∠ACO。" -> "Because OA = OC (radii are equal), therefore ∠CAO = ∠ACO."
* "设∠ACO = x, 则:" -> "Let ∠ACO = x, then:"
* `x + x + 124° = 180°`
* `2x + 124° = 180°`
* `2x = 56°`
* `x = 28°`
* "因此, ∠ACO的度数为28°。答案是B。" -> "Therefore, the degree measure of ∠ACO is 28°. The answer is B."
### Key Observations
* The solution correctly identifies and applies two fundamental circle theorems: the **Inscribed Angle Theorem** (angle at the center is twice the angle at the circumference subtended by the same arc) and the property that **angles subtended by the same arc are equal**.
* The problem cleverly uses the given inscribed angle ∠D (62°) to find the central angle ∠AOC (124°), which is then used in the isosceles triangle ΔAOC (OA=OC=radius) to find the unknown base angle ∠ACO.
* The multiple-choice options are closely spaced (26°, 28°, 30°, 32°), requiring precise calculation.
### Interpretation
This image is a snapshot of a digital math tutoring or assessment system. It demonstrates a classic geometry problem designed to test a student's understanding of circle theorems and their ability to chain logical steps.
The **Peircean investigative reading** reveals:
* **Sign (Diagram & Text):** The diagram is an iconic sign representing the geometric configuration. The text is a symbolic sign stating the problem.
* **Object (Mathematical Relationship):** The underlying object is the set of immutable geometric relationships within a circle (diameter, inscribed angles, central angles, isosceles triangles).
* **Interpretant (Solution Process):** The solution is the interpretant, a rule or habit of reasoning that connects the sign to the object. It shows how to transform the given information (∠D=62°) into the desired information (∠ACO=28°) using established mathematical laws.
The solution's structure is pedagogical, breaking the problem into "Analyze," "Apply Theorem," and "Calculate" phases. This not only provides the answer but also reinforces the underlying concepts and problem-solving methodology. The final answer, **28° (Option B)**, is derived through necessary and deductive reasoning, leaving no room for ambiguity given the initial conditions.
</details>
Figure 8: Kimi-VL demonstrates its capability to perform symbolic reasoning and geometric inference by solving a circle geometry problem step by step. The model analyzes given conditions, applies geometric theorems such as the inscribed angle theorem and properties of triangle angles, and accurately derives the target angle.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Composite Image: Three AI Interaction Examples
### Overview
The image is a composite of three vertical panels, each demonstrating a different type of AI task: data extraction from a financial table, mathematical formula transcription to LaTeX, and handwriting recognition/transcription of a Chinese essay. Each panel follows a consistent structure: an "Instruction" header with a user prompt, followed by a "Response" section showing the AI's output.
### Components/Axes
The image is divided into three distinct vertical panels, each with a light blue header labeled "Instruction" and a white content area below for the "Response".
**Left Panel:**
* **Instruction:** "Parse the complete table and output it in markdown table format"
* **Content:** A screenshot of a financial table titled "Sparkling Smiles Clinic Ratio Analysis" with data from 2024 to 2031.
* **Response:** A markdown-formatted version of the same table.
**Middle Panel:**
* **Instruction:** "Please generate the latex code of the formula in the image."
* **Content:** A complex mathematical formula involving probability distributions and parameters.
* **Response:** The corresponding LaTeX code and a rendered version of the formula.
**Right Panel:**
* **Instruction:** "识别这篇作文的内容" (Language: Chinese. Translation: "Identify the content of this essay.")
* **Content:** A photograph of a handwritten Chinese essay on lined paper.
* **Response:** A transcription of the essay's content in Chinese.
### Detailed Analysis
#### **Left Panel: Financial Table Extraction**
**Instruction Content (Source Table):**
The table is titled "Sparkling Smiles Clinic Ratio Analysis". It contains multiple sub-tables with the following sections and data (years 2024-2031):
1. **Growth**
* Row: `Sales Growth`
* Values (2024-2031): `30%`, `7%`, `7%`, `7%`, `7%`, `7%`, `7%`, `7%`
2. **Percent of Total Assets**
* Rows: `Inventory`, `Other Current Assets`, `Total Current Assets`, `Long-term Assets`, `Total Assets`, `Current Liabilities`, `Long-term Liabilities`, `Total Liabilities`, `Net Worth`
* Example (2024): `17%`, `3%`, `52%`, `48%`, `100%`, `4%`, `4%`, `8%`, `92%`
3. **Percent of Sales**
* Rows: `Gross Margin`, `Selling, General & Administrative Expenses`, `Advertising Expenses`, `Profit Before Interest and Taxes`
* Example (2024): `72%`, `100%`, `2%`, `0%`
4. **Main Ratios**
* Rows: `Current Ratio`, `Quick Ratio`, `Total Debt to Total Assets`, `Pre-tax Return on Net Worth`, `Pre-tax Return on Assets`
* Example (2024): `4.42`, `-0.23`, `0.48`, `-4%`, `-2%`
5. **Additional Ratios**
* Rows: `Net Profit Margin`, `Return on Equity`
* Example (2024): `-1%`, `-4%`
6. **Activity Ratios**
* Row: `Inventory Turnover`
* Values (2024-2031): `78.41`, `61.54`, `61.54`, `61.54`, `61.54`, `61.54`, `61.54`, `61.54`
**Response Content (Markdown Output):**
The AI's response is a structured markdown representation of the above data, organized under headers like `## Growth`, `## Percent of Total Assets`, etc. The numerical data matches the source table.
#### **Middle Panel: Mathematical Formula Transcription**
**Instruction Content (Source Formula):**
The formula is a probabilistic model, likely from machine learning (e.g., diffusion models). It defines a distribution `q(x_{t-1} | x_t, x_0)` as a Normal distribution `N(μ_q, Σ_q)`. The mean `μ_q` and covariance `Σ_q` are defined by complex expressions involving parameters `α_t`, `ᾱ_t`, `β_t`, and vectors `x_t`, `x_0`, and the identity matrix `I`.
**Response Content (LaTeX Code & Render):**
The AI provides the LaTeX code within a `latex` code block. The code uses the `align*` environment and defines the formula across multiple lines. Below the code, a "Rendered formula" section shows the properly typeset mathematical equation, which visually matches the source image.
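For reference, the formula described matches the standard forward-process posterior of a Gaussian diffusion model (DDPM); in common notation, with $\beta_t = 1 - \alpha_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, it reads:

```latex
\begin{align*}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\!\left(x_{t-1};\, \mu_q(x_t, x_0),\, \Sigma_q(t)\right), \\
\mu_q(x_t, x_0) &= \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})\, x_t
  + \sqrt{\bar{\alpha}_{t-1}}\, \beta_t\, x_0}{1 - \bar{\alpha}_t}, \\
\Sigma_q(t) &= \frac{(1 - \bar{\alpha}_{t-1})\, \beta_t}{1 - \bar{\alpha}_t}\, I.
\end{align*}
```

Whether the source image uses exactly this parameterization cannot be confirmed from the description alone, but the symbols listed (`α_t`, `ᾱ_t`, `β_t`, `x_t`, `x_0`, `I`) are consistent with it.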
#### **Right Panel: Handwriting Recognition (Chinese)**
**Instruction Content (Source Image):**
A photograph of a handwritten essay on grid paper. The handwriting is in Chinese characters. The text appears to be a personal letter or reflection.
**Response Content (Transcription & Translation):**
The AI provides a direct transcription of the Chinese text. The language is explicitly identified as Chinese.
**Transcribed Chinese Text:**
得勤快,我会练字。我就是我自己的手机毒霸,管好我自己,少做傻事情哈。其实聪明人也可以很听话的,至少在没有长大以前。
问你能不能不跟你爸爸切北京啊,“我不切,一个人在成都你养我”“我养你啊”,哎,想到就心酸,等着吧。
好了,我不写了。你,要好好的,要切煮饭了,成都天气也凉了,北京也一样吧。多穿衣服多喝热水好了,我们就到这。
以后再遇到起:
好久不见。
你好吗?
我很好!
@六年级二班 - 王乐乐
**English Translation (Provided by AI):**
Be diligent; I will practice my handwriting. I'll be my own phone antivirus, keep myself in check, and do fewer silly things. Actually, smart people can be very obedient too, at least before they grow up.
Asking you whether you could not go to Beijing with your dad: "I won't go; if I stay alone in Chengdu, will you support me?" "I'll support you." Sigh, it's heartbreaking to think about; just wait.
Okay, I'll stop writing. You take care of yourself and go cook now; the weather in Chengdu has cooled down, and Beijing is probably the same. Wear more clothes and drink more hot water. Let's end here.

If we meet again in the future:
Long time no see.
How are you?
I'm very good!
@Grade 6 Class 2 - Wang Lele
### Key Observations
1. **Task Diversity:** The composite image showcases three fundamentally different AI capabilities: structured data parsing, symbolic math processing, and optical character recognition (OCR) for handwritten text.
2. **Fidelity:** In all three cases, the AI's response appears to be a high-fidelity reproduction of the source information. The markdown table preserves all data points, the LaTeX code accurately represents the complex formula, and the Chinese transcription matches the visible handwriting.
3. **Layout:** The consistent "Instruction/Response" panel format suggests these are examples from a user interface or a demonstration of an AI assistant's functionality across modalities.
4. **Language Handling:** The right panel explicitly handles non-English (Chinese) input and provides both transcription and translation, demonstrating multilingual capability.
### Interpretation
This composite image serves as a demonstration of a multimodal AI assistant's core competencies in information extraction and transformation. It highlights the system's ability to:
* **Parse and reformat structured data** (financial table to markdown), which is crucial for data analysis and reporting workflows.
* **Interpret and digitize complex symbolic notation** (mathematical formula to LaTeX), essential for academic, scientific, and technical documentation.
* **Recognize and transcribe unstructured, handwritten content** (Chinese essay), bridging the gap between physical documents and digital text, with added translation for cross-lingual understanding.
The underlying theme is the conversion of information from one human-readable or machine-readable format to another, preserving semantic meaning. This capability is foundational for building tools that can interact with the diverse ways information exists in the real world—whether in spreadsheets, textbooks, or personal notes. The image implicitly argues for the utility of such an AI as a universal translator and digitizer across different domains of knowledge.
</details>
Figure 9: Diverse OCR visualization. Kimi-VL demonstrates strong OCR capabilities across varied content types, including structured financial tables, complex mathematical formulas, and handwritten Chinese text. The model accurately parses tabular data into markdown, converts formulas to LaTeX, and transcribes handwritten paragraphs with contextual understanding, showcasing its versatility in multimodal text extraction and interpretation.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Screenshot: Step-by-Step Guide to Enabling 'Do Not Track' in Chrome
### Overview
The image is a composite instructional graphic detailing a 12-step process for enabling the "Do Not Track" feature in the Google Chrome web browser. It consists of a header with the main instruction and a two-column layout below containing numbered steps. Each step includes a screenshot of the Chrome interface, a "Thought" paragraph explaining the reasoning, an "Action" instruction, and a "Toolcall" with precise click coordinates or a scroll command.
### Components/Axes
The image is segmented into two primary regions:
1. **Header (Top Section):** Contains a pencil icon and the primary user instruction.
2. **Instructional Steps (Bottom Section):** A grid of 12 steps, arranged in two columns (Steps 1-6 on the left, Steps 7-12 on the right). Each step is a self-contained block with a screenshot and associated text.
**Textual Elements Present:**
* **Header Instruction:** "Instruction: Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?"
* **Step Labels:** "Step 1" through "Step 12".
* **Thought Process Text:** Detailed reasoning for each action.
* **Action Instructions:** Direct commands for what to click or do.
* **Toolcall Commands:** Programmatic actions with `click(x=..., y=...)` or `scroll(...)` parameters.
* **UI Text in Screenshots:** Visible text within the Chrome browser interface, including menu items ("Settings", "Privacy and security", "Security", "Third-party cookies", "Manage HTTPS/SSL certificates and settings", "Send a 'Do Not Track' request with your browsing traffic"), button labels ("Cancel", "Confirm"), and the "Advanced" section header.
### Detailed Analysis
The process navigates through Chrome's settings hierarchy. The following is a complete transcription of the instructional text for each step.
**Step 1**
* **Thought:** ... To enable the 'Do Not Track' feature in Chrome, I need to access Chrome's settings. Looking at the interface, I can see the vertical ellipsis menu button in the top-right corner of the browser window. This menu button is the standard way to access Chrome's settings and other features. ...
* **Action:** Click on the vertical ellipsis menu button in the top-right corner of the Chrome window to open the browser menu.
* **Toolcall:** `click(x=0.884, y=0.144)`
**Step 2**
* **Thought:** ... The next logical step would be to access Chrome's settings, where privacy options like "Do Not Track" are typically located. Looking at the menu that's currently open, I can see the "Settings" option at the bottom of the menu. This would be the most direct path to access privacy settings. ...
* **Action:** Click on the "Settings" option in the Chrome menu to access the browser's settings page.
* **Toolcall:** `click(x=0.744, y=0.851)`
**Step 3**
* **Thought:** ... Now, I need to find where the "Do Not Track" setting is located. Based on my knowledge of Chrome's settings, this feature is typically found in the Privacy and security section. I can see that "Privacy and security" is one of the options in the left sidebar menu. ...
* **Action:** Click on the "Privacy and security" option in the left sidebar menu.
* **Toolcall:** `click(x=0.386, y=0.311)`
**Step 4**
* **Thought:** ... Now we need to find the "Do Not Track" setting. Looking at the current screen, I can see that "Do Not Track" is not immediately visible in the list of options. However, I notice that there's a "Security" option with a lock icon that might contain privacy-related settings including "Do Not Track". ...
* **Action:** Click on the "Security" option in the Privacy and security section to access more detailed security settings.
* **Toolcall:** `click(x=0.522, y=0.884)`
**Step 5**
* **Thought:** ... To find the "Do Not Track" setting, I need to scroll down in the Security section. The current view doesn't show all available options, and the "Do Not Track" setting is likely further down the page. Scrolling down will reveal more settings options that aren't currently visible. ...
* **Action:** Scroll down in the Security settings page to reveal more options.
* **Toolcall:** `scroll(-5)`
**Step 6**
* **Thought:** ... The "Manage HTTPS/SSL certificates and settings" option is visible in the list. This is likely where the "Do Not Track" setting would be located, as certificate management is related to privacy and security features. ...
* **Action:** Click on the "Manage HTTPS/SSL certificates and settings" option to access certificate management settings.
* **Toolcall:** `click(x=0.527, y=0.883)`
**Step 7**
* **Thought:** ... I notice that we're currently on the "Manage certificates" page, which is not where the "Do Not Track" setting would be located. ... To enable the "Do Not Track" feature, we need to navigate to the appropriate settings section. ...
* **Action:** Click on the "Privacy and security" option in the left sidebar menu to expand its submenu.
* **Toolcall:** `click(x=0.365, y=0.317)`
**Step 8**
* **Thought:** ... I can see that "Security" is one of the options in the Privacy and security section. This makes sense as "Do Not Track" is a security feature that helps protect user privacy by preventing websites from tracking browsing activity. ...
* **Action:** Click on the "Security" option in the Privacy and security section to access more detailed security settings.
* **Toolcall:** `click(x=0.522, y=0.884)`
**Step 9**
* **Thought:** ... Since the current page doesn't show the "Do Not Track" setting, the logical next action would be to continue navigating through the settings. The back button in the browser would allow the user to return to the previous page and continue their search for the "Do Not Track" setting. ...
* **Action:** Click the back button in the browser to return to the previous page.
* **Toolcall:** `click(x=0.486, y=0.242)`
**Step 10**
* **Thought:** ... Looking at the available options, "Third-party cookies" is visible in the list. This option controls whether third-party cookies are blocked in Incognito mode. While this isn't exactly "Do Not Track," it is related to privacy controls. ...
* **Action:** Click on the "Third-party cookies" option in the Privacy and security section to access its settings.
* **Toolcall:** `click(x=0.542, y=0.772)`
**Step 11**
* **Thought:** ... I notice there's an "Advanced" section in the current view that contains a "Send a 'Do Not Track' request with your browsing traffic" option. This is exactly what we need to enable the "Do Not Track" feature. The toggle switch for this option is currently off. ...
* **Action:** Click the toggle switch next to "Send a 'Do Not Track' request with your browsing traffic" to enable the feature.
* **Toolcall:** `click(x=0.808, y=0.871)`
**Step 12**
* **Thought:** ... The dialog explains what "Do Not Track" does and has two buttons: "Cancel" and "Confirm". To complete the task of enabling this feature, we need to confirm the action by clicking the "Confirm" button. ...
* **Action:** Click the "Confirm" button to enable the "Do Not Track" feature.
* **Toolcall:** `click(x=0.708, y=0.669)`
### Key Observations
1. **Non-Linear Navigation:** The process is not a straight path. Steps 6-9 show a detour into certificate management and the use of the back button, indicating a search or correction in the navigation flow.
2. **Location of Setting:** The "Do Not Track" toggle is ultimately found under **Privacy and security > Third-party cookies > Advanced section**.
3. **Confirmation Required:** Enabling the feature triggers a confirmation dialog (Step 12), requiring an explicit user confirmation.
4. **Precise Coordinates:** Each click action is accompanied by normalized coordinates (x, y values between 0 and 1), suggesting this guide may be derived from or intended for an automated testing or recording tool.
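Observation 4 notes that each toolcall uses normalized coordinates. A minimal sketch of how such a call would map to actual pixel positions (the helper function is illustrative, not the actual tool API):

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalized (0-1) screen coordinates to integer pixel coordinates."""
    return round(x * width), round(y * height)

# Step 1's toolcall, click(x=0.884, y=0.144), on an assumed 1920x1080 screen:
px, py = to_pixels(0.884, 0.144, 1920, 1080)
```

Normalizing to the unit square keeps the recorded actions resolution-independent, which is exactly what an automated testing or replay tool would want.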
### Interpretation
This image serves as a technical walkthrough for a specific browser configuration task. It demonstrates the hierarchical and sometimes non-intuitive structure of modern software settings menus. The inclusion of "Thought" processes provides pedagogical value, explaining the *why* behind each action, which is crucial for user understanding beyond rote memorization.
The guide reveals that the "Do Not Track" feature, while a privacy setting, is categorized under security and cookie management within Chrome's interface. The detour in steps 6-9 highlights a common user experience of exploring related but incorrect sub-menus before finding the correct path. The final confirmation dialog underscores the importance browsers place on ensuring users understand the implications of enabling such features, as "Do Not Track" is a request to websites and not a guaranteed block. The toolcall coordinates imply this documentation could be used to script or automate the process for testing or deployment purposes.
</details>
Figure 10: Kimi-VL is capable of following multi-step reasoning processes to complete complex GUI tasks. In this example, it successfully enables the “Do Not Track” feature in the Chrome browser to enhance online privacy. The agent interprets each screen, identifies relevant UI elements, and performs the appropriate actions sequentially with clear thoughts, actions, and API calls.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Screenshot: Video Scene Analysis Interface
### Overview
The image displays a software interface or document split into two primary panels: an "Instruction" panel on the left and a "Response" panel on the right. The interface presents a task to analyze a video and provides a detailed, scene-by-scene breakdown as the output. The video analyzed appears to be a ski/adventure film titled "INTO THE MIND" by Sherpas Cinema.
### Components/Axes
**Left Panel - Instruction:**
* **Header:** "Instruction"
* **Instruction Text:** "Please split this video into scenes, providing start time, end time, and detailed descriptions for each scene."
* **Visual Element:** A grid of small, sequential video thumbnails (approximately 12 columns by 10 rows) showing frames from the video. The thumbnails are too small for individual frame analysis but collectively show a progression of scenes involving mountains, snow, skiing, climbing, and people.
**Right Panel - Response:**
* **Header:** "Response"
* **Content:** A long, continuous block of text providing the requested scene-by-scene analysis. The text is structured with timestamps in the format `(HH:MM:SS - HH:MM:SS)` followed by a descriptive paragraph for each scene.
### Detailed Analysis / Content Details
The response text transcribes the following scene breakdown:
**(00:00:00 - 00:00:15)** Scene in a dark room with a person cooking. Text appears: "THE NORTH FACE PRESENTS," setting a mysterious and intriguing theme.
**(00:00:15 - 00:00:36)** Close-up of an elderly person's face and hands with a prayer wheel. Themes of spirituality, contemplation, and cultural depth.
**(00:00:36 - 00:00:46)** Aerial view of snow-capped mountains. Title card appears: "A NEW FILM BY SHERPAS CINEMA." Themes of natural grandeur and adventure.
**(00:00:46 - 00:01:09)** Montage: close-up of an eye, prayer wheel, wave crashing, underwater view, mountain range. Blends nature, spirituality, and vastness.
**(00:01:09 - 00:01:27)** Interior with candles, close-up of elderly person, yellow jacket details, prayer wheel, mountain range, clouds, forest, river, hand holding prayer wheel. Weaves together spirituality, tradition, and nature.
**(00:01:27 - 00:01:39)** Close-up of person in yellow jacket, eye, aerial view of snow-covered mountains. Themes of preparation, readiness, and awe-inspiring nature.
**(00:01:39 - 00:02:03)** Group of climbers ascending, close-up of hand turning book pages, aerial mountain view, skier descending. Themes of human determination, adventure, and conquering elements.
**(00:02:03 - 00:02:17)** Close-up of prayer wheel, snow-capped peak, vintage watch, person in tent, climbers on a ridge, skier launching off a cliff. Weaves preparation, reflection, and intense adventure.
**(00:02:17 - 00:02:42)** Skier in red jacket performing tricks, skier in blue jacket, skier in red helmet, snowboarder in red jacket, bird soaring. Explores skill, freedom, and connection between humans and nature.
**(00:02:42 - 00:03:05)** Skier performing trick at sunset, incense burning, snow-covered mountain under pink sky, surfer riding a wave. Blend of thrill, reflection, and majesty of nature.
**(00:03:05 - 00:03:27)** Forest, temple with prayer flags, forest floor with leaves, cracked earth, title "INTO THE MIND" appears, credits roll. Concludes with peace, cycles of life, and final credits.
**(00:03:27 - 00:03:37)** Black screen with credits, transitions to a dark, rocky interior. Final credits and fade to black.
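The scene list above follows a regular `(HH:MM:SS - HH:MM:SS)` pattern, so it can be parsed back into structured segments; a minimal sketch (the parsing helper is hypothetical, not part of the described interface):

```python
import re

# Matches "(HH:MM:SS - HH:MM:SS) description"
SCENE_RE = re.compile(r"\((\d{2}):(\d{2}):(\d{2}) - (\d{2}):(\d{2}):(\d{2})\)\s*(.*)")

def parse_scene(line: str):
    """Return (start_seconds, end_seconds, description), or None if no match."""
    m = SCENE_RE.match(line)
    if m is None:
        return None
    h1, m1, s1, h2, m2, s2 = (int(g) for g in m.groups()[:6])
    return h1 * 3600 + m1 * 60 + s1, h2 * 3600 + m2 * 60 + s2, m.group(7)

scene = parse_scene("(00:02:17 - 00:02:42) Skier in red jacket performing tricks.")
# scene == (137, 162, "Skier in red jacket performing tricks.")
```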
### Key Observations
1. **Narrative Structure:** The scene descriptions reveal a clear narrative arc: introduction/mystery -> cultural/spiritual context -> preparation and awe -> intense action and adventure -> synthesis of action and reflection -> conclusion/credits.
2. **Thematic Consistency:** The analysis repeatedly identifies core themes: spirituality, the majesty and challenge of nature, human endeavor (skiing, climbing), preparation, and the connection between humans and the natural world.
3. **Visual Motifs:** Recurring visual elements noted include: prayer wheels, close-ups of faces/eyes/hands, yellow and red jackets, aerial mountain shots, and transitions between serene and intense action.
4. **Film Identity:** The video is explicitly identified as "INTO THE MIND" by Sherpas Cinema, presented by The North Face.
### Interpretation
This image documents the output of a video analysis task, likely performed by an AI or a detailed human editor. The "Response" is not raw data but a synthesized, interpretive breakdown that extracts narrative and thematic content from visual sequences.
* **What the data suggests:** The analysis demonstrates an ability to parse temporal video data into discrete, meaningful segments and describe their content, mood, and symbolic significance. It moves beyond simple visual description to infer themes (e.g., "spirituality," "preparation") and narrative function (e.g., "setting the stage," "concluding the narrative").
* **How elements relate:** The "Instruction" poses a clear, structured task. The "Response" fulfills it with a highly structured output that mirrors the requested format (start/end time + description). The thumbnail grid on the left serves as a visual reference or source for the analysis on the right.
* **Notable patterns:** The analysis shows a strong focus on cinematic language—mentioning camera shots (close-up, aerial), editing ("transitions to," "cuts to"), and mood ("mysterious," "serene," "awe-inspiring"). This suggests the analyzer is tuned to filmic elements, not just content. The consistent identification of themes across disparate scenes indicates a successful high-level comprehension of the video's overarching message about adventure, spirituality, and nature.
</details>
Figure 11: Video scene splitting. Kimi-VL processes a long-form video by segmenting it into coherent scenes and providing detailed start/end timestamps along with fine-grained natural language descriptions for each scene.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Screenshot: Instructional Presentation Interface with Timeline
### Overview
The image is a screenshot of a digital interface, likely from a video player or presentation software. It displays an instructional task at the top, a visual timeline of a presentation in the middle, and a sample response at the bottom. The core content revolves around extending a well-known proverb.
### Components/Axes
1. **Instruction Box (Top):**
* A blue-bordered container with a pencil icon and the title "Instruction".
* Contains the primary task text.
2. **Timeline (Middle):**
* A horizontal sequence of 10 thumbnail images representing slides or frames from a presentation.
* A timestamp "00:00" is positioned above the first thumbnail on the left.
* A timestamp "35:55" is positioned below the last thumbnail on the right.
* Each thumbnail (except the last) has a text caption below it.
3. **Response Box (Bottom):**
* A separate container with an "H" icon and the title "Response".
* Contains a block of explanatory text.
### Detailed Analysis
**1. Instruction Text:**
> "There is an old Chinese saying: Give a man a fish, and you feed him for a day; teach a man to fish, and you feed him for a lifetime. In this representation, the author puts forward a further requirement. Find it then explain it in detail."
**2. Timeline Thumbnail Content (Left to Right):**
* **Thumbnail 1 (00:00):**
* *Image:* A person presenting in front of a projection screen.
* *Caption Title (Cyan):* "Loose analogy"
* *Caption Text:*
* "Give a man a fish, and you feed him for a day."
* "Teach a man to fish, and you feed him for a lifetime."
* **Thumbnail 2:**
* *Image:* Similar presentation scene.
* *Caption:* Identical to Thumbnail 1.
* **Thumbnail 3:**
* *Image:* Similar presentation scene.
* *Caption:* Identical to Thumbnail 1.
* **Thumbnail 4:**
* *Image:* Similar presentation scene.
* *Caption Title:* "Loose analogy"
* *Caption Text:*
* "Give a man a fish, and you feed him for a day."
* "Teach a man to fish, and you feed him for a lifetime."
* **"Teach him the taste of fish and make him hungry"** (This is the new, third line).
* **Thumbnail 5:**
* *Image:* Similar presentation scene.
* *Caption:* Identical to Thumbnail 4 (includes the three lines).
* **Thumbnail 6:**
* *Image:* A different slide with dense text (illegible at this resolution).
* *Caption:* None visible.
* **Thumbnail 7:**
* *Image:* Similar dense text slide.
* *Caption:* None visible.
* **Thumbnail 8:**
* *Image:* Similar dense text slide.
* *Caption:* None visible.
* **Thumbnail 9:**
* *Image:* Similar dense text slide.
* *Caption:* None visible.
* **Thumbnail 10 (35:55):**
* *Image:* A final slide with the text "Thank you!" and a Twitter handle "@hweichung22".
* *Caption:* None.
**3. Response Text:**
> "The requirement is to "Teach him the taste of fish and make him hungry." This implies that in addition to teaching someone how to perform a task or acquire a skill, it is also important to inspire and motivate them to continue learning and improving on their own. By making them hungry for more knowledge or experience, they will be more likely to seek out new challenges and opportunities for growth, which can lead to greater success and fulfillment in the long run."
### Key Observations
1. **Progressive Revelation:** The "further requirement" is not present in the first three thumbnails. It is introduced in the fourth thumbnail and persists in the fifth, indicating it is the key addition to the original proverb.
2. **Visual Structure:** The timeline visually narrates the progression of a presentation, starting with the familiar proverb and culminating in the introduction of the new concept.
3. **Language:** The primary language is English. The referenced "old Chinese saying" is presented in its common English translation.
4. **Source Attribution:** The final slide provides a Twitter handle (`@hweichung22`), suggesting the origin or presenter of this material.
### Interpretation
The image documents a pedagogical or motivational concept that extends the classic "teach a man to fish" proverb. The original proverb advocates for empowerment through skill transfer over temporary aid. The "further requirement" identified—"Teach him the taste of fish and make him hungry"—adds a critical layer: **instilling intrinsic motivation and desire.**
* **Relationship of Elements:** The instruction poses a puzzle. The timeline provides the visual evidence (the presentation slides) where the answer is revealed. The response box offers the correct answer and its interpretation.
* **Core Meaning:** The extension shifts the focus from merely providing a skill (`how to fish`) to creating a fundamental appreciation and craving for the outcome (`the taste of fish`). This "hunger" represents an internal drive for continuous learning, improvement, and self-directed growth. It suggests that sustainable success requires not just capability, but also passion and initiative.
* **Implication:** In contexts like education, management, or personal development, this implies that the most effective teaching or leadership involves sparking curiosity and a love for the subject or goal, thereby creating autonomous, motivated individuals who seek challenges rather than just completing assigned tasks. The data (the slide text) demonstrates this conceptual evolution from a two-part to a three-part model of development.
</details>
Figure 12: Catching and understanding key details from an hour-long video course. Kimi-VL demonstrates its ability to comprehend and interpret instructional video content by analyzing frame sequences and extracting conceptual progression over time. In this case, the model identifies a deepening of the traditional saying “Teach a man to fish, and you feed him for a lifetime” into a more nuanced idea: “Teach him the taste of fish and make him hungry.”
#### 4.1.1 College-level Academic Problems
Our Kimi-VL model demonstrates competitive performance on college-level academic benchmarks. On the MMMU validation set, it achieves a score of 57.0%, outperforming DeepSeek-VL2 (51.1%) and remaining comparable to Qwen2.5-VL-7B (58.6%) and even Gemma-3-12B-IT (59.6%), despite having significantly fewer activated parameters. On college-level video problems, it significantly outperforms Qwen2.5-VL-7B and DeepSeek-VL2, trailing only the >10B Gemma-3-12B-IT, demonstrating solid university-level understanding relative to larger models. These results indicate that Kimi-VL effectively balances parameter efficiency with academic reasoning ability.
#### 4.1.2 General Visual Ability
Kimi-VL exhibits strong general visual understanding capabilities across multiple benchmarks. On MMBench-EN-v1.1, it achieves 83.1% accuracy, outperforming all compared efficient VLMs and performing on par with GPT-4o. On AI2D, our model achieves 84.9%, surpassing all compared models including GPT-4o (84.6%). On MMVet, Kimi-VL scores 66.7%, closely tying with Qwen2.5-VL-7B (67.1%) and GPT-4o-mini (66.9%). On RealWorldQA, it achieves 68.1%, outperforming Gemma-3-12B-IT (59.1%) and approaching Qwen2.5-VL-7B (68.5%). These results demonstrate that our model maintains robust general visual understanding despite its compact architecture.
In multi-image reasoning tasks, Kimi-VL shows promising capabilities with a score of 57.3% on the BLINK benchmark, surpassing Qwen2.5-VL-7B (56.4%), GPT-4o-mini (53.6%), Gemma-3-12B-IT (50.3%), and Llama3.2-11B-Inst. (39.8%). Reasoning across multiple images requires understanding spatial and temporal relationships between visual elements, which our model handles effectively with fewer parameters than most competitors.
#### 4.1.3 Mathematical Reasoning
With its relatively small scale, Kimi-VL also demonstrates strong mathematical reasoning, particularly on the MathVista benchmark, where it achieves 68.7%, outperforming all compared models including GPT-4o (63.8%) and Qwen2.5-VL-7B (68.2%). This indicates our model’s exceptional ability to understand and solve mathematical problems presented in visual contexts. On the more challenging MathVision benchmark, constrained by its limited activated parameters, Kimi-VL outperforms DeepSeek-VL2 and Llama-3.2-11B-Inst. but lags behind Qwen2.5-VL-7B and Gemma-3-12B-IT. Nevertheless, through RL and test-time scaling, Kimi-VL-Thinking has improved significantly and is already on par with 30B-level VLMs (see Table 4). These results highlight our model’s effectiveness in combining visual perception with mathematical problem-solving, an essential capability for real-world applications.
#### 4.1.4 Document Understanding and OCR
Kimi-VL excels in document understanding and OCR tasks across all benchmarks in this category. On InfoVQA, it achieves 83.2% accuracy, outperforming GPT-4o (80.7%) and DeepSeek-VL2 (78.1%). For OCRBench, our model scores 86.7%, surpassing all other models including GPT-4o-mini (78.5%) and DeepSeek-VL2 (81.1%). These results demonstrate that our model has exceptional text recognition and document understanding capabilities, making it especially suitable for applications involving document processing and information extraction.
#### 4.1.5 Agent Grounding and Multi-turn Agent Interaction
In agent-based tasks, Kimi-VL demonstrates remarkable performance. On single-step grounding, our model shows strong accuracy, with 92.0% on ScreenSpot-V2 and 34.5% on the extremely difficult ScreenSpot-Pro (on 4K screens), proving its strong agent grounding abilities. More importantly, it also shows strong multi-turn agent interaction abilities: on OSWorld, Kimi-VL reaches 8.22%, outperforming GPT-4o (5.03%) and other capable open-source models; on WindowsAgentArena, our model achieves 10.4%, also surpassing GPT-4o (9.4%) and others. These results highlight Kimi-VL’s exceptional ability to understand and interact with operating system interfaces, suggesting strong potential for applications in automated UI navigation and task execution.
#### 4.1.6 Long Document and Long Video Understanding
Kimi-VL demonstrates competitive performance in long-form content understanding. On MMLongBench-Doc, a challenging question-answering benchmark over documents of up to 100+ pages, it achieves 35.1%, outperforming GPT-4o-mini (29.0%) and Qwen2.5-VL-7B (29.6%), behind only GPT-4o (42.8%). For long video understanding, on Video-MME our model outperforms all efficient VLMs, leading especially on the fairer without-subtitle setting, where models must find answers in the video frames rather than exploiting input subtitles; on the with-subtitle setting, it also reaches an extraordinary 72.6% accuracy. On the MCQ subset of MLVU, Kimi-VL achieves an impressive 74.2%, setting the state of the art and surpassing both GPT-4o (64.6%) and Qwen2.5-VL-7B (70.2%). On LongVideoBench, it scores 64.5%, outperforming all compared models except GPT-4o (66.7%). These results demonstrate Kimi-VL’s strong capability to understand long-form PDFs and videos.
#### 4.1.7 Egocentric and Fine-grained Video Perception
Kimi-VL also shows strong performance in more nuanced video perception tasks. On the EgoSchema full set (hidden test set), it achieves 78.5%, significantly outperforming GPT-4o (72.2%) and Qwen2.5-VL-7B (65.0%). On VSI-Bench, a very challenging benchmark that requires understanding the spatial relationships and correspondences of multiple objects in a video, our model scores 37.4%, surpassing GPT-4o (34.0%) and Qwen2.5-VL-7B (34.2%). On TOMATO, which examines fine-grained temporal perception in VLMs, Kimi-VL reaches 31.7%, outperforming Qwen2.5-VL-7B (27.6%) and GPT-4o-mini (28.8%). These results demonstrate our model’s strong capability to understand dynamic visual content, track objects over time, and interpret complex actions in video sequences, making it well-suited for applications requiring temporal visual understanding.
### 4.2 Kimi-VL-A3B-Thinking : A Reasoning Extension of Kimi-VL
Furthermore, we extend Kimi-VL with chain-of-thought reasoning and present a long-thinking version of the model, Kimi-VL-Thinking, built through long-CoT activation and reinforcement learning. We validate its superior performance on several image benchmarks, as shown in Table 4.
| Benchmark (Pass@1) |  |  |  |  |  |  |  |  |  |  | Kimi-VL-Thinking-2506 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MathVision (full) | 30.4 | - | 38.1 | 25.1 | 35.5 | 32.1 | - | 35.9 | 38.6 | 36.8 | 56.9 |
| MathVista (mini) | 63.8 | 56.7 | 74.8 | 68.2 | 62.3 | 56.4 | 71.0 | 71.4 | 74.9 | 71.3 | 80.1 |
| MMMU (val) | 69.1 | 60.0 | 74.8 | 58.6 | 64.8 | 59.6 | 77.3 | 70.3 | 70.0 | 61.7 | 64.0 |
| MMMU-Pro (avg) | 51.7 | 37.6 | 51.1 | 38.1 | - | 32.1 | - | - | - | 43.0 | 46.3 |
| VideoMMMU | 61.1 | - | 60.2 | 47.0 | 61.8 | 57.2 | - | - | - | 55.5 | 65.2 |
Table 4: Performance of Kimi-VL-Thinking and Kimi-VL-Thinking-2506 on multimodal reasoning benchmarks. The metrics evaluated include MathVista (mini), MMMU (val), MMMU-Pro (average), MathVision (full), and VideoMMMU, with results expressed in Pass@1. Kimi-VL-Thinking-2506 performs well in most cases, showcasing the enhanced reasoning and processing capabilities of the "thinking" variant across different domains and scales.
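All entries in Table 4 are reported as Pass@1. The report does not spell out the estimator used; a common choice is the standard unbiased pass@k estimator, of which Pass@1 with n sampled generations is the k = 1 special case. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the plain success rate c / n.
```

For example, `pass_at_k(16, 4, 1)` gives 0.25, the fraction of correct generations.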
| Max thinking length (tokens) | 1k | 2k | 4k | 8k | 16k |
| --- | --- | --- | --- | --- | --- |
| MathVision (%) | 18.7 | 22.6 | 29.0 | 34.0 | 36.8 |
| MathVista (%) | 66.7 | 69.0 | 70.9 | 70.6 | 71.3 |
| MMMU (%) | 49.2 | 52.4 | 56.2 | 60.1 | 61.7 |
Figure 13: Test-time accuracy when scaling the max thinking token length of our Kimi-VL-Thinking model.
Kimi-VL-Thinking significantly improves over the base Kimi-VL model, with gains of 2.6% on MathVista, 4.7% on MMMU, and 15.4% on MathVision, demonstrating its capability to leverage test-time computation for deeper reasoning and better handling of complex multimodal queries. In Table 4, Kimi-VL-Thinking further outperforms or rivals state-of-the-art thinking and non-thinking models: achieving 71.3% on MathVista, outperforming GPT-4o (63.8%) and GPT-4o-mini (56.7%); scoring 61.7% on MMMU, surpassing GPT-4o-mini (60.0%) and Qwen2.5-VL-7B (58.6%); and reaching 36.8% on MathVision, exceeding GPT-4o (30.4%), Gemma-3-27B-IT (35.5%), and even QVQ-72B (35.9%). While marginally behind some larger-scale models on select benchmarks, Kimi-VL-Thinking accomplishes these results with only 3B activated parameters, orders of magnitude fewer than its counterparts, underscoring its strong efficiency and effectiveness in multimodal reasoning.
Our Kimi-VL-Thinking model also exhibits strong test-time scaling properties, as shown in Figure 13. Specifically, increasing the max thinking token length at inference time consistently improves test-time accuracy across all three benchmarks. For example, on MathVision, accuracy rises steadily from 18.7% at 1k tokens to 36.8% at 16k tokens, and a similar upward trend is observed on MMMU, indicating that the model is able to utilize longer reasoning chains for better performance. However, not all benchmarks benefit equally from longer thinking lengths. On MathVista, performance saturates early, with accuracy reaching 70.9% at 4k tokens and no significant further gains as the token length increases to 16k. This suggests that for this task, the necessary reasoning depth is already captured within a relatively short context, and additional computation does not yield further improvements.
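The roughly logarithmic MathVision trend in Figure 13 can be quantified with a quick least-squares fit of accuracy against the log of the thinking budget. The fit below is our own illustration using the five plotted values, not an analysis from the report:

```python
import math

# Figure 13, MathVision panel: accuracy (%) vs. max thinking length (k tokens)
lengths = [1, 2, 4, 8, 16]
acc = [18.7, 22.6, 29.0, 34.0, 36.8]

# Ordinary least squares for acc ~ a + b * log2(length)
xs = [math.log2(n) for n in lengths]
mx, my = sum(xs) / len(xs), sum(acc) / len(acc)
b = sum((x - mx) * (y - my) for x, y in zip(xs, acc)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
# a ≈ 18.70, b ≈ 4.76: roughly +4.8 accuracy points per doubling of the budget
```

The near-perfect fit (residuals stay within about one point) supports reading the MathVision curve as logarithmic in the token budget, whereas applying the same fit to MathVista would show the early plateau instead.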
### 4.3 Kimi-VL-A3B-Thinking-2506 : From Reasoning Extension to Integrated Thinking Model
Table 5: Performance of Kimi-VL-A3B-Thinking-2506 on multimodal benchmarks that do not require extensive reasoning.
| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
| --- | --- | --- | --- | --- | --- | --- |
| General Multimodal | | | | | | |
| MMBench-EN-v1.1 (Acc) | 83.1 | 83.2 | 74.6 | 82.9 | 76.0 | 84.4 |
| RealWorldQA (Acc) | 75.4 | 68.5 | 59.1 | 68.1 | 64.0 | 70.0 |
| OCRBench (Acc) | 815 | 864 | 702 | 864 | 864 | 869 |
| MMStar (Acc) | 64.0 | 63.0 | 56.1 | 61.7 | 64.2 | 70.4 |
| MMVet (Acc) | 69.1 | 67.1 | 64.9 | 66.7 | 69.5 | 78.1 |
| Video | | | | | | |
| MMVU ${}_{\text{val}}$ (Pass@1) | 67.4 | 50.1 | 57.0 | 52.7 | 53.0 | 57.5 |
| Video-MME (w/ sub.) (Acc) | 77.2 | 71.6 | 62.1 | 72.7 | 66.0 | 71.9 |
| OS-Agent Grounding | | | | | | |
| ScreenSpot-Pro (Acc) | 0.8 | 29.0 | — | 35.4 | — | 52.8 |
| ScreenSpot-V2 (Acc) | 18.1 | 84.2 | — | 92.8 | — | 91.4 |
| OSWorld-G (Acc) | - | 31.5 | — | 41.6 | — | 52.5 |
| Long Document | | | | | | |
| MMLongBench-Doc (Acc) | 42.8 | 29.6 | 21.3 | 35.1 | 32.5 | 42.1 |
While Kimi-VL-A3B-Thinking shows excellent thinking abilities on hard reasoning tasks, we further present the updated Kimi-VL-A3B-Thinking-2506 (tech blog: https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506), a new reasoning variant that is not only smarter, but also integrates key abilities of Kimi-VL-A3B-Instruct (perception, video, long-document, and OS-agent abilities) into the thinking model.
Kimi-VL-Thinking-2506 significantly improves reasoning accuracy while reducing token consumption. As shown in Table 4, Kimi-VL-Thinking-2506 achieves 56.9% on MathVision (+20.1% over the original Kimi-VL-Thinking), 80.1% on MathVista (+8.4%), 46.3% on MMMU-Pro (+3.2%), and 64.0% on MMMU (+2.1%), demonstrating non-trivial gains across multiple reasoning benchmarks. Meanwhile, while solving these hard reasoning problems, the 2506 version reduces the average output token length by around 20% (e.g., 2.9K $\to$ 2.4K on MMMU-val and 5.8K $\to$ 4.4K on MathVision), making it more efficient and user-friendly for practical deployments.
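The "around 20%" figure follows directly from the two quoted examples:

```python
# Average output token length (K tokens), original Thinking -> Thinking-2506
pairs = {"MMMU-val": (2.9, 2.4), "MathVision": (5.8, 4.4)}
reductions = {name: 1 - new / old for name, (old, new) in pairs.items()}
mean_reduction = sum(reductions.values()) / len(reductions)
# MMMU-val: ~17.2% fewer tokens; MathVision: ~24.1% fewer; mean ~20.7%
```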
Beyond extensive reasoning tasks, Kimi-VL-Thinking-2506 also demonstrates stronger visual perception capabilities (Table 5). Compared to the previous non-thinking variant (Kimi-VL-A3B-Instruct), Kimi-VL-A3B-Thinking-2506 achieves competitive or superior results on general multimodal understanding benchmarks: 84.4% on MMBench-EN-v1.1, 70.4% on MMStar, 70.0% on RealWorldQA, and 78.4% on MMVet, underscoring its broader competence in vision-language tasks. In terms of token efficiency, the 2506 version requires on average only 180 tokens per answer on MMBench, one third of the previous thinking model's usage, while improving accuracy by 8.4%.
Kimi-VL-A3B-Thinking-2506 also extends its reasoning ability to the video and long-context domains. It establishes a new state of the art among open-source models on VideoMMMU (65.2%, 4% better than GPT-4o), a challenging video reasoning benchmark, and maintains robust general video understanding with 71.9% on Video-MME, matching the long-video understanding ability of Kimi-VL-A3B-Instruct. It also scores 42.1% on MMLongBench-Doc (Table 5), making it the first open-source model to match GPT-4o there, a 10% improvement over the previous thinking model and 7% over the previous instruct model, demonstrating its robust ability on broader long-form visual inputs.
As described in the method section, the continual training of MoonViT with a larger input budget (3.2 million max input pixels) for Kimi-VL-A3B-Thinking-2506 leads to substantial improvements on high-resolution perception and OS grounding benchmarks, achieving 83.2% on the V* Benchmark (without external tools), 52.8% on ScreenSpot-Pro, and 52.5% on OSWorld-G (full set with refusal samples), a large step up from both previous variants. We hope that this high-resolution multimodal reasoning model brings interesting new capabilities to real-world applications.
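The report does not detail MoonViT's exact resizing rules; as a rough illustration, a native-resolution encoder with a 3.2-megapixel input budget might downscale oversized images while preserving aspect ratio and snapping each side to the patch grid. The patch size of 14 below is an assumption for illustration only:

```python
import math

def fit_to_pixel_budget(width, height, max_pixels=3_200_000, patch=14):
    """Scale (width, height) down so the area fits within max_pixels,
    preserving aspect ratio and snapping each side to a patch multiple.
    Images already within budget are only snapped, never upscaled."""
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)
    return w, h

# A 4K screenshot (3840x2160 = ~8.3 MP) is shrunk to fit the 3.2 MP budget,
# while a 1000x1000 image is left near its native resolution.
```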
## 5 Conclusion, Limitation, and Future Work
We introduce Kimi-VL, a VLM designed with a balanced approach to cover both multimodal and text-only pre-training/post-training, underpinned by an MoE-based architecture for scalable efficiency. Its 128K extended context window enables precise retrieval in lengthy texts and videos, while the native-resolution encoder MoonViT helps maintain high accuracy with low computational overhead in ultra-high-resolution visual tasks. Additionally, Kimi-VL-Thinking facilitates effective long-chain reasoning in complex image and video inference. Overall, Kimi-VL demonstrates robust adaptability and efficiency across multimodal, long-context, and high-resolution tasks, indicating substantial potential for future research and industrial applications.
However, Kimi-VL still faces several challenges:
1. Although the current model size performs effectively on many standard tasks, it remains too small to address highly specialized or domain-specific problems, or problems that depend strongly on language ability, restricting Kimi-VL's capacity to handle extremely complex scenarios.
1. While the reasoning capability is already strong for typical use cases, it has yet to reach its theoretical upper bound, particularly for intricate tasks requiring multi-step inference or deeper contextual understanding.
1. Despite providing a 128K extended context window, due to the limited parameters in its attention layers (comparable only to a 3B model), its long-context abilities are still insufficient for certain advanced applications that involve extremely long sequences or high-volume contextual information.
In the future, we will tackle these challenges by scaling up the model size, expanding pre-training data, and enhancing post-training algorithms. Our next steps include optimizing Kimi-VL and releasing larger versions, as well as refining post-training and test-time scaling mechanisms for a better thinking model. These efforts will pave the way for more advanced applications in both research and industry.
## References
## Appendix
## Appendix A Contributions
Core Contributors
Bohong Yin Bowei Xing Cheng Chen Chu Wei Dehao Zhang Dongliang Wang Haoning Wu ∗ Haotian Yao Haoyu Lu ∗ Hao Yang Kun Ouyang Lin Sui Xinyuan Wang # Xinyu Zhou Yang Li Y. Charles ∗ Yiping Bao Yimin Chen Yuanxin Liu Yuxin Wu Zaida Zhou Zhaowei Li Zhiqi Huang Zhilin Yang Ziwei Chen
Contributors
Angang Du Bowen Qu Bowen Wang # Chenlin Zhang Chenzhuang Du Congcong Wang Dikang Du Enming Yuan Enzhe Lu Fang Li Flood Sung Guangda Wei Guokun Lai Han Zhu Hao Ding Hao Hu Hao Zhang Heng Wang Hongcheng Gao Huabin Zheng Jiaming Li Jianlin Su Jianzhou Wang Jiaqi Deng # Jiezhong Qiu Jin Xie Jinhong Wang Jingyuan Liu Junjie Yan Liang Chen Longhui Yu Mengfan Dong Mengnan Dong Nuo Xu Pengyu Cheng Qizheng Gu Runjie Zhou Shaowei Liu Sihan Cao Tao Yu # Tianhui Song Tongtong Bai Weiran He Wei Song Weixiao Huang Weixin Xu Xiaokun Yuan Xingzhe Wu Xingcheng Yao Xinhao Li Xinxing Zu Yangyang Hu Yan Zhong Yanru Chen Yibo Miao Yejie Wang Yibo Liu Yidao Qin Yiqin Wang Yongsheng Kang Yuhao Dong Yulun Du Yuzhi Wang Yuzi Yan Zhejun Jiang Zheng Zhang Zihao Huang Zijia Zhao Zongyu Lin
* Project lead(s). # The University of Hong Kong, Moonshot.ai. Authors are listed in alphabetical order by first name.
## Appendix B Evaluation Details
### B.1 Image Benchmark
MMMU yue2024mmmu encompasses a carefully curated collection of 11.5K multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six major academic fields: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
MMBench-EN-v1.1 MMBench is a fine-grained benchmark that contains 2,974 multiple-choice questions covering 20 ability dimensions. It incorporates perception and reasoning as the top-level ability dimensions in its ability taxonomy, leading to different levels of evaluation across the various ability dimensions.
MMStar chen2024mmstar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans. It is designed to benchmark 6 core capabilities and 18 detailed axes, aiming to evaluate the multimodal capacities of LVLMs with a carefully balanced and purified selection of samples.
MMVet yu2024mmvet is designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language capabilities. It defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination.
RealWorldQA realworldQA is a benchmark designed to evaluate the real-world spatial understanding capabilities of multimodal models. It assesses how well the models comprehend physical environments. The benchmark consists of over 700 images, each accompanied by a question and a verifiable answer, and these images are drawn from various real-world scenarios.
AI2D kembhavi2016ai2d is a dataset of over 5,000 grade-school science diagrams with over 150,000 rich annotations and their ground-truth syntactic parses, together with more than 15,000 corresponding multiple-choice questions.
MathVision wang2024measuring is a carefully curated collection of 3,040 high-quality mathematical problems with visual contexts that are sourced from real math competitions. It covers 16 distinct mathematical disciplines and is graded across 5 levels of difficulty. This dataset offers a comprehensive and diverse set of challenges, making it ideal for evaluating the mathematical reasoning abilities of LMMs.
MathVista lu2023mathvista is a benchmark that integrates challenges from a variety of mathematical and visual tasks, demanding participants to exhibit fine-grained, deep visual understanding along with compositional reasoning to successfully complete the tasks.
BLINK fu2024blink is a benchmark designed to evaluate multi-image visual cognition, encompassing tasks related to depth relationships, feature matching, digital forensics, and spatiotemporal reasoning. It features a diverse set of multi-image perceptual similarity tasks, validated through standardized protocols.
InfoVQA mathew2022infographicvqa is a dataset specifically designed to assess models’ capabilities in interpreting and reasoning with complex infographics that integrate text, graphics, and visual elements. Model performance on this dataset is evaluated using the ANLS metric on the test set.
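For reference, ANLS (Average Normalized Levenshtein Similarity) scores each prediction by its edit-distance similarity to the closest acceptable answer, zeroing out matches whose normalized distance exceeds a threshold (commonly 0.5). A minimal sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """ANLS over a dataset; each item may list several acceptable answers."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

An exact (case-insensitive) match scores 1.0, a near-miss such as a single typo scores proportionally, and answers differing in more than half their characters score 0.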
OCRBench liu2023hidden evaluates the OCR capabilities of MLLMs across five tasks: text recognition, scene text VQA, document VQA, key information extraction, and handwritten math expression recognition. The benchmark is scored out of a maximum of 1000 points.
### B.2 Video and Long Document Benchmark
VideoMMMU arxiv2025videommmu is a video benchmark designed to evaluate the college-level knowledge acquisition capabilities of large multimodal models. It consists of 300 expert-level videos and 900 human-annotated questions. The videos span six diverse academic disciplines: Art, Humanities, Medicine, Business, Science, and Engineering. The questions are structured to align with three cognitive stages: Perception, Comprehension, and Adaptation.
MMVU arxiv2025mmvu is a video benchmark designed to evaluate the expert-level video understanding ability. The benchmark contains 3,000 expert-annotated questions over 1,529 videos, which span 27 subjects from four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering.
Video-MME arxiv2024videomme is a video benchmark that consists of 900 manually selected videos (totaling 254 hours) and 2,700 QA pairs. The videos, varying in duration, are categorized into 30 fine-grained classes across six diverse domains: Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual content. Evaluations are conducted under two different settings: with and without subtitles.
MLVU arxiv2024mlvu is designed to evaluate the model performance in comprehending long videos from multiple aspects. It consists of 1,730 videos along with 3,102 corresponding question-answer pairs (2,593 in dev set and 509 in test set). Videos of this benchmark are collected from multiple scenarios, including Sport, Ego-centric, Life Record, Tutorial, etc. The close-ended task set of MLVU comprises 7 different tasks: Action Order, Action Count, Topic Reasoning, Anomaly Recognition, Plot QA, Ego Reasoning, and Needle QA.
LongVideoBench nips2024longvideobench is a video question-answering benchmark designed to evaluate the long-form multimodal perception and relation capabilities of large multimodal models. The benchmark includes 3,763 web-collected videos spanning various lengths and themes, along with their corresponding subtitles. It includes 6,678 human-annotated multiple-choice questions distributed across 17 fine-grained categories, which assess different aspects of video-language understanding.
EgoSchema nips2023egoschema is a video benchmark designed to evaluate long-form video understanding capabilities in the ego-centric scenario. Derived from Ego4D cvpr2022ego4d, the benchmark comprises over 5,031 multiple-choice question-answer pairs spanning more than 250 hours of real-world video, built with a semi-automatic data pipeline.
VSI-Bench arxiv2024vsibench is designed to evaluate the visual-spatial comprehensive capabilities of large multimodal models. It consists of over 5,000 question-answer pairs across around 290 real indoor-scene videos.
TOMATO iclr2025tomato is a video benchmark comprising 1,484 human-annotated question-answer pairs over 1,417 videos. TOMATO focuses on evaluating the temporal reasoning capabilities of large multimodal models, including action counting, direction prediction, rotation analysis, shape & trend detection, velocity & frequency estimation, and visual cue interpretation.
### B.3 Agent Benchmark
ScreenSpot V2 wu2024osatlas is an enhanced version of the ScreenSpot cheng2024seeclick benchmark, which focuses on evaluating the performance of GUI grounding models across multiple platforms, including web, desktop, and mobile interfaces. This updated version addresses several issues identified in the original ScreenSpot dataset, such as incorrect or ambiguous annotations, spelling mistakes, and mislabeled bounding boxes.
ScreenSpot Pro li2025screenspotpro is a benchmark for evaluating GUI grounding in high-resolution, complex UI environments. It contains 1,581 real-world, high-resolution images and expert-annotated tasks from diverse professional domains, including domain-specific interface conventions that challenge models to understand professional-grade interfaces beyond consumer applications.
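ScreenSpot-style benchmarks typically score a prediction as correct when the predicted click point falls inside the target element's bounding box. A minimal sketch of this accuracy metric (the `(x1, y1, x2, y2)` pixel-coordinate format is an assumption for illustration):

```python
def grounding_accuracy(preds, boxes):
    """GUI grounding accuracy: a prediction counts as correct when the
    predicted (x, y) click point lies inside the target bounding box,
    given as (x1, y1, x2, y2) in pixel coordinates."""
    hits = sum(x1 <= x <= x2 and y1 <= y <= y2
               for (x, y), (x1, y1, x2, y2) in zip(preds, boxes))
    return hits / len(preds)
```

The metric is resolution-sensitive by construction: on the high-resolution screenshots of ScreenSpot-Pro, target boxes occupy a far smaller fraction of the image, which is why scores drop sharply relative to ScreenSpot-V2.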
OSWorld xie2024osworld is a pioneering scalable, real computer environment designed for multimodal agents, facilitating task setup, execution-based evaluation, and interactive learning across multiple operating systems, including Ubuntu, Windows, and macOS. It serves as a unified platform for evaluating open-ended computer tasks that involve arbitrary applications, addressing the limitations of existing benchmarks that often lack interactive environments or are confined to specific applications or domains.
WindowsAgentArena bonatti2024windowsagentarenaevaluating is a benchmark designed to evaluate multimodal agents in realistic Windows environments. Built on the OSWorld framework, it allows agents to interact with a full range of applications and web tools. The benchmark is scalable and can complete evaluations in under 20 minutes on Azure. It offers insights into agent performance, highlighting the potential for future research in agent development and task automation.