# NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Code: https://github.com/OpenGVLab/NaViL
Abstract
Compositional training has been the de facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained visual encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling properties of this paradigm remain difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling properties under a practical setting, i.e., the data constraint. Through a careful study of various choices in MLLMs, we obtain the meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of native MLLMs and reveal a positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond this, our findings and results provide in-depth insights for the future study of native MLLMs.

\* Equal contribution. 🖂 Corresponding to Jifeng Dai <daijifeng@tsinghua.edu.cn>. † Work was done when Changyao Tian, Hao Li, and Jie Shao were interns at Shanghai AI Laboratory.
1 Introduction
Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in computer vision InternVL-2.5 ; mono_internvl ; Qwen2vl ; gpt4v ; reid2024gemini1_5 , continuously breaking through the upper limits of various multimodal tasks mathew2021docvqa ; yu2023mmvet ; liu2023mmbench ; Datasets:ChartQA . The great success of MLLMs is inseparable from their compositional training paradigm, which independently pre-trains visual encoders openclip and LLMs touvron2023llama , and then integrates them through additional multimodal training. Due to its engineering simplicity and effectiveness, this paradigm has dominated the MLLM area over the past few years. However, the shortcomings of compositional training have recently been recognized by the community, e.g., its unclear multimodal scaling properties diao2024EVE ; shukor2025scaling .
Therefore, increasing attention has been directed toward the development of more native MLLMs. As illustrated in Fig. 1, native MLLMs aim to jointly optimize both visual and language spaces in an end-to-end manner, thereby maximizing vision-language alignment. Compared to the compositional paradigm, existing native MLLM methods demonstrate a promising scaling law and a significantly simplified training process team2024chameleon ; shukor2025scaling . Despite these advancements, the primary benefits of native MLLMs are often evaluated under the assumption of infinite training resources, overlooking the substantial challenges posed by limited data and large-scale training. Consequently, a critical practical question remains: whether and how native MLLMs can feasibly achieve or even surpass the performance upper bound of top-tier MLLMs at an acceptable cost.
<details>
<summary>x1.png Details</summary>

Figure 1 panels: (1) "Design Choices of Native MLLMs" — validation loss vs. training data size (10^7–10^8, log scale) for base, w/ MoE, and w/ V*_{d,w}(·), with w/ V*_{d,w}(·) reaching the lowest loss (~0.85 at 10^8); (2) "Scaling Properties of Native MLLMs" — validation loss vs. training data size for 0.5B, 2B, and 7B models, with larger models reaching lower loss (~0.75 for 7B at 10^8); (3) radar chart comparing NaViL-2B (Ours) against Emu3, EVEv2, InternVL-2.5-2B, Chameleon-7B, and Mono-InternVL on ChartQA, AI2D, MMB, MMVet, GQA, MMMU, SQA-I, OCRBench, TextVQA, MathVista, InfoVQA, CCBench, and DocVQA, where NaViL-2B generally covers the largest area.
</details>
Figure 1: Comparison of design choices, scaling properties, and performance of our native MLLMs. We systematically investigate the designs and the scaling properties of native MLLMs under data constraints and yield valuable findings for building native MLLMs. After adopting these findings, our native MLLMs achieve competitive performance with top-tier MLLMs. $\mathcal{V}^{*}_{d,w}(·)$ denotes the visual encoder with optimal parameter size.
To answer this question, in this paper we systematically investigate the designs and scaling properties of native MLLMs under data constraints. Specifically, we first explore the choices of key components in the native architecture, including the mixture-of-experts (MoE), the visual encoder, and the initialization of the LLM. Our findings are twofold. First, an appropriate pre-training initialization of the LLM (e.g., from a base LLM) greatly benefits training convergence on multimodal data. Second, combining visual encoder architectures with MoEs yields clear gains over the vanilla decoder-only LLM. Following these findings, we build a meta-architecture that optimally balances performance and training cost.
Based on the optimal meta-architecture, we further explore the scaling properties of the visual encoder, the LLM, and the entire native MLLM. Specifically, we first scale up the LLM and the visual encoder independently and observe different scaling properties: while scaling the LLM exhibits patterns similar to conventional language scaling laws, scaling the visual encoder shows diminishing returns bounded by the LLM's capacity, suggesting that the optimal encoder size varies with the LLM size. Further analysis reveals that the optimal encoder size increases approximately proportionally with the LLM size in log scale. This observation yields guidance that differs from the compositional paradigm, which employs a visual encoder of a single size across all LLM scales.
Based on the above principles, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. To validate our approach, we conduct extensive experiments across diverse benchmarks evaluating multimodal capabilities, including image captioning chen2015cococaption ; Datasets:Flickr30k ; agrawal2019nocaps , optical character recognition (OCR) Datasets:TextVQA ; Datasets:DocVQA ; liu2023ocrbench , etc. Experimental results reveal that, with ~600M pre-training image-text pairs, NaViL achieves competitive performance compared to current top-tier compositional MLLMs, highlighting its practicality and capability. In summary, our contributions are as follows:
- We systematically explore the design space and the optimal choices of native MLLMs under data constraints, including the LLM initialization, the visual encoder, and the MoEs, and draw three critical findings that greatly benefit the training of native MLLMs.
- Based on the above findings, we construct a novel native MLLM called NaViL. In NaViL, we explore the scaling properties of the visual encoder and the LLM and reveal their positively correlated scaling relationship.
- We conduct large-scale pre-training and fine-tuning experiments on NaViL. Experimental results show that NaViL can achieve top-tier performance with nearly 600M pre-training data. Our findings and results will encourage future work for native MLLMs in the community.
2 Related Work
Multimodal Large Language Models. Recent years have witnessed significant progress in Multimodal Large Language Models (MLLMs) llava-hr ; VLM:LLaVA ; VLM:LLaVA-1.5 ; Qwen2vl ; InternVL-2.5 , which have dominated various downstream tasks goyal2017vqav2 ; hudson2019gqa ; Datasets:TextVQA ; Datasets:AI2D . Starting from LLaVA VLM:LLaVA , most existing MLLMs adopt the compositional paradigm, which connects a pre-trained visual encoder VLP:CLIP and an LLM qwen through a projector; the whole model is then fine-tuned on multimodal data for alignment. Based on this paradigm, existing works mainly focus on improving visual encoders Qwen2vl ; wang2023internimage ; llava-hr and designing connectors li2022blip ; VLM:LLaVA . Despite this progress, such a paradigm struggles to explore the joint scaling properties of vision and language. Its potential limitations in the training pipeline shukor2025scaling and in vision-language alignment diao2024EVE are also gradually being recognized by the community.
Native Multimodal Large Language Models. To overcome the limitations of the compositional paradigm, native MLLMs have emerged as a candidate solution diao2025evev2 ; diao2024EVE ; mono_internvl ; lei2025sail ; vora ; shukor2025scaling ; team2024chameleon . Compared to the compositional paradigm, native MLLMs pre-train both vision and language parameters in an end-to-end manner, thus achieving better alignment. The most representative methodology shukor2025scaling ; team2024chameleon is to pre-train the LLM from scratch on large-scale multimodal corpora, which typically incurs expensive training costs. To address this issue, recent attempts initialize the LLM with a pre-trained checkpoint to facilitate training convergence diao2025evev2 ; diao2024EVE ; mono_internvl ; lei2025sail ; vora . Nevertheless, current research still lacks a systematic investigation into the architectural design and scaling characteristics of native MLLMs, limiting their performance.
3 Design Principles for Native MLLMs
3.1 Problem Setup
We define native MLLMs as models that jointly optimize vision and language capabilities in an end-to-end manner. Despite recent progress showing a promising scaling law and potentially better performance compared with their compositional counterparts, how to build native MLLMs competitive with state-of-the-art MLLMs at a practical data scale remains underexplored. In particular, two problems require investigation:
- (Sec. 3.2) How to choose the optimal architectures of the visual and linguistic components?
- (Sec. 3.3) How to optimally scale up the visual and linguistic components?
Meta Architecture. To study these two questions, we first define a general meta-architecture of native MLLMs consisting of a visual encoder, an LLM, and a mixture-of-experts architecture injected into the LLM. The visual encoder $\mathcal{V}$ consists of a series of transformer layers and can be defined as
$$
\mathcal{V}_{d,w}(I)=\mathcal{C}\odot\mathcal{F}_{d}^{w}\odot\cdots\odot\mathcal{F}_{2}^{w}\odot\mathcal{F}_{1}^{w}\odot\mathcal{P}(I)=\mathcal{C}\bigodot_{i=1...d}\mathcal{F}_{i}^{w}\odot\mathcal{P}(I), \tag{1}
$$
where $\mathcal{F}_{i}^{w}$ denotes the $i$-th transformer layer (out of $d$ layers) with hidden dimension $w$, $\mathcal{P}$ denotes the patch embedding layer, and $I \in \mathbb{R}^{H \times W \times 3}$ denotes the input image. Note that the visual encoder degenerates to a simple patch embedding layer when $d=0$. For simplicity, we use the same architecture as the LLM for the visual encoder layers $\mathcal{F}$, but with bi-directional attention, and vary the hyperparameters $d$ and $w$. Here $\mathcal{C}$ is the connector, which downsamples the encoded image embeddings through pixel shuffle VLM:InternVL and projects them into the LLM's feature space with an MLP.
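As a shape-level illustration of Eq. (1), the sketch below composes a patch embedding $\mathcal{P}$, $d$ placeholder transformer layers $\mathcal{F}_i^w$, and a pixel-shuffle connector $\mathcal{C}$. All layer internals, sizes, and the random projections are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def patch_embed(image, patch=14, w=64):
    """P(I): split an HxWx3 image into patches and project each to width w."""
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * 3)
    proj = np.random.randn(patch * patch * 3, w) * 0.02  # stand-in projection
    return x @ proj, (gh, gw)

def layer(x):
    """F_i^w: placeholder for a bidirectional transformer layer (shape-preserving)."""
    return x

def connector(x, grid, w, llm_dim=128):
    """C: 2x2 pixel shuffle (4x fewer tokens, 4x wider), then MLP to the LLM space."""
    gh, gw = grid
    x = x.reshape(gh, gw, w).reshape(gh // 2, 2, gw // 2, 2, w)
    x = x.transpose(0, 2, 1, 3, 4).reshape((gh // 2) * (gw // 2), 4 * w)
    mlp = np.random.randn(4 * w, llm_dim) * 0.02  # stand-in MLP
    return x @ mlp

def visual_encoder(image, d=3, w=64):
    """V_{d,w}(I) = C ∘ F_d ∘ ... ∘ F_1 ∘ P(I), as in Eq. (1)."""
    x, grid = patch_embed(image, w=w)
    for _ in range(d):
        x = layer(x)
    return connector(x, grid, w)

tokens = visual_encoder(np.zeros((56, 56, 3)), d=3, w=64)
print(tokens.shape)  # (4, 128): (56/14)^2 = 16 patches -> 4 tokens after shuffle
```

Setting `d=0` skips the loop entirely, matching the degenerate patch-embedding-only case described above.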
Experiment Settings. All models are trained on web-scale, noisy image-caption pair data Datasets:Laion-5b with Next-Token-Prediction (NTP) on an image captioning task. We use a held-out subset of the multimodal dataset to calculate the validation teacher-forcing loss for measuring and comparing different design choices. Models with LLM initialization are initialized from InternLM2-Base cai2024internlm2 .
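The teacher-forcing validation loss above can be sketched as a next-token cross-entropy over the sequence; masking out image-token positions from the loss, and all shapes and stand-in tensors, are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def teacher_forcing_loss(logits, labels, loss_mask):
    """NTP cross-entropy. logits: (seq, vocab); labels: (seq,);
    loss_mask: 1.0 where the target token contributes to the loss."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]      # per-token NLL
    return (nll * loss_mask).sum() / loss_mask.sum()      # mean over text tokens

rng = np.random.default_rng(0)
seq, vocab, n_img = 16, 32, 6
logits = rng.normal(size=(seq, vocab))          # stand-in model outputs
labels = rng.integers(0, vocab, size=seq)       # stand-in ground-truth tokens
mask = np.array([0.0] * n_img + [1.0] * (seq - n_img))  # skip image positions
loss = teacher_forcing_loss(logits, labels, mask)
print(f"{loss:.3f}")
```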
3.2 Exploring the Optimal Design of Architecture Components
In this section, we explore the design choices of three key components: 1) the initialization of the LLM; 2) the effectiveness of MoEs; 3) the optimal architecture of the visual encoder.
3.2.1 Initialization of LLM
A straightforward way to construct native MLLMs is to train all modalities from scratch with mixed corpora, as shown in prior work shukor2025scaling . While this approach theoretically offers the highest performance ceiling given ample data and computational resources, practical limitations such as data scarcity and large-scale optimization challenges hinder its feasibility. Alternatively, initializing the model from a pre-trained LLM effectively leverages linguistic prior knowledge, significantly reducing data and computational demands.
<details>
<summary>x2.png Details</summary>

Figure 2 panels: validation loss (left) and COCO Caption CIDEr (right) vs. training data size (10^7–10^9, log scale) for "w/ LLM init" and "w/o LLM init". The initialized model reaches a given validation loss with roughly 10x less data, and its CIDEr score plateaus higher (~65 vs. ~57 at 10^9, extrapolated).
</details>
Figure 2: Effectiveness of LLM initialization. Left: The validation loss. The LLM initialized one converges much faster. Right: The zero-shot caption performance. Due to the lack of textual knowledge, the uninitialized model continues to lag behind.
To evaluate the effectiveness of LLM initialization, we compare model performance in terms of loss and image captioning. As shown in Fig. 2 (left), the model trained from scratch performs significantly worse than the initialized model, requiring over 10x more data to reach comparable loss.
Further analysis of zero-shot image captioning (Fig. 2 (right)) reveals a substantial performance gap favoring the initialized model, even with significantly more data for the non-initialized model. This is likely due to the lower textual quality and diversity of multimodal training data compared to the LLM pre-training corpus, limiting the textual capability of models trained from scratch. These findings highlight the practical advantage of using LLM initialization in multimodal pre-training.
Observation 1:
Initializing from pre-trained LLM greatly benefits the convergence on multimodal data, and in most cases delivers better performance even with a large amount of multimodal data.
3.2.2 Effectiveness of MoEs
<details>
<summary>x3.png Details</summary>

Figure 3: validation loss vs. training data size (10^7–10^8, log scale) for "w/o MoE" and "w/ MoE". The MoE variant stays consistently lower and decreases more steeply (~1.7 to ~0.85 vs. ~2.2 to ~1.8).
</details>
Figure 3: The validation loss of adding MoE or not. Using MoE extension will cause the loss to decrease more quickly.
Mixture-of-Experts (MoEs) are effective for handling heterogeneous data and are widely used in native MLLMs. We evaluate the MoE architecture within our meta-architecture by comparing two configurations: one with a visual encoder and a vanilla LLM, and another with a visual encoder and an MoE-extended LLM. We follow Mono-InternVL mono_internvl in adopting modality-specific MoEs and training settings. However, we empirically found that using only the feed-forward network (FFN) expert leads to a significant difference in feature scale between the visual and language modalities. To mitigate this issue, we further introduce modality-specific attention experts, i.e., different projection layers (QKVO) in the self-attention layer that process visual and text features respectively, followed by a unified global attention computation. Specifically, the output $x_{i,m}^{l} \in \mathbb{R}^{d}$ of the $i$-th token with modality $m \in \{\text{visual},\text{linguistic}\}$ at the $l$-th layer of the MoE-extended LLM can be defined as
$$
x_{i,m}^{l^{\prime}} = x_{i,m}^{l-1}+\text{MHA-MMoE}(\text{RMSNorm}(x_{i,m}^{l-1})), \qquad x_{i,m}^{l} = x_{i,m}^{l^{\prime}}+\text{FFN-MMoE}(\text{RMSNorm}(x_{i,m}^{l^{\prime}})), \tag{2}
$$
where $\text{RMSNorm}(\cdot)$ is the RMS normalization operation, and $\text{MHA-MMoE}(\cdot)$ and $\text{FFN-MMoE}(\cdot)$ are the modality-specific attention and FFN experts, respectively, formulated as
$$
\begin{aligned}
\text{MHA-MMoE}(x_{i,m}) &= \Big(\text{softmax}\Big(\frac{QK^{T}}{\sqrt{d}}\Big)V\Big)W_{O}^{m}, \\
Q_{i,m} = x_{i,m}W_{Q}^{m},\quad K_{i,m} &= x_{i,m}W_{K}^{m},\quad V_{i,m} = x_{i,m}W_{V}^{m}, \\
\text{FFN-MMoE}(x_{i,m}) &= \big(\text{SiLU}(x_{i,m}W_{\text{gate}}^{m})\odot x_{i,m}W_{\text{up}}^{m}\big)W_{\text{down}}^{m}.
\end{aligned} \tag{3}
$$
Here $W_{Q}^{m},W_{K}^{m},W_{V}^{m},W_{O}^{m}$ and $W_{\text{gate}}^{m},W_{\text{up}}^{m},W_{\text{down}}^{m}$ are all modality-specific projection matrices, $\text{SiLU}(\cdot)$ denotes the activation function, and $\odot$ denotes the element-wise product. The number of activated experts is set to one to maintain consistent inference costs.
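A minimal sketch of Eqs. (2)–(3): each modality owns its QKVO and FFN weights, while attention is computed globally over the mixed token sequence. All dimensions, initializations, and the single-head attention are illustrative simplifications (RMSNorm and multi-head splitting are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def make_expert():
    """One set of modality-specific weights: QKVO plus gate/up/down for the FFN."""
    return {k: rng.normal(0, 0.1, (d, d)) for k in ("Q", "K", "V", "O", "gate", "up", "down")}

experts = {"visual": make_expert(), "linguistic": make_expert()}

def silu(x):
    return x / (1.0 + np.exp(-x))

def mha_mmoe(x, modality):
    """MHA-MMoE: per-token modality-specific projections, then unified global attention."""
    Q = np.stack([x[i] @ experts[m]["Q"] for i, m in enumerate(modality)])
    K = np.stack([x[i] @ experts[m]["K"] for i, m in enumerate(modality)])
    V = np.stack([x[i] @ experts[m]["V"] for i, m in enumerate(modality)])
    attn = Q @ K.T / np.sqrt(d)                       # attention over the whole sequence
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ V
    return np.stack([out[i] @ experts[m]["O"] for i, m in enumerate(modality)])

def ffn_mmoe(x, modality):
    """FFN-MMoE: exactly one expert activated per token, routed by modality."""
    e = lambda m: experts[m]
    return np.stack([(silu(x[i] @ e(m)["gate"]) * (x[i] @ e(m)["up"])) @ e(m)["down"]
                     for i, m in enumerate(modality)])

x = rng.normal(size=(5, d))                # 3 visual + 2 text tokens
mods = ["visual"] * 3 + ["linguistic"] * 2
h = x + mha_mmoe(x, mods)                  # residual form of Eq. (2)
h = h + ffn_mmoe(h, mods)
print(h.shape)  # (5, 8)
```

Because routing is deterministic by modality and only one expert fires per token, the activated parameter count per token matches the vanilla LLM, consistent with the fixed inference cost noted above.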
As shown in Fig. 3, the MoE architecture significantly accelerates model convergence compared to the vanilla LLM, achieving the same validation loss with only 1/10 of the data without increasing training or inference cost. This demonstrates that MoE enhances model capacity and effectively handles heterogeneous data, making it suitable for native MLLMs.
Observation 2:
MoEs significantly improve model performance without increasing the number of activated parameters.
3.2.3 Optimizing the Visual Encoder Architecture
<details>
<summary>x4.png Details</summary>

Figure 4 panels: validation loss and zero-shot CIDEr on COCO, Flickr, and NoCaps captioning for visual encoder configurations d3 w4096, d6 w2880, d12 w2048, d24 w1472, and d48 w1024, each at training data sizes of 15M, 30M, 60M, and 120M. Results are similar across configurations except at the extremes: the shallowest (d3 w4096) is slightly worse and the deepest (d48 w1024) degrades clearly in both loss and CIDEr.
</details>
Figure 4: The validation loss and zero-shot caption performance of different visual encoders. The loss and performance only differ when the visual encoder is extremely wide or shallow.
The visual encoder precedes the LLM to perform preliminary extraction of visual information, converting raw pixels into semantic visual features aligned with the textual embedding space. Due to its bidirectional attention mechanism and the increased capacity introduced by additional parameters, the visual encoder has the potential to enhance the model’s ability to represent visual information.
In this section, we investigate the optimal architecture of the visual encoder under a given parameter budget. The total parameter count $\mathcal{N}$ can be approximately calculated openai2020scaling as $\mathcal{N}=12\times d\times w^{2}$ . Given a fixed $\mathcal{N}$ , the structure of the visual encoder is mainly determined by its width $w$ and depth $d$ .
Depth ( $d$ ): Deeper models typically capture richer and more complex features, but are also more prone to vanishing gradients tan2019efficientnet . For MLLMs, a visual encoder that is too shallow may fail to extract enough high-level semantics, while one that is too deep may discard low-level features, limiting the capture of fine-grained details.
Width ( $w$ ): Compared to depth, width has relatively little impact on vision transformer performance dosovitskiy2020image , as long as it does not introduce an information bottleneck, i.e., the width should not be smaller than the dimensionality of a single flattened image patch. Under this premise, the width of the visual encoder need not match the hidden size of the LLM.
We train MLLMs with various $\mathcal{V}_{d,w}$ configurations (combinations of depth and width), keeping the pre-trained LLM fixed and the visual encoder parameter count at 600M. The depth $d$ ranges over $\{3,6,12,24,48\}$ , with the width $w$ set correspondingly to $\{4096,2880,2048,1472,1024\}$ to maintain a consistent parameter count. Fig. 4 shows the validation loss for the different depth-width combinations as the training data size varies. Models with extremely high or low depth perform worse than those with moderate configurations. Among reasonably configured models, shallower ones converge faster in the early phase (less than 30M samples), but this advantage diminishes with more data. On zero-shot image captioning benchmarks, deeper visual encoders perform slightly better, consistent with prior research on compute-optimal LLM architectures openai2020scaling , which likewise finds a wide range of near-optimal width-depth combinations.
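Under the approximation $\mathcal{N}=12\times d\times w^{2}$ , the width for each depth in the sweep can be recovered by inverting the formula. A minimal sketch (the rounding to hardware-friendly values such as 4096 or 2880 is an extra step not shown here):

```python
import math

def width_for_depth(n_params: float, depth: int) -> int:
    """Invert N = 12 * d * w^2 to get the encoder width that keeps
    the parameter budget fixed at a given depth."""
    return round(math.sqrt(n_params / (12 * depth)))

# The paper's 600M budget: depths {3, 6, 12, 24, 48} yield widths
# close to the configurations used ({4096, 2880, 2048, 1472, 1024}).
widths = {d: width_for_depth(600e6, d) for d in (3, 6, 12, 24, 48)}
```

For instance, depth 12 gives a width of about 2041, which the sweep rounds to 2048.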
Observation 3:
Visual encoders achieve near-optimal performance across a wide range of depth and width configurations. Shallower encoders converge faster in early training, while deeper encoders perform slightly better with larger datasets.
3.3 Scaling Up Native MLLMs
In this section, we consider the scaling properties of our meta architecture. Specifically, we investigate: 1) the impact of scaling up the visual encoder and the LLM independently; 2) the optimal way of scaling the visual encoder and the LLM simultaneously. All models follow the optimal architecture discovered in Sec. 3.2, i.e., with LLM initialization, MoEs, and optimal depth-to-width ratios of the visual encoders.
3.3.1 Scaling up Visual Encoder and LLM Independently
We first investigate the scaling properties of the visual encoder and the LLM independently, i.e., scaling up one component while keeping the other fixed. Specifically, we evaluate a series of LLMs with parameter sizes $\{0.5B,1.8B,7B\}$ and visual encoders with sizes $\{75M,150M,300M,600M,1.2B,2.4B\}$ .
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: Validation Loss vs LLM Size
### Overview
The image is a line chart that plots the validation loss against the LLM (Large Language Model) size. The chart shows how the validation loss decreases as the LLM size increases, for a fixed Visual Encoder Size of 600M.
### Components/Axes
* **Title:** Validation Loss vs LLM Size
* **X-axis:** LLM Size (B) - with values 0.5, 2, and 7
* **Y-axis:** Validation Loss - with values ranging from 0.60 to 1.20, in increments of 0.20.
* **Legend:** Located at the top-right of the chart. It indicates that the line represents "Visual Encoder Size: 600M". The line color is a muted purple.
### Detailed Analysis
The chart contains one data series, representing the validation loss for a visual encoder size of 600M.
* **Visual Encoder Size: 600M (muted purple line):**
* At LLM Size 0.5, the Validation Loss is approximately 1.05.
* At LLM Size 2, the Validation Loss is approximately 0.80.
* At LLM Size 7, the Validation Loss is approximately 0.67.
The line slopes downward, indicating a decreasing validation loss as the LLM size increases.
### Key Observations
* The validation loss decreases as the LLM size increases.
* The rate of decrease in validation loss appears to diminish as the LLM size gets larger.
### Interpretation
The chart suggests that increasing the LLM size (up to 7B) reduces the validation loss, indicating improved model performance. The diminishing rate of decrease suggests that there might be a point of diminishing returns, where further increases in LLM size yield smaller improvements in validation loss. The data is specific to a visual encoder size of 600M, and the relationship might differ for other encoder sizes.
</details>
Figure 5: The validation loss when scaling up LLMs. With the same visual encoder (i.e. 600M), the validation loss decreases log-linearly with the LLM size.
Scaling up LLMs. The results are shown in Fig. 5. Scaling up the LLM parameters in native MLLMs exhibits a pattern consistent with the conventional LLM scaling law, where the loss decreases linearly as the parameter size increases exponentially.
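As a sanity check, the three points read off Fig. 5 (LLM sizes 0.5B/2B/7B with losses roughly 1.05/0.80/0.67; approximate values of ours, not reported numbers) fit a log-linear law $L \approx a + b\ln N$ well:

```python
import math

# Approximate (LLM size in B, validation loss) pairs read off Fig. 5.
points = [(0.5, 1.05), (2.0, 0.80), (7.0, 0.67)]

def fit_log_linear(pts):
    """Least-squares fit of loss = a + b * ln(size)."""
    xs = [math.log(n) for n, _ in pts]
    ys = [loss for _, loss in pts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

a, b = fit_log_linear(points)  # b < 0: loss falls log-linearly with LLM size
```

The residuals stay within a few hundredths of a loss unit, consistent with the log-linear trend the figure shows.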
Scaling up Visual Encoder. The results are shown in Fig. 6. In contrast to the LLM scaling law, increasing the visual encoder size does not consistently enhance multimodal performance. Instead, with a fixed LLM, the performance gains achieved by enlarging the visual encoder diminish progressively. Beyond a certain encoder size, further scaling results in only marginal loss reduction, indicating that the performance upper limit of the MLLM is constrained by the LLM’s capacity.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Charts: Validation Loss vs. Visual Encoder Size for Different LLMs
### Overview
The image presents three line charts comparing the validation loss (on a logarithmic scale) against the visual encoder size for different Large Language Models (LLMs): LLM-0.5B, LLM-1.8B, and LLM-7B. Each chart displays the performance of the LLM with varying training data sizes (15M, 30M, 60M, and 120M).
### Components/Axes
* **Title:** Each chart has a title indicating the LLM being evaluated (LLM-0.5B, LLM-1.8B, LLM-7B).
* **X-axis:** "Visual Encoder Size" with different scales for each chart.
* LLM-0.5B: 75, 150, 300, 600
* LLM-1.8B: 150, 300, 600, 1200
* LLM-7B: 300, 600, 1200, 2400
* **Y-axis:** "Validation Loss (log scale)"
* LLM-0.5B: Scale from 1.4 to 2.0
* LLM-1.8B: Scale from 0.7 to 1.6
* LLM-7B: Scale from 0.6 to 1.0
* **Legend:** Located within each chart, indicating the training data size:
* 15M (lightest color, diamond marker)
* 30M (slightly darker, triangle marker)
* 60M (medium color, circle marker)
* 120M (darkest color, square marker)
### Detailed Analysis
**LLM-0.5B**
* **15M (lightest color, diamond marker):** The line is approximately flat at a validation loss of around 1.95 across all visual encoder sizes.
* **30M (slightly darker, triangle marker):** The line is approximately flat at a validation loss of around 1.7 across all visual encoder sizes.
* **60M (medium color, circle marker):** The line slightly decreases from approximately 1.5 to 1.4 as the visual encoder size increases from 75 to 600.
* **120M (darkest color, square marker):** The line decreases from approximately 1.2 to 1.1 as the visual encoder size increases from 75 to 600.
**LLM-1.8B**
* **15M (lightest color, diamond marker):** The line is approximately flat at a validation loss of around 1.5 across all visual encoder sizes.
* **30M (slightly darker, triangle marker):** The line is approximately flat at a validation loss of around 1.2 across all visual encoder sizes.
* **60M (medium color, circle marker):** The line decreases from approximately 0.95 to 0.85 as the visual encoder size increases from 150 to 1200.
* **120M (darkest color, square marker):** The line decreases from approximately 0.75 to 0.7 as the visual encoder size increases from 150 to 1200.
**LLM-7B**
* **30M (slightly darker, triangle marker):** The line decreases from approximately 1.0 to 0.9, then increases slightly to 0.92 as the visual encoder size increases from 300 to 2400.
* **60M (medium color, circle marker):** The line decreases from approximately 0.85 to 0.75, then increases slightly to 0.78 as the visual encoder size increases from 300 to 2400.
* **120M (darkest color, square marker):** The line decreases from approximately 0.68 to 0.63, then remains relatively flat as the visual encoder size increases from 300 to 2400.
### Key Observations
* For all LLMs, increasing the training data size (from 15M to 120M) generally reduces the validation loss, indicating better model performance.
* The validation loss tends to decrease as the visual encoder size increases, especially for the 60M and 120M training data sizes.
* The LLM-7B model shows a slight increase in validation loss at the largest visual encoder size (2400) for the 30M and 60M training data sizes.
### Interpretation
The data suggests that increasing both the training data size and the visual encoder size generally improves the performance of the LLMs, as indicated by the lower validation loss. The LLM-7B model, being the largest, benefits the most from increased training data and visual encoder size, achieving the lowest validation loss among the three models. The slight increase in validation loss for LLM-7B at the largest visual encoder size with smaller training datasets (30M and 60M) might indicate overfitting or the need for more data to fully utilize the larger encoder size.
</details>
Figure 6: The validation loss curves of different LLMs with different training data sizes. As the training data size increases, the loss gap narrows to near zero when the visual encoder size reaches a certain threshold.
Observation 4:
Scaling the LLM consistently improves multimodal performance, following the typical LLM scaling law. However, increasing the visual encoder size shows diminishing returns, suggesting that the MLLM’s performance is limited by the LLM’s capacity.
3.3.2 Scaling up Visual Encoder and LLM Together
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Visual Encoder Size vs LLM Size
### Overview
The image is a line chart that plots the relationship between Visual Encoder Size (in Billions) and LLM Size (in Billions). The chart shows a positive correlation, indicating that as the LLM size increases, the Visual Encoder size also increases.
### Components/Axes
* **Title:** Visual Encoder Size vs LLM Size
* **X-axis:**
* Label: LLM Size (B)
* Scale: 0.5, 2, 7
* **Y-axis:**
* Label: Visual Encoder Size (B)
* Scale: 0.30, 0.60, 1.20
* **Data Series:** A single line in a muted purple color.
### Detailed Analysis
The line chart shows the following data points:
* **LLM Size 0.5B:** Visual Encoder Size is approximately 0.3B.
* **LLM Size 2B:** Visual Encoder Size is approximately 0.6B.
* **LLM Size 7B:** Visual Encoder Size is approximately 1.2B.
The line slopes upward, indicating a positive correlation between LLM Size and Visual Encoder Size. The slope appears steeper between LLM Size 2B and 7B compared to the slope between 0.5B and 2B.
### Key Observations
* There is a clear positive correlation between LLM Size and Visual Encoder Size.
* The increase in Visual Encoder Size is more pronounced as LLM Size increases from 2B to 7B.
### Interpretation
The chart suggests that larger Large Language Models (LLMs) require larger visual encoders. The non-linear increase suggests that the relationship is not directly proportional: the visual encoder size increases faster (in absolute terms) as the LLM size grows. This could be due to the increased complexity and data processing demands of larger LLMs, requiring a more powerful visual encoder to effectively handle visual inputs.
</details>
Figure 7: Relationship of visual encoder size and LLM size. The optimal visual encoder size increases log-linearly with the LLM size.
The diminishing returns from increasing the visual encoder size suggest the existence of an optimal encoder size for a given LLM. We define this optimal size as the smallest encoder whose loss difference compared to an encoder twice its size is less than $\lambda=1\%$ of the loss with the 75M encoder (the smallest used in our experiments). Fig. 7 shows the relationship between visual encoder size and LLM size.
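The stopping rule above can be written down directly. A sketch (the loss values in the example are synthetic, not the paper's measurements):

```python
def optimal_encoder_size(losses: dict, lam: float = 0.01) -> int:
    """losses: encoder size (M params, on a doubling grid) -> validation loss.
    Returns the smallest size whose loss gap to the 2x-larger encoder is
    below lam * loss(smallest encoder), following the paper's criterion."""
    sizes = sorted(losses)
    threshold = lam * losses[sizes[0]]
    for s in sizes:
        if 2 * s in losses and losses[s] - losses[2 * s] < threshold:
            return s
    return sizes[-1]

# Synthetic example on the 75M..1200M doubling grid used in the experiments.
example = {75: 2.00, 150: 1.90, 300: 1.86, 600: 1.845, 1200: 1.840}
```

On this synthetic curve the gap first drops below 1% of the 75M loss at 300M, so 300M would be selected as the optimal size.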
The logarithm of the optimal visual encoder size scales linearly with the logarithm of the LLM size, indicating that both components should be scaled jointly for balanced performance. This highlights the suboptimality of compositional MLLMs, which typically use a fixed visual encoder size across varying LLM scales.
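Fitting a power law to the three points read off Fig. 7 (0.5B→0.3B, 2B→0.6B, 7B→1.2B; approximate read-offs of ours, not reported numbers) makes the joint-scaling relationship concrete: on these values, the optimal encoder size grows roughly as the square root of the LLM size.

```python
import math

# (LLM size in B, optimal visual encoder size in B), read off Fig. 7.
pairs = [(0.5, 0.3), (2.0, 0.6), (7.0, 1.2)]

# Least-squares slope in log-log space gives the power-law exponent
# alpha in V_opt ~ N_llm ** alpha.
xs = [math.log(n) for n, _ in pairs]
ys = [math.log(v) for _, v in pairs]
mx, my = sum(xs) / 3, sum(ys) / 3
alpha = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
```

The fitted exponent comes out near 0.5; this exact value is an artifact of our approximate read-offs, but the log-linear relationship itself matches the paper's finding.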
Observation 5:
The optimal size of the visual encoder scales log-linearly with the LLM size, indicating that both components should be scaled jointly. This further implies that using a single pre-trained visual encoder across a wide range of LLM scales, as existing compositional MLLMs do, is suboptimal.
4 NaViL: A Novel Native MLLM with Strong Capabilities
4.1 Architecture
<details>
<summary>x8.png Details</summary>

### Visual Description
## Diagram: Multimodal Large Language Model Architecture
### Overview
The image presents a diagram of a multimodal large language model architecture. It illustrates how text and image data are processed and integrated within the model. The diagram includes components for text processing, image encoding, and multimodal fusion.
### Components/Axes
* **Legend:** Located at the top-right of the image.
* Blue square: "text token"
* Green square: "image token"
* **Main Components:**
* **Left Side:** Shows a repeating block of layers labeled "FFN-MMoE", "RMSNorm", "MHA-MMoE", and "RMSNorm", repeated N times.
* **Center:** Depicts the core architecture, including "Multimodal Large Language Models", "Visual Encoder", and input/output text boxes.
* **Right Side:** Shows a series of layers labeled "MLP Connector", "Transformer Layer d", "Transformer Layer 1", and "Patch Embed".
* **Input Text:** "Please provide a more detailed description of the cat in the picture."
* **Input Image:** Three images of a cartoon cat, labeled "Visual Multi-scale Packing" and "Pomi".
* **Output Text:** "The cat wears a yellow flower on its head, a golden necklace around its neck, and pink blushes on its cheeks."
### Detailed Analysis
* **Left Side (Text Processing):**
* A repeating block of layers is shown, with an input of mixed text and image tokens.
* The block consists of:
* "MHA-MMoE" (Multi-Head Attention - Mixture of Experts)
* "RMSNorm" (Root Mean Square Normalization)
* "FFN-MMoE" (Feed Forward Network - Mixture of Experts)
* "RMSNorm" (Root Mean Square Normalization)
* The output of the block is fed back into the input via a skip connection (addition).
* The entire block is repeated N times, as indicated by "x N".
* **Center (Multimodal Fusion):**
* The input text "Please provide a more detailed description of the cat in the picture" is fed into the "Multimodal Large Language Models" block.
* The "Visual Encoder" processes the input images (Visual Multi-scale Packing) and feeds the encoded image tokens into the "Multimodal Large Language Models" block.
* The "Multimodal Large Language Models" block outputs the text "The cat wears a yellow flower on its head, a golden necklace around its neck, and pink blushes on its cheeks."
* **Right Side (Image Processing):**
* The input image of the cat is processed by a "Patch Embed" layer.
* The output of the "Patch Embed" layer is fed into a series of transformer layers, starting with "Transformer Layer 1" and ending with "Transformer Layer d".
* The output of the final transformer layer is fed into an "MLP Connector" layer.
* The output of the "MLP Connector" layer is the image token representation.
### Key Observations
* The diagram illustrates a multimodal model that combines text and image data.
* The model uses a visual encoder to process images and generate image tokens.
* The model uses a repeating block of layers (MHA-MMoE, RMSNorm, FFN-MMoE, RMSNorm) for text processing.
* The model uses transformer layers for image processing.
* The model uses a mixture of experts (MMoE) architecture in both the text and image processing components.
### Interpretation
The diagram illustrates a multimodal large language model designed to generate text descriptions from images. The model takes both text prompts and visual inputs, processes them through separate encoders (text and visual), and then fuses the information to generate a coherent text description. The use of Mixture of Experts (MMoE) suggests that the model can selectively activate different parts of the network based on the input, allowing it to handle a wide range of image and text combinations. The skip connection in the text processing block likely helps with gradient flow during training and allows the model to retain information from earlier layers. The visual multi-scale packing suggests that the model is designed to handle images of different sizes and resolutions.
</details>
Figure 8: Architecture of NaViL. As a native MoE-extended MLLM, NaViL can be trained end-to-end and supports input images of any resolution.
Based on the above studies, we construct NaViL with the optimal settings from Sec. 3.1. The architecture is shown in Fig. 8. NaViL natively supports input images of any resolution. Images are first encoded into visual tokens by the visual encoder and the MLP projector, then concatenated with the textual tokens to form the multimodal token sequence fed into the LLM. Special tokens <begin_of_image> and <end_of_image> are inserted before and after each image token subsequence to mark the image boundaries, and a special <end_of_line> token is inserted at the end of each row of image tokens to encode spatial position information.
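The special-token layout can be sketched as follows (token names follow the paper; representing image tokens as a grid of strings is purely illustrative):

```python
def wrap_image_tokens(token_grid: list) -> list:
    """Wrap a 2D grid of image tokens with the special tokens described
    above: <begin_of_image>, one <end_of_line> per row, <end_of_image>."""
    seq = ["<begin_of_image>"]
    for row in token_grid:
        seq.extend(row)       # image tokens for one row of patches
        seq.append("<end_of_line>")  # marks the row boundary
    seq.append("<end_of_image>")
    return seq

seq = wrap_image_tokens([["v00", "v01"], ["v10", "v11"]])
```

A 2x2 token grid thus becomes an 8-token sequence, with the `<end_of_line>` markers carrying the row structure into the otherwise flat sequence.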
Visual Multi-scale Packing is further introduced to improve model performance during inference. Specifically, given an input image $I_{0}\in\mathbb{R}^{H_{0}\times W_{0}\times 3}$ and a downsampling rate $\tau$ , a multi-scale image sequence $\{I_{i}\in\mathbb{R}^{H_{i}\times W_{i}\times 3}\}_{i=0}^{n}$ is obtained by repeatedly downsampling the original image (i.e., $H_{i}=\tau^{i}H_{0}$ , $W_{i}=\tau^{i}W_{0}$ ) until its area falls below a given threshold. The images in the sequence are processed separately by the visual encoder, and the resulting visual token embeddings $\{x_{i,v}\}_{i=0}^{n}$ are concatenated and fed to the LLM. A special <end_of_scale> token is inserted after each scale's tokens to mark the boundary between scales.
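A minimal sketch of the scale schedule, using the $\tau=\sqrt{2}/2$ from the implementation details (the area threshold and the choice to keep only scales at or above it are assumptions; the paper does not state these here):

```python
import math

def multiscale_sizes(h0: int, w0: int, tau: float = math.sqrt(2) / 2,
                     min_area: int = 256 * 256):
    """Return the (H_i, W_i) = (tau^i H_0, tau^i W_0) sequence,
    stopping once the next scale's area would fall below min_area."""
    sizes = [(h0, w0)]
    while True:
        h, w = sizes[-1]
        h2, w2 = round(tau * h), round(tau * w)
        if h2 * w2 < min_area:
            return sizes
        sizes.append((h2, w2))

sizes = multiscale_sizes(1024, 1024)
```

With these assumed values, a 1024x1024 input yields five scales down to 256x256, each encoded separately before concatenation.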
4.2 Training
Stage 1: Multi-modal Generative Pre-training. In this stage, the model is initially trained on 500 million image-text pairs to develop comprehensive multimodal representations. Of these training samples, 300 million are directly sampled from web-scale datasets (i.e. Laion-2B Datasets:Laion-5b , Coyo-700M kakaobrain2022coyo-700m , Wukong gu2022wukong and SA-1B TransF:SAM ) while the remaining 200 million consist of images from these datasets paired with captions synthesized by existing MLLMs (i.e. InternVL-8B VLM:InternVL ). During this process, the textual parameters of the model remain frozen, with only the newly-added vision-specific parameters (i.e., the visual encoder, MLP projector, and MoE visual experts) being trainable.
To enhance the alignment between visual and textual features in more complex multimodal contexts, the model is subsequently trained on 185 million high-quality samples consisting of both multimodal alignment data and pure language data. In this phase, the textual parameters within the self-attention layers are also unfrozen, enabling more refined cross-modal integration.
Stage 2: Supervised Fine-tuning. Following common practice in developing MLLMs, an additional supervised fine-tuning stage is adopted. In this stage, all parameters are unfrozen and trained on a smaller (i.e., 68 million samples) but higher-quality multimodal dataset.
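The recipe can be summarized as a freezing schedule. The module names below are illustrative, not NaViL's actual parameter names:

```python
# Which parameter groups train in each phase (illustrative names).
SCHEDULE = {
    "stage1_pretrain": {
        "data": "500M image-text pairs (300M web + 200M re-captioned)",
        "trainable": ["visual_encoder", "mlp_projector", "moe_visual_experts"],
    },
    "stage1_alignment": {
        "data": "185M high-quality multimodal + pure-text samples",
        "trainable": ["visual_encoder", "mlp_projector", "moe_visual_experts",
                      "llm_self_attention"],
    },
    "stage2_sft": {
        "data": "68M supervised fine-tuning samples",
        "trainable": ["all"],
    },
}
```

The key progression is that textual parameters stay frozen at first, the LLM's self-attention unfreezes for alignment, and everything trains during SFT.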
5 Experiment
5.1 Experimental Setups
Table 1: Comparison with existing MLLMs on general MLLM benchmarks. “#A-Param” denotes the number of activated parameters. † InternVL-2.5-2B adopts the same LLM and high-quality data with NaViL, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from another 6B large encoder. Bold and underline indicate the best and the second-best performance among native MLLMs, respectively. * denotes our reproduced results. For MME, we sum the perception and cognition scores. Average scores are computed by normalizing each metric to a range between 0 and 100.
| Model | #A-Param | Avg | MMVet | MMMU | MMB | MME | MathVista | OCRBench | CCB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Compositional MLLMs: | | | | | | | | | |
| MobileVLM-V2-1.7B chu2024mobilevlm | 1.7B | $-$ | $-$ | $-$ | 57.7 | $-$ | $-$ | $-$ | $-$ |
| MobileVLM-V2-3B chu2024mobilevlm | 3.0B | $-$ | $-$ | $-$ | 63.2 | $-$ | $-$ | $-$ | $-$ |
| Mini-Gemini-2B VLM:MiniGemini | 3.5B | $-$ | 31.1 | 31.7 | 59.8 | 1653 | 29.4 | $-$ | $-$ |
| MM1-3B-MoE-Chat VLM:MM1 | 3.5B | $-$ | 42.2 | 38.6 | 70.8 | 1772 | 32.6 | $-$ | $-$ |
| DeepSeek-VL-1.3B lu2024deepseekvl | 2.0B | 42.3 | 34.8 | 32.2 | 64.6 | 1532 | 31.1 | 409 | 37.6 |
| PaliGemma-3B beyer2024paligemma | 2.9B | 45.6 | 33.1 | 34.9 | 71.0 | 1686 | 28.7 | 614 | 29.6 |
| MiniCPM-V-2 yao2024minicpm | 2.8B | 51.1 | 41.0 | 38.2 | 69.1 | 1809 | 38.7 | 605 | 45.3 |
| InternVL-1.5-2B VLM:InternVL-1.5 | 2.2B | 54.7 | 39.3 | 34.6 | 70.9 | 1902 | 41.1 | 654 | 63.5 |
| Qwen2VL-2B Qwen2vl | 2.1B | 58.6 | 49.5 | 41.1 | 74.9 | 1872 | 43.0 | 809 | 53.7 |
| † InternVL-2.5-2B chen2024expanding | 2.2B | 67.0 | 60.8 | 43.6 | 74.7 | 2138 | 51.3 | 804 | 81.7 |
| Native MLLMs: | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | 21.4 | $-$ | 10.7 | $-$ | $-$ | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | $-$ | $-$ | 1260 | 34.4 | $-$ | $-$ |
| Chameleon-7B team2024chameleon (results from mono_internvl ) | 7B | 13.9 | 8.3 | 25.4 | 31.1 | 170 | 22.3 | 7 | 3.5 |
| EVE-7B diao2024EVE | 7B | 33.0 | 25.6 | 32.3 | 49.5 | 1483 | 25.2 | 327 | 12.4 |
| EVE-7B (HD) diao2024EVE | 7B | 37.0 | 25.7 | 32.6 | 52.3 | 1628 | 34.2 | 398 | 16.3 |
| Emu3 emu3 | 8B | $-$ | 37.2 | 31.6 | 58.5 | $-$ | $-$ | 687 | $-$ |
| VoRA vora | 7B | $-$ | 33.7 | 32.2 | 64.2 | 1674 | $-$ | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 33.7 | 32.0 | 61.3 | 1655 | $-$ | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 53.2 | 45.0 | 39.3 | 66.3 | 1709 | 60.0 * | 702 | 30.8* |
| SAIL lei2025sail | 7B | 53.7 | 46.3 | 38.6* | 70.1 | 1719 | 57.0 | 783 | 24.3* |
| Mono-InternVL mono_internvl | 1.8B | 56.4 | 40.1 | 33.7 | 65.5 | 1875 | 45.7 | 767 | 66.3 |
| NaViL-2B (ours) | 2.4B | 67.1 | 78.3 | 41.8 | 71.2 | 1822 | 50.0 | 796 | 83.9 |
Evaluation Benchmarks. We evaluate NaViL and existing MLLMs on a broad range of multimodal benchmarks. Specifically, MLLM benchmarks encompass MMVet Datasets:MM-vet , MMMU val Datasets:MMMU , MMBench-EN test Datasets:MMBench , MME Datasets:MME , MathVista MINI Datasets:Mathvista , OCRBench liu2023ocrbench , and CCBench Datasets:MMBench . Visual question answering benchmarks include TextVQA val Datasets:TextVQA , ScienceQA-IMG test Datasets:ScienceQA , GQA test dev Datasets:GQA , DocVQA test mathew2021docvqa , AI2D test Datasets:AI2D , ChartQA test Datasets:ChartQA , and InfographicVQA test mathew2022infographicvqa . These benchmarks cover various domains, such as optical character recognition (OCR), chart and document understanding, multi-image understanding, real-world comprehension, etc.
Implementation Details. By default, NaViL-2B is built upon InternLM2-1.8B 2023internlm , using its weights to initialize the text-side parameters; the text tokenizer and conversation format are also inherited. The total parameter count is 4.2B, of which 2.4B are activated (including the 0.6B visual encoder). Input images are first padded so that their height and width are multiples of 32, and the stride of the patch embedding layer is set to 16. The visual encoder adopts bidirectional attention and 2D-RoPE to capture global spatial relationships, while the LLM adopts causal attention and 1D-RoPE to better inherit its capabilities. In the pre-training phase, the global batch size is 7000 for stage 1 and 4614 for stage 2, respectively. The downsampling rate $\tau$ of visual multi-scale packing is set to $\sqrt{2}/2$ . To demonstrate the scaling capability of our approach, we also train NaViL-9B based on Qwen3-8B TransF:Qwen3 . More details are given in the appendix.
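Given the padding-to-multiples-of-32 rule and the stride-16 patch embedding, the visual token count for one image at one scale follows directly. A sketch (the rounding-up helper is our formulation of the padding rule):

```python
def pad_to_multiple(x: int, m: int = 32) -> int:
    """Round x up to the nearest multiple of m."""
    return ((x + m - 1) // m) * m

def num_visual_tokens(h: int, w: int, stride: int = 16) -> int:
    """Tokens produced for one image: pad H and W to multiples of 32,
    then one token per stride x stride patch."""
    return (pad_to_multiple(h) // stride) * (pad_to_multiple(w) // stride)
```

For example, a 448x448 image needs no padding and yields 28x28 = 784 tokens, while a 100x100 image pads to 128x128 and yields 64.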
5.2 Main Results
Table 2: Comparison with existing MLLMs on visual question answering benchmarks. † InternVL-2.5-2B adopts the same LLM and high-quality data with NaViL, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from another 6B large encoder. * denotes our reproduced results. Bold and underline indicate the best and the second-best performance among native MLLMs, respectively.
| Model | #A-Param | Avg | TextVQA | SQA-I | GQA | DocVQA | AI2D | ChartQA | InfoVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Compositional MLLMs: | | | | | | | | | |
| MobileVLM-V2-3B chu2024mobilevlm | 3.0B | $-$ | 57.5 | 70.0 | 66.1 | $-$ | $-$ | $-$ | $-$ |
| Mini-Gemini-2B VLM:MiniGemini | 3.5B | $-$ | 56.2 | $-$ | $-$ | 34.2 | $-$ | $-$ | $-$ |
| MM1-3B-MoE-Chat VLM:MM1 | 3.5B | $-$ | 72.9 | 76.1 | $-$ | $-$ | $-$ | $-$ | $-$ |
| DeepSeek-VL-1.3B lu2024deepseekvl | 2.0B | $-$ | 57.8 | $-$ | $-$ | $-$ | 51.5 | $-$ | $-$ |
| PaliGemma-3B beyer2024paligemma | 2.9B | $-$ | 68.1 | $-$ | $-$ | $-$ | 68.3 | $-$ | $-$ |
| MiniCPM-V-2 yao2024minicpm | 2.8B | $-$ | 74.1 | $-$ | $-$ | 71.9 | 62.9 | $-$ | $-$ |
| InternVL-1.5-2B VLM:InternVL-1.5 | 2.2B | 71.7 | 70.5 | 84.9 | 61.6 | 85.0 | 69.8 | 74.8 | 55.4 |
| Qwen2VL-2B Qwen2vl | 2.1B | 73.1 | 79.7 | 78.2* | 60.3* | 90.1 | 74.7 | 73.5 | 65.5 |
| † InternVL-2.5-2B chen2024expanding | 2.2B | 76.5 | 74.3 | 96.2 | 61.2 | 88.7 | 74.9 | 79.2 | 60.9 |
| Native MLLMs: | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | $-$ | $-$ | $-$ | $-$ | 64.5 | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | 73.3 | $-$ | $-$ | 61.4 | $-$ | $-$ |
| Chameleon-7B team2024chameleon (results from mono_internvl ) | 7B | 17.9 | 4.8 | 47.2 | $-$ | 1.5 | 46.0 | 2.9 | 5.0 |
| EVE-7B diao2024EVE | 7B | 40.8 | 51.9 | 63.0 | 60.8 | 22.0 | 48.5 | 19.5 | 20.0 |
| EVE-7B (HD) diao2024EVE | 7B | 54.6 | 56.8 | 64.9 | 62.6 | 53.0 | 61.0 | 59.1 | 25.0 |
| Emu3 emu3 | 8B | 67.6 | 64.7 | 89.2 | 60.3 | 76.3 | 70.0 | 68.6 | 43.8 |
| VoRA vora | 7B | $-$ | 56.3 | 75.9 | $-$ | $-$ | 65.6 | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 58.7 | 72.0 | $-$ | $-$ | 61.1 | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 71.7 | 71.1 | 96.2 | 62.9 | 77.4* | 74.8 | 73.9 | 45.8* |
| SAIL lei2025sail | 7B | 71.5 | 77.1 | 93.3 | 58.0* | 78.4* | 76.7 | 69.7* | 47.3 * |
| Mono-InternVL mono_internvl | 1.8B | 70.1 | 72.6 | 93.6 | 59.5 | 80.0 | 68.6 | 73.7 | 43.0 |
| NaViL-2B (ours) | 2.4B | 75.1 | 76.9 | 95.0 | 59.8 | 85.4 | 74.6 | 78.0 | 56.0 |
In Tab. 1, we compare our model with existing MLLMs across 7 multimodal benchmarks. Compositional MLLMs demonstrate superior overall performance to native MLLMs; for example, InternVL-2.5-2B outperforms existing native MLLMs on most benchmarks, indicating that current native MLLMs still have significant room for improvement. In contrast, our proposed NaViL exceeds all existing native MLLMs in overall performance with a relatively small parameter size. Compared to the compositional baseline InternVL-2.5-2B, which uses the same LLM, NaViL also achieves comparable performance on most benchmarks. It is worth noting that the 300M visual encoder used by InternVL-2.5-2B is distilled from InternViT-6B VLM:InternVL , a pre-trained encoder with a significantly larger parameter size. This demonstrates the superiority of our visual design and visual parameter scaling strategies.
In Tab. 2, we further compare the performance of our model with existing MLLMs on mainstream visual question answering tasks. NaViL’s average performance still leads previous state-of-the-art native MLLMs and is roughly on par with compositional baselines that require pre-trained encoders. Specifically, in tests such as DocVQA Datasets:OCRVQA , ChartQA Datasets:ChartQA and InfoVQA mathew2022infographicvqa , NaViL significantly outperforms the previous state-of-the-art native MLLM, demonstrating the superiority of using an optimal size visual encoder in processing high-resolution images. However, NaViL’s performance still has some gap compared to the best compositional MLLMs. We believe that higher-quality instruction data and more powerful LLMs will further narrow this gap.
5.3 Qualitative Experiments
To further analyze the characteristics of native MLLMs, we visualize the attention maps of different LLM layers when using encoders of 150M and 1.2B parameters, as shown in Fig. 9. Two findings emerge. First, similar to previous native MLLMs mono_internvl , despite having an encoder, the attention patterns in shallow layers still exhibit obvious locality, gradually shifting toward global information as depth increases. For example, with a 150M encoder, image tokens in the first layer tend to attend to spatially adjacent tokens. However, when the visual encoder is scaled up to 1.2B, visual tokens in shallow layers already attend more to global information. This indicates that a sufficiently large visual encoder can better pre-extract high-level semantic information from the entire image.
Secondly, from a cross-modal interaction perspective, a larger visual encoder also facilitates earlier interaction between visual and language features. When using a 1.2B visual encoder, the attention weights between visual tokens and text tokens in the first layer are significantly higher than those in the 150M counterpart. Earlier interaction is more beneficial for feature alignment between modalities, thus providing an explanatory perspective for the improved performance achieved when using larger encoder sizes. We believe these findings will provide beneficial insights for developing native MLLMs. More visualizations can be found in the supplementary materials.
Figure 9: Visualization of attention maps in LLM-1.8B with different encoder sizes (i.e., 150M and 1.2B). Text and image tokens are in blue and green, respectively. A larger encoder allows the LLM to attend to global patterns at shallow layers while maintaining higher attention to textual tokens.
6 Conclusion
This paper systematically investigates native end-to-end training for MLLMs, examining its design space and scaling properties under data constraints. Our study reveals three key insights: 1) initialization from pre-trained LLMs, combined with visual encoders and an MoE architecture, significantly improves performance; 2) unlike in text-only LLM scaling, the benefit of scaling the visual encoder is bounded by the LLM’s capacity; 3) the optimal encoder size scales log-proportionally with the LLM size. Based on these findings, we propose NaViL, a native MLLM that achieves competitive performance on diverse multimodal benchmarks, outperforming existing native MLLMs and rivaling compositional ones. We hope these insights will inspire future research on next-generation MLLMs.
Limitations and Broader Impacts. Due to limited computational resources, this paper only investigates the scaling properties of native MLLMs up to 9B parameters. Experiments at larger scales (e.g., 30B, 70B, or 100B parameters) would further validate this scaling trend. In addition, this paper focuses only on the visual and linguistic modalities; future research may explore broader modalities and provide insights beyond the current visual-linguistic paradigm.
Acknowledgments
The work is supported by the National Key R&D Program of China (NO. 2022ZD0161300, and NO. 2022ZD0160102), by the National Natural Science Foundation of China (U24A20325, 62321005, 62376134), and by the China Postdoctoral Science Foundation (No. BX20250384).
References
- [1] Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265–279. PMLR, 2023.
- [2] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In ICCV, pages 8948–8957, 2019.
- [3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- [5] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
- [6] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.
- [7] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
- [8] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- [9] ChameleonTeam. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [11] Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling. arXiv preprint arXiv:2407.06438, 2024.
- [12] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [14] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv:2404.16821, 2024.
- [15] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv: 2312.14238, 2023.
- [16] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
- [17] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pages 845–855, 2018.
- [18] Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- [19] Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.
- [20] Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. arXiv preprint arXiv:2502.06788, 2025.
- [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
- [22] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv: 2306.13394, 2023.
- [23] Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.
- [24] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- [25] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. NeurIPS, 35:26418–26431, 2022.
- [26] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
- [28] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo. Version 0.1. https://doi.org/10.5281/zenodo.5143773, 2021. DOI: 10.5281/zenodo.5143773.
- [29] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [30] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016.
- [31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. arXiv: 2304.02643, 2023.
- [32] Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. arXiv preprint arXiv:2504.10462, 2025.
- [33] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022.
- [34] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv: 2403.18814, 2024.
- [35] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv: 2310.03744, 2023.
- [36] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv: 2307.06281, 2023.
- [39] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
- [40] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
- [41] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv: 2310.02255, 2023.
- [42] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
- [43] Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In CVPR, 2025.
- [44] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024.
- [45] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022.
- [46] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022.
- [47] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021.
- [48] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. MM1: methods, analysis & insights from multimodal LLM pre-training. arXiv: 2403.09611, 2024.
- [49] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019.
- [50] OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
- [51] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2023.
- [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
- [54] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [55] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022.
- [56] Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, and Alaaeldin El-Nouby. Scaling laws for native multimodal models. arXiv preprint arXiv:2504.07951, 2025.
- [57] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
- [58] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
- [59] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- [60] Qwen Team. Qwen3 blog. https://qwenlm.github.io/blog/qwen3/, 2025.
- [61] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [62] Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora. arXiv preprint arXiv:2503.20680, 2025.
- [63] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [64] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, pages 14408–14419, 2023.
- [65] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need. arXiv: 2409.18869, 2024.
- [66] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- [67] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
- [68] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- [70] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv: 2311.16502, 2023.
- [71] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, pages 12104–12113, 2022.
- [72] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
Technical Appendices and Supplementary Material
Appendix A NaViL-9B: Scaling up to 9B parameters
To further demonstrate the scaling capability of our method, we train NaViL-9B based on Qwen3-8B TransF:Qwen3 . The total number of activated parameters is 9.2B, of which 1.2B belong to the visual encoder. The training recipe is similar to that of NaViL-2B, as shown in Tab. 8, except that visual multi-scaling packing is disabled in the first sub-stage of pre-training for acceleration.
Tab. 3 presents a comparison of the total training tokens required by our method versus two compositional counterparts. Notably, our approach achieves comparable performance while using substantially fewer training tokens, demonstrating improved training efficiency.
Table 3: Comparison between NaViL and existing MLLMs on the number of training tokens.
| Model | Unimodal pre-training tokens | Multimodal training tokens | Total |
| --- | --- | --- | --- |
| Qwen2.5VL bai2025qwen2 | unknown | 4.1T | >4.1T |
| InternVL2.5-8B InternVL-2.5 | >3.3T | 140B | >3.5T |
| NaViL-2B (ours) | 0 | 800B | 800B |
| NaViL-9B (ours) | 0 | 450B* | 450B |

*Due to limited computational resources and time, the current version of NaViL-9B in this paper is only trained with 450B tokens.
Performance results on multimodal and visual question answering benchmarks are shown in Tab. 4. At a similar parameter size, NaViL-9B outperforms all existing native MLLMs by a large margin on almost all benchmarks. Compared with the compositional baseline InternVL-2.5-8B of similar size, NaViL-9B also achieves competitive performance. These results show that our native MLLM can be scaled up to larger parameter sizes while achieving consistent performance gains.
Appendix B More discussions on Compositional MLLMs and Native MLLMs
Figure 10: Paradigm Comparison between Compositional MLLMs and Native MLLMs. Compositional MLLMs adopt different training objectives and strategies (e.g. Contrastive Loss or Next-Token-Prediction) to pre-train the visual encoder and LLM separately, while native MLLMs optimize both image and text components in an end-to-end manner using a unified training objective (i.e. Next-Token-Prediction).
Fig. 10 further illustrates the difference between compositional MLLMs and native MLLMs. In compositional MLLMs, components are typically initialized from separate unimodal pre-training, with different training objectives and strategies for the LLM and the visual encoder. For example, the visual encoder can be trained with an image-text contrastive objective (e.g., CLIP radford2021clip , SigLIP zhai2023siglip ) or a self-supervised objective (e.g., DINOv2 oquab2023dinov2 ). The complexity of such a training process makes scaling harder. In contrast, as discussed in shukor2025scaling , a native MLLM optimizes both image and text modalities end-to-end with a unified training objective, next-token prediction (NTP). This avoids introducing additional bias and significantly simplifies scaling.
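The two training objectives contrasted above can be sketched in a few lines of NumPy. The snippet below is a minimal illustration (not the exact losses used by any cited model): a symmetric InfoNCE loss of the kind used for CLIP-style encoder pre-training in compositional pipelines, and the plain token-level cross-entropy that serves as the single NTP objective of native MLLMs.

```python
import numpy as np

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/text pairs (compositional sketch).
    Row i of img_emb is assumed to match row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))
    def ce(l):                                   # row-wise CE against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()
    return 0.5 * (ce(logits) + ce(logits.T))     # image->text and text->image

def next_token_prediction_loss(logits, targets):
    """Token-level cross-entropy: the unified NTP objective of native MLLMs.
    logits: (T, vocab) scores; targets: (T,) next-token indices."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()
```

The key point the figure makes is structural: the compositional paradigm needs both losses in separate stages, while the native paradigm trains everything under the second one.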
Appendix C More Related Works
Research on Neural Scaling Laws.
The foundational work on neural scaling laws began in natural language processing (NLP), where openai2020scaling established predictable power-law relationships demonstrating that the loss ($L$) scales reliably with model size ($N$) and data size ($D$), and that larger, decoder-only Transformer models are more compute-efficient. Subsequent work ghorbani2021scaling extended this research to encoder-decoder architectures, observing consistent scaling exponents on neural machine translation (NMT) tasks. In the vision domain, zhai2022scaling confirmed the applicability of scaling laws to Vision Transformers (ViTs), systematically demonstrating continuous performance improvement by scaling both model size (up to 2 billion parameters) and training data. Most recently, these principles have been generalized to large multimodal models: aghajanyan2023scaling developed scaling laws that unify the contributions of text, image, and speech modalities by explicitly modeling synergy and competition as an additive term, and shukor2025scaling explored native multimodal models (NMMs) with mixtures of experts (MoEs), finding an unbalanced scaling law that suggests scaling training tokens ($D$) is more critical than scaling active parameters ($N$) as the compute budget grows.
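As a concrete instance of the power-law behavior described above, the sketch below evaluates an additive scaling law of the form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$. The default coefficients are illustrative text-only values from the published literature, not fits from this paper; they serve only to show the qualitative behavior (loss decreases in both $N$ and $D$ and is bounded below by the irreducible term $E$).

```python
def power_law_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Additive scaling law L(N, D) = E + A/N**alpha + B/D**beta.

    N: number of model parameters; D: number of training tokens.
    Default coefficients are illustrative text-only values from the
    literature, used here only to demonstrate the functional form.
    """
    return E + A / N**alpha + B / D**beta
```

Evaluating such a law on a grid of $(N, D)$ pairs under a fixed compute budget is how compute-optimal trade-offs (and, in the multimodal setting, encoder-vs-LLM allocations) are typically derived.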
Appendix D Implementation Details
The hyperparameters of the model architecture for NaViL-2B and NaViL-9B are listed in Tab. 6, while the hyperparameters of their training recipes are provided in Tab. 7 and Tab. 8, respectively. The high-quality multimodal data used in pre-training and supervised fine-tuning is from InternVL-2.5 InternVL-2.5 and spans diverse domains such as image captioning, general question answering, multi-turn dialogue, charts, OCR, documents, and knowledge; the pure-language data is primarily from InternLM2.5 cai2024internlm2 .
Table 4: Comparison between NaViL-9B and existing MLLMs on multimodal benchmarks. “#A-Param” denotes the number of activated parameters. † InternVL-2.5-8B adopts the same high-quality data as NaViL-9B, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from a larger 6B encoder. * denotes our reproduced results. Bold and underline indicate the best and second-best performance among native MLLMs, respectively. For MME, we sum the perception and cognition scores. Average scores are computed by normalizing each metric to a range between 0 and 100.
| Compositional MLLMs: | | | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileVLM-V2 chu2024mobilevlm | 1.7B | $-$ | $-$ | $-$ | 57.7 | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ |
| MobileVLM-V2 chu2024mobilevlm | 3.0B | $-$ | $-$ | $-$ | 63.2 | $-$ | $-$ | $-$ | 57.5 | $-$ | $-$ | $-$ | $-$ |
| Mini-Gemini VLM:MiniGemini | 3.5B | $-$ | 31.1 | 31.7 | 59.8 | 1653 | 29.4 | $-$ | 56.2 | 34.2 | $-$ | $-$ | $-$ |
| MM1-MoE-Chat VLM:MM1 | 3.5B | $-$ | 42.2 | 38.6 | 70.8 | 1772 | 32.6 | $-$ | 72.9 | $-$ | $-$ | $-$ | $-$ |
| DeepSeek-VL lu2024deepseekvl | 2.0B | $-$ | 34.8 | 32.2 | 64.6 | 1532 | 31.1 | 409 | 57.8 | $-$ | 51.5 | $-$ | $-$ |
| PaliGemma beyer2024paligemma | 2.9B | $-$ | 33.1 | 34.9 | 71.0 | 1686 | 28.7 | 614 | 68.1 | $-$ | 68.3 | $-$ | $-$ |
| MiniCPM-V-2 yao2024minicpm | 2.8B | $-$ | 41.0 | 38.2 | 69.1 | 1809 | 38.7 | 605 | 74.1 | 71.9 | 62.9 | $-$ | $-$ |
| InternVL-1.5 VLM:InternVL-1.5 | 2.2B | 61.3 | 39.3 | 34.6 | 70.9 | 1902 | 41.1 | 654 | 70.5 | 85.0 | 69.8 | 74.8 | 55.4 |
| Qwen2VL Qwen2vl | 2.1B | 67.3 | 49.5 | 41.1 | 74.9 | 1872 | 43.0 | 809 | 79.7 | 90.1 | 74.7 | 73.5 | 65.5 |
| InternVL-2.5 chen2024expanding | 2.2B | 69.6 | 60.8 | 43.6 | 74.7 | 2138 | 51.3 | 804 | 74.3 | 88.7 | 74.9 | 79.2 | 60.9 |
| Qwen2VL Qwen2vl | 8.2B | 77.1 | 62.0 | 54.1 | 83.0 | 2327 | 58.2 | 866 | 84.3 | 94.5 | 83.0 | 83.0 | 76.5 |
| Qwen2.5-VL bai2025qwen2 | 8.2B | 80.2 | 67.1 | 58.6 | 83.5 | 2347 | 68.2 | 864 | 84.9 | 95.7 | 83.9 | 87.3 | 82.6 |
| † InternVL-2.5 chen2024expanding | 8.1B | 77.3 | 62.8 | 56.0 | 84.6 | 2344 | 64.4 | 822 | 79.1 | 91.9 | 84.5 | 84.8 | 75.7 |
| Native MLLMs: | | | | | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | 21.4 | $-$ | 10.7 | $-$ | $-$ | $-$ | $-$ | $-$ | 64.5 | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | $-$ | $-$ | 1260 | 34.4 | $-$ | $-$ | $-$ | 61.4 | $-$ | $-$ |
| Chameleon-7B team2024chameleon (performance from mono_internvl ) | 7B | 14.0 | 8.3 | 25.4 | 31.1 | 170 | 22.3 | 7 | 4.8 | 1.5 | 46.0 | 2.9 | 5.0 |
| EVE-7B diao2024EVE | 7B | 34.6 | 25.6 | 32.3 | 49.5 | 1483 | 25.2 | 327 | 51.9 | 22.0 | 48.5 | 19.5 | 20.0 |
| EVE-7B (HD) diao2024EVE | 7B | 45.2 | 25.7 | 32.6 | 52.3 | 1628 | 34.2 | 398 | 56.8 | 53.0 | 61.0 | 59.1 | 25.0 |
| Emu3 emu3 | 8B | $-$ | 37.2 | 31.6 | 58.5 | $-$ | $-$ | 687 | 64.7 | 76.3 | 70.0 | 68.6 | 43.8 |
| VoRA vora | 7B | $-$ | 33.7 | 32.2 | 64.2 | 1674 | $-$ | $-$ | 56.3 | $-$ | 65.6 | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 33.7 | 32.0 | 61.3 | 1655 | $-$ | $-$ | 58.7 | $-$ | 61.1 | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 62.3 | 45.0 | 39.3 | 66.3 | 1709 | 60.0* | 702 | 71.1 | 77.4* | 74.8 | 73.9 | 45.8* |
| SAIL lei2025sail | 7B | 63.7 | 46.3 | 38.6* | 70.1 | 1719 | 57.0 | 783 | 77.1 | 78.4* | 76.7 | 69.7* | 47.3* |
| Mono-InternVL mono_internvl | 1.8B | 60.6 | 40.1 | 33.7 | 65.5 | 1875 | 45.7 | 767 | 72.6 | 80.0 | 68.6 | 73.7 | 43.0 |
| NaViL-2B (ours) | 2.4B | 68.8 | 78.3 | 41.8 | 71.2 | 1822 | 50.0 | 796 | 76.9 | 85.4 | 74.6 | 78.0 | 56.0 |
| NaViL-9B (ours) | 9.2B | 77.0 | 79.6 | 54.7 | 76.5 | 2225 | 66.7 | 837 | 77.2 | 90.6 | 82.4 | 85.4 | 70.2 |
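The averaging rule in the Tab. 4 caption can be made concrete. A minimal sketch, assuming MME's ceiling is 2800 (perception max 2000 plus cognition max 800) and that other metrics are already on a 0-100 scale; any metric with a different ceiling would be rescaled the same way:

```python
def normalize(value: float, max_score: float = 100.0) -> float:
    """Rescale a benchmark score to the [0, 100] range before averaging."""
    return value / max_score * 100.0

def average(normed_scores):
    """Unweighted mean of already-normalized scores."""
    return sum(normed_scores) / len(normed_scores)

# MME sums perception (max 2000) and cognition (max 800) scores,
# so its assumed ceiling is 2800; e.g. NaViL-9B's MME entry of 2225:
mme_normed = normalize(2225, max_score=2800)
```

Metrics reported as percentages pass through `normalize` unchanged, so only score-style benchmarks such as MME affect the rescaling.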
Table 5: Comparison of NaViL and existing native MLLMs on three common NLP tasks. Except for Chameleon, models are evaluated using OpenCompass toolkit opencompass2023 .
| InternLM2-Chat 2023internlm | 1.8B | 47.1 | 46.1 | 13.9 |
| --- | --- | --- | --- | --- |
| Qwen3-8B (non-thinking) TransF:Qwen3 | 8B | 76.5 | 76.8 | 71.1 |
| EVE diao2024EVE | 7B | 43.9 | 33.4 | 0.7 |
| Chameleon team2024chameleon | 7B | 52.1 | - | 11.5 |
| Mono-InternVL mono_internvl | 2B | 45.1 | 44.0 | 12.3 |
| NaViL-9B (ours) | 9.2B | 74.9 | 75.1 | 66.2 |
Table 6: Hyper-Parameters of Model Architecture.
| Component | Hyper-Parameter | NaViL-2B | NaViL-9B |
| --- | --- | --- | --- |
| Visual encoder | # Params | 0.6B | 1.2B |
| | depth | 24 | 32 |
| | width | 1472 | 1792 |
| | MLP width | 5888 | 7168 |
| | # attention heads | 23 | 28 |
| LLM (w/ MoE) | # experts | 2 | 2 |
| | # A-Params | 1.8B | 8.0B |
| | depth | 24 | 36 |
| | width | 2048 | 4096 |
| | MLP width | 8192 | 12288 |
| | # attention heads | 16 | 32 |
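The visual-encoder sizes in Tab. 6 can be roughly cross-checked from the listed depth, width, and MLP width alone. A back-of-the-envelope sketch, ignoring embeddings, norms, and biases, and assuming a standard non-gated two-projection MLP (both assumptions on our part):

```python
def transformer_params(depth: int, width: int, mlp_width: int) -> int:
    """Rough parameter count for a plain transformer stack:
    4*w^2 per layer for the Q/K/V/O projections plus
    2*w*mlp for the MLP up/down projections."""
    per_layer = 4 * width * width + 2 * width * mlp_width
    return depth * per_layer

# Visual encoder configs from Tab. 6
navil_2b_enc = transformer_params(24, 1472, 5888)  # reported as 0.6B
navil_9b_enc = transformer_params(32, 1792, 7168)  # reported as 1.2B
print(f"{navil_2b_enc / 1e9:.2f}B, {navil_9b_enc / 1e9:.2f}B")
```

The estimates land at roughly 0.62B and 1.23B, consistent with the 0.6B and 1.2B figures in the table.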
Table 7: Hyper-parameters for training NaViL-2B.
| Configuration | Multi-modal Generative Pre-training (S1) | Supervised | |
| --- | --- | --- | --- |
| S1.1 | S1.2 | Fine-tuning (S2) | |
| Maximum number of image patches | $4096$ | $12188$ | $24576$ |
| Training steps | $70$ k | $40$ k | $30$ k |
| Global batch size | $7,000$ | $4,614$ | $2,234$ |
| Weight decay | $0.05$ | $0.1$ | $0.01$ |
| Learning rate schedule | constant with warm-up | constant with warm-up | cosine decay |
| Peak learning rate | $5\times10^{-5}$ | $5\times10^{-5}$ | $2\times10^{-5}$ |
| Visual Multi-scale Packing | ✓ | | |
| LLM max sequence length | $16,384$ | | |
| Warm-up steps | $200$ | | |
| Optimizer | AdamW | | |
| Optimizer hyperparameters | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1\times10^{-8}$ | | |
| Gradient accumulation | $1$ | | |
| Numerical precision | $\mathtt{bfloat16}$ | | |
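The schedule rows of Tab. 7 combine into a single per-stage learning-rate function. A minimal sketch, assuming linear warm-up from zero and cosine decay ending at zero on the stage's final step (neither detail is stated in the table):

```python
import math

def lr_at(step: int, total_steps: int, peak: float,
          warmup: int = 200, schedule: str = "constant") -> float:
    """Learning rate at `step` within one training stage of Tab. 7.

    Assumptions (not stated in the table): warm-up is linear from 0,
    and cosine decay reaches 0 at the stage's last step.
    """
    if step < warmup:
        return peak * step / warmup
    if schedule == "constant":  # pre-training stages S1.1 / S1.2
        return peak
    # SFT stage S2: cosine decay from peak over the remaining steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

# S1.1: constant 5e-5 after 200 warm-up steps, 70k steps in total
assert lr_at(35_000, 70_000, 5e-5) == 5e-5
# S2: cosine decay from 2e-5 over 30k steps; halfway point is ~1e-5
assert abs(lr_at(15_100, 30_000, 2e-5, schedule="cosine") - 1e-5) < 1e-9
```

Each stage is called with its own `total_steps` and `peak` from the table, so the three stages are independent schedules rather than one continuous curve.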
Table 8: Hyper-parameters for training NaViL-9B.
| Configuration | Multi-modal Generative Pre-training (S1) | Supervised | |
| --- | --- | --- | --- |
| S1.1 | S1.2 | Fine-tuning (S2) | |
| Maximum number of image patches | $4096$ | $12188$ | $24576$ |
| Training steps | $50$ k | $33$ k | $6$ k |
| Weight decay | $0.05$ | $0.1$ | $0.01$ |
| Global batch size | $10,300$ | $1,792$ | $3,520$ |
| Visual Multi-scale Packing | ✗ | ✓ | ✓ |
| Learning rate schedule | constant with warm-up | constant with warm-up | cosine decay |
| Peak learning rate | $5\times10^{-5}$ | $5\times10^{-5}$ | $2\times10^{-5}$ |
| LLM max sequence length | $16,384$ | | |
| Warm-up steps | $200$ | | |
| Optimizer | AdamW | | |
| Optimizer hyperparameters | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1\times10^{-8}$ | | |
| Gradient accumulation | $1$ | | |
| Numerical precision | $\mathtt{bfloat16}$ | | |
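Both recipes cap each packed sequence at 16,384 LLM tokens. The packing idea can be illustrated with a simple first-fit scheme; this is only a sketch of generic sequence packing, not the actual visual multi-scale packing algorithm used in NaViL:

```python
def greedy_pack(sample_lengths, max_len=16_384):
    """First-fit greedy packing of variable-length samples into
    sequences of at most `max_len` tokens (illustrative only)."""
    bins = []  # each bin: [remaining_capacity, [sample indices]]
    for idx, length in enumerate(sample_lengths):
        if length > max_len:
            raise ValueError(f"sample {idx} exceeds max_len")
        for b in bins:
            if b[0] >= length:  # fits in an existing sequence
                b[0] -= length
                b[1].append(idx)
                break
        else:  # no open sequence has room; start a new one
            bins.append([max_len - length, [idx]])
    return [b[1] for b in bins]

packed = greedy_pack([9_000, 8_000, 7_000, 400])
# samples 0 and 2 share one sequence (16,000 tokens); 1 and 3 share another
```

Packing like this keeps the 16,384-token context full, which matters when image token counts vary widely across samples.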
Appendix E The NLP capability
We also evaluate the NLP capability of our model on three popular NLP tasks, as shown in Tab. 5. Thanks to the modality-specific MoE architecture, NaViL preserves the NLP capabilities of the LLM it is initialized from (Qwen3-8B). Despite not using a large amount of high-quality text data, NaViL performs well on common NLP tasks and shows much stronger NLP capabilities than other native MLLMs, demonstrating its data efficiency.
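The preservation of NLP ability follows from how a modality-specific MoE routes tokens: with the two experts listed in Tab. 6, text tokens can keep the original LLM FFN while visual tokens use a separate expert. The hard-routing sketch below is our illustration of this idea, not NaViL's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 16  # toy hidden and FFN widths

def ffn(params, x):
    """A ReLU MLP standing in for one expert."""
    w1, w2 = params
    return np.maximum(x @ w1, 0.0) @ w2

# Expert 0: the base LLM's text FFN; expert 1: a new visual expert.
text_expert = (rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
vis_expert = (rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))

def modality_moe(x, is_visual):
    """Hard routing: token i goes to the visual expert iff is_visual[i]."""
    out = np.empty_like(x)
    out[~is_visual] = ffn(text_expert, x[~is_visual])
    out[is_visual] = ffn(vis_expert, x[is_visual])
    return out

x = rng.normal(size=(5, d))
mask = np.array([True, True, False, False, False])  # 2 image, 3 text tokens
y = modality_moe(x, mask)
# Text tokens pass through exactly the original FFN, so on pure-text
# inputs the layer is indistinguishable from the base LLM's.
assert np.allclose(y[2:], ffn(text_expert, x[2:]))
```

Because routing is by modality rather than by a learned gate, text-only behavior is unchanged by construction, which is consistent with the Tab. 5 scores tracking Qwen3-8B closely.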
Appendix F More Qualitative Results
More visualization results of multimodal understanding are provided below.
Image Captioning and Visual Question Answering
<details>
<summary>figures/vis/vis-understanding-conventional-center.jpg Details</summary>

### Visual Description
## Architectural Photograph: San Diego Convention Center
### Overview
The image is an aerial photograph of the San Diego Convention Center, showcasing its architectural design and surrounding environment. The convention center is a large, modern building complex located along the waterfront. The photograph captures the building's structure, including its distinctive glass-enclosed sections, geometric roofline, and surrounding landscaping. The background includes the San Diego Bay, cityscape, and clear blue sky.
### Components/Axes
* **Building Structure:** The convention center features a combination of flat-roofed sections and arched, glass-enclosed walkways. The building's exterior is primarily light-colored, possibly concrete or a similar material.
* **Surrounding Environment:** The convention center is bordered by palm tree-lined streets, green spaces, and the San Diego Bay. In the background, there is a cityscape with various buildings, including high-rise structures.
* **Architectural Details:** The building incorporates geometric shapes, such as triangular supports and circular windows. The glass-enclosed walkways provide natural light and views of the surrounding area.
* **Landscaping:** The grounds around the convention center are landscaped with grass, trees, and decorative elements, including a colorful sculpture.
### Detailed Analysis or Content Details
* **Convention Center Design:** The building's design includes a series of arched, glass-enclosed walkways that run along one side of the structure. These walkways are connected to the main building and provide access to different sections of the convention center.
* **Roofline:** The roofline of the convention center is characterized by a combination of flat sections and geometric shapes. The roof appears to be made of a light-colored material, possibly concrete or a similar material.
* **Surrounding Streets:** The streets surrounding the convention center are lined with palm trees, which are a common feature of the San Diego landscape. The streets appear to be well-maintained and provide access to the convention center and other nearby attractions.
* **Waterfront Location:** The convention center is located along the San Diego Bay, which provides scenic views and access to various water-based activities. The bay is visible in the background of the photograph, with boats and other watercraft visible on the water.
* **Cityscape:** The cityscape in the background of the photograph includes a variety of buildings, including high-rise structures. The cityscape provides a sense of the scale and density of the city of San Diego.
* **Sculpture:** A colorful, abstract sculpture is located on the grounds of the convention center. The sculpture adds a decorative element to the landscape and provides a visual focal point.
### Key Observations
* The San Diego Convention Center is a large, modern building complex that is well-integrated into its surrounding environment.
* The building's design incorporates a variety of architectural elements, including glass-enclosed walkways, geometric shapes, and a light-colored exterior.
* The convention center is located along the San Diego Bay, which provides scenic views and access to various water-based activities.
* The surrounding streets are lined with palm trees, which are a common feature of the San Diego landscape.
### Interpretation
The photograph showcases the San Diego Convention Center as a significant architectural landmark within the city. The building's design reflects a modern aesthetic, while its location along the waterfront provides a connection to the natural environment. The photograph highlights the convention center's role as a major event venue and its integration into the urban landscape of San Diego. The presence of palm trees, the bay, and the cityscape all contribute to the overall sense of place and identity of the city. The image suggests that the convention center is a key component of San Diego's economy and tourism industry.
</details>
Prompt: Describe this image in detail.

Response: The image shows an aerial view of a modern architectural complex situated near a coastal area. The building has a distinctive design with a series of curved, wave-like structures on its roof, giving it a futuristic appearance. The exterior is primarily composed of light-colored materials, possibly concrete or stone, with large glass windows. In front of the building, there is a wide, multi-lane road lined with palm trees, suggesting a warm, coastal climate. The road appears to be a major thoroughfare, with designated bike lanes and pedestrian pathways. There are also train tracks running parallel to the road, indicating the presence of a public transportation system. The background features a large body of water, likely a bay or ocean, with several boats and yachts docked at a marina. The coastline is lined with more buildings, and there are hills or mountains visible in the distance, adding to the scenic view. The overall setting suggests a bustling urban area with a focus on modern infrastructure and a strong connection to the waterfront.

Prompt: Which city is this building most likely in?

Response: This building is most likely in San Diego, California. The architectural style and the surrounding landscape, including the water and palm trees, are characteristic of San Diego.
<details>
<summary>figures/vis/vis-understanding-worldcup.png Details</summary>

### Visual Description
## Photograph: Argentina World Cup Victory Celebration
### Overview
The image captures a jubilant scene of the Argentina national football team celebrating their victory, presumably at the World Cup. The players are tightly packed together, many with arms raised in triumph, and fireworks are visible in the background. They are wearing their team jerseys and gold medals.
### Components/Axes
* **Foreground:** The Argentina football team players.
* **Background:** Fireworks display.
* **Center:** Lionel Messi holding the World Cup trophy.
* **Attire:** The players are wearing the Argentina national team jersey (light blue and white stripes), gold medals, and some are draped in a black robe.
### Detailed Analysis or Content Details
* **Players:** A large group of players are visible, all wearing the Argentina jersey. Prominently featured is Lionel Messi, who is holding the World Cup trophy aloft.
* **Jerseys:** The jerseys are light blue and white striped, with numbers visible on some of them (e.g., 11, 17, 5).
* **Medals:** Each player is wearing a gold medal around their neck on a red lanyard.
* **Trophy:** Lionel Messi is holding the gold World Cup trophy above his head.
* **Fireworks:** Bright, golden fireworks are exploding in the background, adding to the celebratory atmosphere.
* **Facial Expressions:** The players exhibit expressions of joy, excitement, and triumph. Many are shouting and smiling.
* **Black Robe:** Lionel Messi is wearing a black robe over his jersey.
### Key Observations
* The image is a snapshot of a moment of intense celebration and victory.
* The presence of the World Cup trophy and the team's attire clearly indicate a significant achievement.
* The fireworks and jubilant expressions of the players amplify the sense of triumph.
### Interpretation
The photograph captures the essence of victory and celebration. The Argentina football team's expressions of joy, combined with the fireworks and the presence of the World Cup trophy, convey the magnitude of their achievement. The image is a powerful representation of teamwork, dedication, and the culmination of hard work. The black robe worn by Messi is likely a traditional garment presented to him in Qatar, the host country, adding a cultural element to the celebration.
</details>
Prompt: Who is the person in black in the middle?

Response: The person in black in the middle is Lionel Messi.

Prompt: What are they celebrating about?

Response: The image shows a group of soccer players celebrating with a trophy, wearing medals and jerseys. They are likely celebrating a victory in a soccer tournament or championship.
<details>
<summary>figures/vis/vis-understanding-gta5.png Details</summary>

### Visual Description
## Photograph: Street Scene
### Overview
The image is a street scene, likely from a video game, showing a man standing on a sidewalk with buildings and vehicles in the background. The scene appears to be set in an urban environment during daylight.
### Components/Axes
* **Foreground:** A man standing on a sidewalk, facing away from the viewer.
* **Midground:** A street with a car parked on the side and another car driving in the distance. Buildings line the street.
* **Background:** More buildings, telephone poles, and trees.
* **Signs:** Several signs are visible on the buildings, including "WINE" and "ON THE BONE MEAT MARKET - BUTCHERS".
* **Mini-map:** A mini-map is visible in the bottom-left corner.
### Detailed Analysis or Content Details
* **Building on the Left:** The building on the left has a sign that reads "WINE".
* **Building in the Center:** The building in the center has multiple signs that read "ON THE BONE MEAT MARKET - BUTCHERS". There is also a sign that reads "ON THE BONE MEAT MARKET - DELI".
* **Vehicles:** A dark-colored car is parked on the side of the street. A white car is driving in the distance.
* **Mini-map:** The mini-map shows a street layout with a marker labeled "A". There are also icons indicating health, armor, and special ability.
* **Man:** The man is wearing a blue shirt and jeans.
* **Street:** The street is cracked and shows signs of wear.
* **Sidewalk:** The sidewalk is also cracked and shows signs of wear.
### Key Observations
* The scene appears to be set in a run-down urban environment.
* The signs on the buildings suggest that the area is a commercial district.
* The presence of vehicles indicates that the area is accessible by car.
* The mini-map suggests that the scene is from a video game.
### Interpretation
The image depicts a typical urban street scene, possibly in a low-income area. The presence of the "ON THE BONE" meat market suggests a local business catering to the community. The overall impression is one of urban decay and neglect. The mini-map confirms that this is a scene from a video game, likely Grand Theft Auto V, given the visual style.
</details>
Prompt: What is the content in this image? Response: The image appears to be a screenshot from a video game, likely Grand Theft Auto V (GTA V), given the visual style and interface elements. It shows a character standing on a sidewalk in a small town or suburban area. The environment includes a street with tram tracks, a few buildings with signs, and some cars parked along the road. The buildings have signs for businesses like "Wine" and "On The Bone." The scene has a somewhat deserted look, with no other characters or significant activity visible.
OCR
<details>
<summary>figures/vis/vis-understanding-nutritional.jpeg Details</summary>

### Visual Description
## Nutritional Information Table
### Overview
The image presents a nutritional information table for a food product, specifically honey. It details the average quantity of various nutrients per serving (10g) and per 100g. It also lists the ingredients and storage/usage instructions.
### Components/Axes
The table is structured with the following columns:
* **Nutrient:** Lists the nutrient type (Energy, Protein, Fat, Carbohydrates, Sugars, Sodium).
* **Average Quantity Per Serving:** Shows the quantity of each nutrient in a 10g serving.
* **Average Quantity Per 100g:** Shows the quantity of each nutrient per 100g of the product.
The table also includes the following information:
* **Serving size:** 10g
* **Servings per package:** 25
* **Ingredients:** 100% Pure New Zealand Honey
* **Storage Instructions:** Keep Stored Below 20 °C, BPA-Free Container.
* **Usage Instructions:** Not suitable for children under 1 year of age.
### Detailed Analysis or Content Details
Here's a breakdown of the nutritional information:
* **Energy:** 143kJ per serving, 1430kJ per 100g
* **Protein:** less than 0.1g per serving, 0.16g per 100g
* **Fat, total:** less than 0.1g per serving, 0.37g per 100g
* **- saturated:** less than 0.1g per serving, 0.26g per 100g
* **Carbohydrates:** 8.31g per serving, 83.1g per 100g
* **Sugars:** 8.11g per serving, 81.1g per 100g
* **Sodium:** 0.41mg per serving, 4.04mg per 100g
### Key Observations
* The nutritional information is provided for both a single serving (10g) and per 100g of the product.
* The primary components are carbohydrates and sugars, which are expected for honey.
* The amounts of protein and fat are relatively low.
* The ingredients list confirms that the product is 100% Pure New Zealand Honey.
* The storage instructions recommend keeping the product below 20 degrees Celsius in a BPA-free container.
* The product is not suitable for children under 1 year of age.
### Interpretation
The nutritional information table provides a clear overview of the composition of the honey product. The high carbohydrate and sugar content indicates that it is a significant source of energy. The low protein and fat content suggest that it is not a primary source of these nutrients. The additional information regarding ingredients, storage, and usage provides important context for consumers. The warning about children under 1 year of age is likely due to the risk of infant botulism associated with honey.
</details>
Prompt: Extract all the text in this image. Response: NUTRITIONAL INFORMATION Serving size: 10 g Servings per package: 25 Average Quantity Per Serving | Average Quantity Per 100g Energy | 143kJ | 1430 kJ Protein | less than 0.1g | 0.16g Fat, total | less than 0.1g | 0.37g - saturated | less than 0.1g | 0.26g Carbohydrates | 8.31g | 83.1g Sugars | 8.11g | 81.1g Sodium | 0.41mg | 4.04mg INGREDIENTS: 100% Pure New Zealand Honey Keep Stored Below 20 deg°C, BPA-Free Container. Not suitable for children under 1 year of age.
<details>
<summary>figures/vis/vis-understanding-receipt.jpeg Details</summary>

### Visual Description
## Receipt: Chapelbank Hotel Lunch
### Overview
The image is a receipt from the Chapelbank Hotel for a lunch order, presumably for 5 people. It lists the items ordered, their individual prices, and the total amount due. A handwritten note at the top indicates "Food = Lunch x 5 people".
### Components/Axes
* **Header:**
* Handwritten note: "Food = Lunch x 5 people"
* Hotel Name: CHAPELBANK HOTEL
* Address: 69 EAST HIGH STREET, FORFAR DD8 2EP
* Telephone Number: TEL NO 01307 463151
* Separator: Series of asterisks
* **Main Body:**
* Table Number: 5
* Restaurant Identifier: RESTAURANT
* Items Ordered:
* 3x SOUP @ 3.95 each
* Lunch Main Lasagne (V)
* MISC FOOD
* 2x CHIPS / ONION RINGS @ 1.75 each
* -FOOD
* Prices:
* Soup Total: 11.85
* Lasagne: 6.95
* Misc Food: 6.95
* Chips/Onion Rings Total: 3.50
* Food Deduction: 29.25--
* Previous Balance: 29.25
* **Footer:**
* Subtotal: 29.25
### Detailed Analysis or Content Details
* **Handwritten Note:** "Food = Lunch x 5 people" suggests the receipt is for a lunch order for 5 individuals.
* **Hotel Information:** The receipt clearly identifies the Chapelbank Hotel and provides its address and phone number.
* **Table Number:** The meal was served at table number 5.
* **Order Details:**
* 3 soups were ordered, each costing 3.95, totaling 11.85.
* One order of Lasagne (V) was ordered, costing 6.95. The (V) likely indicates that it is a vegetarian option.
* MISC FOOD was ordered, costing 6.95.
* Two orders of chips/onion rings were ordered, each costing 1.75, totaling 3.50.
* There is a food deduction of 29.25.
* **Financial Summary:**
* The previous balance was 29.25.
* The subtotal is 29.25.
### Key Observations
* The receipt includes a handwritten note indicating the order is for lunch for 5 people.
* The itemized list provides a breakdown of the food ordered and their respective prices.
* There is a food deduction of 29.25, which is unusual and could indicate a discount or correction.
* The previous balance is equal to the subtotal, which is also unusual.
### Interpretation
The receipt documents a lunch order at the Chapelbank Hotel for table 5. The handwritten note suggests the order was for 5 people. The itemized list provides a breakdown of the food ordered and their prices. The presence of a "FOOD" deduction of 29.25 is notable and could indicate a discount, a correction, or a credit applied to the bill. The previous balance being equal to the subtotal is also unusual and could indicate that the deduction was applied to a previous balance. Further context would be needed to fully understand the meaning of these entries.
</details>
Prompt: Describe the image.

Response: The image is a receipt from Chapelbank Hotel for a lunch order for five people. The items listed on the receipt include 3x soup at £3.95 each, totaling £11.85; lunch main course at £6.95; lasagne (V) at £6.95; misc food at £6.95; and 2x chips/onion rings at £1.75 each, totaling £3.50. The subtotal for the food is £29.25. The receipt also includes a previous balance of £29.25, making the total amount due £29.25. The hotel’s address is 69 East High Street, Forfar DD8 2EP, and the telephone number is 01307 463151.

Prompt: How many people are attending this lunch?

Response: There are 5 people attending this lunch. This is indicated at the top of the receipt where it says "Food = Lunch x 5 people".
<details>
<summary>figures/vis/vis-understanding-road.jpeg Details</summary>

### Visual Description
## Road Signage Image: Highway Exit Signs
### Overview
The image is a photograph taken from a vehicle on a highway, showing a series of overhead road signs indicating upcoming exits. The signs provide information about exit numbers, destinations, and distances. The image also captures other vehicles on the highway and the surrounding environment.
### Components/Axes
* **Road Signs (Overhead):**
* **Sign 1 (Left):**
* "LEFT" (small sign above)
* "EXIT 45" (small sign above)
* "267 WEST"
* "TOLL"
* "Reston"
* "Herndon"
* "Dulles Airport"
* "NO TOLL"
* "1 MILE"
* **Sign 2 (Center):**
* "EXIT 46 A"
* "123 SOUTH"
* "Chain Br Rd"
* "Tysons Corner"
* "1/2 MILE"
* **Sign 3 (Right):**
* "EXIT 46 B"
* "123 NORTH"
* "Chain Br Rd"
* "McLean"
* "EXIT" (with a downward arrow) "ONLY" (yellow background)
* **Sign 4 (Far Right):**
* "Leesburg"
* "EXIT 45"
* **Vehicles:** Cars and a bus are visible on the highway.
* **Environment:** The background includes buildings, trees, and a cloudy sky. A sound barrier wall is visible on the right side of the road.
* **Road Markings:** White lane markings are visible on the highway.
### Detailed Analysis or Content Details
* **Exit 45 (Left):** Indicates an upcoming exit to the left for Route 267 West, which is a toll road, leading to Reston, Herndon, and Dulles Airport. The sign specifies that the Dulles Airport route is a "NO TOLL" route and is 1 mile away.
* **Exit 46 A (Center):** Indicates an upcoming exit for Route 123 South, leading to Chain Bridge Road and Tysons Corner, which is 1/2 mile away.
* **Exit 46 B (Right):** Indicates an upcoming exit for Route 123 North, leading to Chain Bridge Road and McLean. The sign specifies that this is an "EXIT ONLY" lane.
* **Exit 45 (Far Right):** Indicates an upcoming exit for Leesburg.
* **Vehicle Details:**
* A dark-colored Honda Odyssey is in the foreground with a license plate that appears to read "226404".
* A blue bus is visible in the right lane with text on the back that includes "WWW.PMTGHOME.COM" and "DRIVE FOR PAITG".
* **Environmental Details:** The sky is partly cloudy. Buildings are visible in the distance on the left. A sound barrier wall is present on the right side of the highway.
### Key Observations
* The signs provide clear directional information for drivers.
* The exits are numbered sequentially (45, 46A, 46B).
* Distances to the exits are provided in miles.
* The image captures a typical highway scene with multiple vehicles and road infrastructure.
### Interpretation
The image provides a snapshot of highway signage, demonstrating how drivers are guided to specific destinations. The signs are designed to be easily readable and provide essential information such as exit numbers, destinations, and distances. The presence of multiple signs indicates a complex interchange with various route options. The "EXIT ONLY" designation for Exit 46B suggests a dedicated lane for exiting the highway. The image highlights the importance of clear and concise signage for safe and efficient navigation on highways.
</details>
Prompt: Where is the left exit going to?

Response: The left exit is going to Reston, Herndon, and Dulles Airport.

Prompt: Which road leads to the EXIT 46 B?

Response: The road that leads to EXIT 46 B is Chain Br Rd North.
Multi-lingual OCR
<details>
<summary>figures/vis/vis-understanding-zuowen.jpeg Details</summary>

### Visual Description
## Handwritten Text: Essay
### Overview
The image contains a handwritten essay in Chinese, likely a school assignment. The essay discusses themes of focus, goals, and enjoying life, using metaphors of moonlight and stars. The text is written within a grid, suggesting it's a structured writing exercise.
### Components/Axes
* **Header:** "作文(60分)" - "Essay (60 points)"
"题目专注目标,享受生命," - "Topic: Focus on goals, enjoy life."
* **Body:** The main text of the essay, written in Chinese characters within a grid.
### Detailed Analysis or Content Details
The essay text is as follows:
* "目标之光,如月光清辉;生命之" - "The light of goals is like the clear radiance of moonlight; the life of"
* "辉,如群星闪耀。失去了群星,月色" - "radiance is like the glittering of stars. If you lose the stars, the moonlight"
* "便请冷孤寂;不见了明月,群星也势" - "becomes cold and lonely; if you don't see the bright moon, the stars will also"
* "必黯然失色。只有专注于月光,方有" - "inevitably lose their color. Only by focusing on the moonlight can you have"
* "方向;只有欣赏群星,方可享受生命" - "direction; only by appreciating the stars can you enjoy life."
* "星月交辉,夜空撩人;专注目标,享" - "The stars and moon shine together, the night sky is alluring; focus on goals, enjoy"
* "受生命,方可体味人间清欢。" - "life, and you can experience the joy of the world."
### Key Observations
* The essay uses the metaphor of moonlight (月光) to represent goals and stars (群星) to represent the enjoyment of life.
* The essay emphasizes the importance of both focusing on goals and appreciating the joys of life.
* The handwriting is neat and legible, suggesting a careful effort.
### Interpretation
The essay explores the balance between pursuing goals and enjoying life. It suggests that focusing on goals (moonlight) provides direction, while appreciating the joys of life (stars) brings fulfillment. The essay implies that both are necessary for a complete and satisfying existence. The concluding lines emphasize the allure of the world when both goals and enjoyment are present.
</details>
Prompt: 请先提取里面的文本,再翻译为英文。 Response: 文本内容: 作文(60分) 题目:专注目标,享受生命。 目标之光,如月光清辉;生命之辉,如群星闪耀。失去了群星,月色便显得冷寂;不见了明月,群星也势必黯然失色。只有专注于月光,方有方向;只有欣赏群星,方可享受生命。星月交辉,夜空撩人;专注目标,享受生命,方可体味人间清欢。 翻译为英文: Essay (60 points) Title: Focus on your goal, enjoy life. The light of the goal is like the clear glow of moonlight; the brilliance of life is like the dazzling stars. Without the stars, the moonlight becomes cold and desolate; without the moon, the stars will inevitably become dull. Only by focusing on the moonlight can you find direction; only by appreciating the stars can you enjoy life. The stars and moon shine together, making the night sky enchanting; focusing on your goal, enjoying life, allows you to savor the quiet joy of life.
Document Understanding
<details>
<summary>figures/vis/vis-understanding-docvqa.png Details</summary>

### Visual Description
## Contract: KBA Marketing Group - 2001 Venue Sponsorship Agreement
### Overview
This document is a venue sponsorship agreement between KBA Marketing Group and Vitucci's, Inc. for the year 2001. It outlines the terms of the agreement, including sponsorship funds, extra value items, and venue information.
### Components/Axes
* **Header:** KBA MARKETING GROUP - 2001 VENUE SPONSORSHIP AGREEMENT
* **Agreement Details:**
* Date of Agreement: 3/1/01
* Term of Agreement: January 1, 2001 - December 31, 2001
* Sponsorship Program: Camel Club Program ("CCP")
* **I. VENUE INFORMATION:**
* Venue Name: Vitucci's
* Venue Address: 1832 East North Avenue
* City / State / Zip Code: Milwaukee WI 53202
* FEIN or SSN: PERSONAL/CONFIDENTIAL MATERIAL REDACTED
* Venue Owner Name: Vitucci's, Inc.
* Venue Owner Category: Corporation
* Venue Code: 4388
* **II. SPONSORSHIP FUNDS AND EXTRA VALUE ITEMS:**
* Special Events Fund:
* $1,595.00 to offset costs for a minimum of 1 Special Events from Jan. 1 - June 30, 2001.
* $1,595.00 to offset costs for a minimum of 1 Special Events from July 1 - Dec. 31, 2001.
* Printed Materials Fund: $810.00 to offset costs for the printing and shipping of custom flyers, as well as the purchases of local media, including quarter-page advertisements, and additional printed material and display advertising, as available, at the discretion of KBA, from Jan. 1 - Dec. 31, 2001.
* Total Direct Payments: $2,000.00, payable on or before the following payment dates in the following amounts:
* March 31, 2001: $500.00
* June 30, 2001: $500.00
* September 30, 2001: $500.00
* December 31, 2001: $500.00
* TOTAL SPONSORSHIP FUNDS: $6,000.00 \[This amount includes all sub-totals set forth above]
* **Extra Value Items (at no cost to Venue):**
* Tobacco License Fees: All state and local retail tobacco licensing fees for Venue (if Venue is currently without a license), plus assistance with tobacco licensing application procedures;
* Bar Essentials: Branded ashtrays, beverage napkins, generic branded matchbooks, and other bar essentials delivered to Venue for use within the Venue, as available, at the discretion of KBA;
* Custom Matchbooks: 3 cases of custom branded matchbooks that display the Venue's name and logo, delivered to Venue;
* Flyer Distribution: Public distribution of flyers for Special Events and activities at Venue, at discretion of KBA;
* Cigarette Delivery: Free delivery of Camel cigarette brand styles (the "Product") purchased by Venue Owner for resale in Venue; Free (Gratis) Product that KBA will distribute in Venue to adult smokers (as permitted by law).
* **Footer:**
* COPY ORIGINAL
* -1-
* Initials: KBA [Signature] Venue Owner [Signature]
* Source: [https://www.industrydocuments.ucsf.edu/docs/fpmj0226](https://www.industrydocuments.ucsf.edu/docs/fpmj0226)
* 52579 6435
### Detailed Analysis
The agreement outlines the financial contributions from KBA to Vitucci's, Inc. for the year 2001. These contributions are divided into several categories:
* **Special Events Fund:** A total of $3,190.00 is allocated for special events, split evenly between the first and second halves of the year.
* **Printed Materials Fund:** $810.00 is allocated for printing and shipping of custom flyers.
* **Total Direct Payments:** $2,000.00 is allocated, paid in four installments of $500.00 each, at the end of March, June, September, and December.
* **Total Sponsorship Funds:** The total sponsorship funds amount to $6,000.00, which includes all the sub-totals mentioned above.
In addition to the financial contributions, KBA will also provide several "Extra Value Items" at no cost to Vitucci's, including tobacco license fees, bar essentials, custom matchbooks, flyer distribution, and cigarette delivery.
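The stated sub-totals can be cross-checked with a few lines of Python; the figures are taken directly from the agreement, while the variable names are my own:

```python
# Figures from the 2001 KBA / Vitucci's sponsorship agreement
special_events = 1_595.00 * 2   # two half-year Special Events allocations
printed_materials = 810.00      # Printed Materials Fund
direct_payments = 500.00 * 4    # quarterly installments (Mar/Jun/Sep/Dec)

total = special_events + printed_materials + direct_payments
print(total)  # 6000.0, matching the stated TOTAL SPONSORSHIP FUNDS
```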
### Key Observations
* The agreement is for a one-year period, from January 1, 2001, to December 31, 2001.
* The sponsorship program is specifically for the "Camel Club Program (CCP)".
* The agreement includes both financial contributions and in-kind services (Extra Value Items).
* The FEIN or SSN of the venue owner is redacted, indicating confidential information.
### Interpretation
This agreement represents a comprehensive sponsorship arrangement between KBA Marketing Group and Vitucci's, Inc. The financial contributions and extra value items provided by KBA are intended to support Vitucci's participation in the Camel Club Program. The agreement is structured to provide both direct financial support and in-kind services, such as tobacco license fees and promotional materials. The inclusion of specific payment dates and deliverables suggests a well-defined and mutually beneficial partnership. The redacted FEIN/SSN highlights the importance of protecting sensitive personal or business information within such agreements. The focus on cigarette and tobacco-related items indicates the nature of the Camel Club Program and the target audience.
</details>
Prompt: What is the duration of this venue sponsorship agreement? Response: January 1, 2001 - December 31, 2001
Prompt: Where is Vitucci’s located (complete address)? Response: 1832 East North Avenue, Milwaukee WI 53202
Prompt: What cigarette brand is mentioned in the agreement for delivery and distribution? Response: Camel cigarette brand styles purchased by Venue Owner for resale in Venue
Prompt: Who has discretion over the distribution of flyers for special events? Response: KBA Marketing Group ("KBA") has discretion over the distribution of flyers for special events.
LaTeX and Math
<details>
<summary>figures/vis/vis-understanding-math.png Details</summary>

### Visual Description
## Equation: GRPO Objective Function
### Overview
The image presents a mathematical equation, specifically the objective function for GRPO (likely an algorithm or model). It involves expectation, probabilities, summations, minimum functions, clipping, and a KL divergence term.
### Components/Axes
* **Left-hand side:** $\mathcal{J}_{GRPO}(\theta)$ - This represents the GRPO objective function parameterized by $\theta$.
* **Expectation:** $\mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]$ - The expectation is taken over the distribution $P(Q)$ for variable $q$, and the distribution $\pi_{\theta_{old}}(O|q)$ for the set of variables $\{o_i\}_{i=1}^G$.
* **Summation:** $\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}$ - A double summation is performed. The outer sum is over $i$ from 1 to $G$, and the inner sum is over $t$ from 1 to $|o_i|$.
* **Minimum Function:** $\min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\epsilon, 1+\epsilon \right) \hat{A}_{i,t} \right)$ - This part calculates the minimum of two terms. The first term involves a ratio of probabilities $\pi_\theta$ and $\pi_{\theta_{old}}$, multiplied by $\hat{A}_{i,t}$. The second term involves clipping the same probability ratio between $1-\epsilon$ and $1+\epsilon$, and then multiplying by $\hat{A}_{i,t}$.
* **KL Divergence:** $-\beta D_{KL}[\pi_\theta || \pi_{ref}]$ - This term subtracts the KL divergence between $\pi_\theta$ and $\pi_{ref}$, scaled by a factor $\beta$.
* **(3)** - Equation number.
### Detailed Analysis
The equation can be broken down as follows:
1. **$\mathcal{J}_{GRPO}(\theta)$**: The objective function to be optimized.
2. **$\mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]$**: The expectation is taken over trajectories or samples drawn from a distribution $P(Q)$ and a policy $\pi_{\theta_{old}}$.
3. **$\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}$**: This is an average over $G$ trajectories, where each trajectory $i$ has length $|o_i|$. The inner sum averages over the time steps $t$ within each trajectory.
4. **$\frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}$**: This is the probability ratio between the current policy $\pi_\theta$ and the old policy $\pi_{\theta_{old}}$ for a given state-action pair $(o_{i,t}, q)$ at time $t$. $o_{i,<t}$ represents the history of observations up to time $t$.
5. **$\hat{A}_{i,t}$**: This represents the estimated advantage function for the state-action pair $(o_{i,t}, q)$ at time $t$.
6. **$\text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\epsilon, 1+\epsilon \right)$**: This clips the probability ratio to be within the range $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a hyperparameter.
7. **$-\beta D_{KL}[\pi_\theta || \pi_{ref}]$**: This is a KL divergence penalty that encourages the policy $\pi_\theta$ to stay close to a reference policy $\pi_{ref}$. $\beta$ is a hyperparameter that controls the strength of the penalty.
### Key Observations
* The objective function aims to maximize the expected return while keeping the policy close to the old policy and a reference policy.
* The clipping function limits the change in the policy, preventing large updates that could destabilize training.
* The KL divergence penalty further regularizes the policy updates.
### Interpretation
The equation represents the objective function for a policy optimization algorithm, likely a variant of Trust Region Policy Optimization (TRPO) or Proximal Policy Optimization (PPO). The goal is to find a policy $\pi_\theta$ that maximizes the expected reward while ensuring that the policy updates are not too large, thus promoting stable learning. The clipping and KL divergence terms serve as regularization techniques to prevent drastic changes in the policy during training. The GRPO objective function balances exploration and exploitation by encouraging the agent to explore new actions while staying close to its previous behavior and a reference policy.
</details>
Prompt: Convert the formula in the figure to latex format. Response:
```latex
\begin{equation*}
\begin{split}
\mathcal{J}_{\text{GRPO}}(\theta) &= \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)] \\
& \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right] - \beta \mathcal{D}_{KL} \left[ \pi_{\theta} \| \pi_{ref} \right] \right\},
\end{split}
\end{equation*}
```
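As a sanity check on the objective described above, the following is a minimal NumPy sketch of the per-token GRPO surrogate; the function name is my own, and the KL term uses the unbiased k3 estimator common in GRPO implementations, a detail not shown in the figure:

```python
import numpy as np

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """Clipped-ratio surrogate with a KL penalty, averaged over a group of G responses.

    logp_new / logp_old / logp_ref: lists of 1-D arrays, one per sampled response o_i,
    holding per-token log-probabilities; advantages has matching shapes.
    """
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = np.exp(lp_new - lp_old)                       # pi_theta / pi_theta_old
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
        # k3 estimator of KL(pi_theta || pi_ref), per token
        r_ref = np.exp(lp_ref - lp_new)
        kl = r_ref - np.log(r_ref) - 1.0
        total += np.mean(np.minimum(unclipped, clipped) - beta * kl)
    return total / len(logp_new)
```

With identical policies and unit advantages the ratio is 1 and the KL term vanishes, so the objective reduces to the mean advantage, which matches the equation term by term.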
<details>
<summary>figures/vis/vis-understanding-math-7.png Details</summary>

### Visual Description
## Diagram: Parallelogram with Height
### Overview
The image shows a parallelogram with its base labeled as 8 meters and its height labeled as 'x' meters. A dashed line indicates the height, forming a right angle with the base.
### Components/Axes
* **Shape:** Parallelogram, filled with a light blue color and outlined with a darker blue.
* **Base:** Labeled as "8 m" along the bottom edge of the parallelogram.
* **Height:** Indicated by a dashed magenta line, perpendicular to the base. Labeled as "x m". A small magenta square indicates the right angle.
### Detailed Analysis
* The base of the parallelogram is explicitly given as 8 meters.
* The height of the parallelogram is represented by the variable 'x' and is measured in meters.
* The dashed line representing the height is perpendicular to the base, forming a right angle.
### Key Observations
* The diagram provides the base length of the parallelogram and represents the height as a variable.
* The height is visually represented as the perpendicular distance from the base to the opposite side.
### Interpretation
The diagram illustrates a parallelogram and provides information about its dimensions, specifically the base and height. The height is represented as a variable 'x', suggesting that it is an unknown value that could be calculated or provided in a related problem. The diagram serves as a visual aid for understanding the geometric properties of a parallelogram and its area calculation (Area = base * height).
</details>
Prompt: Find $x$ . $A=148$ $m^{2}$ . Please think step by step. Response: To find the value of $x$ in the parallelogram, we can use the formula for the area of a parallelogram: $A=\text{base}\times\text{height}$ Given: - Area $A=148\,\text{m}^{2}$ - Base $=8\,\text{m}$ - Height $=x\,\text{m}$ Substitute the given values into the formula: $148=8\times x$ Solve for $x$ : $x=\frac{148}{8}$ $x=18.5\,\text{m}$ So, $x=18.5\,\text{m}$ .
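The one-step computation in the response can be written as a tiny Python check; the function name is hypothetical:

```python
def parallelogram_height(area, base):
    """Solve A = base * height for the height."""
    return area / base

x = parallelogram_height(148, 8)
print(x)  # 18.5
```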
<details>
<summary>figures/vis/vis-understanding-math-10.png Details</summary>

### Visual Description
## Table: Weight of Winning Dog at a Dog Show
### Overview
The image presents a table showing the weight of the winning dog at a dog show from 2013 to 2017. The table has two columns: "Year" and "Weight (kg)".
### Components/Axes
* **Title:** Weight of winning dog at a dog show
* **Column 1:** Year (2013, 2014, 2015, 2016, 2017)
* **Column 2:** Weight (kg)
### Detailed Analysis
The table provides the following data:
| Year | Weight (kg) |
|---|---|
| 2013 | 27 |
| 2014 | 28 |
| 2015 | 34 |
| 2016 | 29 |
| 2017 | 40 |
### Key Observations
The weight of the winning dog fluctuates over the years. There is a notable increase in weight from 2016 (29 kg) to 2017 (40 kg).
### Interpretation
The data suggests that the weight of the winning dog at the dog show varies from year to year. The increase in weight in 2017 could be due to various factors, such as a change in the breed of the winning dog or a general trend towards larger dogs being favored in the competition. The data is limited to five years, so it is difficult to draw any long-term conclusions.
</details>
Prompt: A dog show enthusiast recorded the weight of the winning dog at recent dog shows. According to the table, what was the rate of change between 2016 and 2017? (Unit: kilograms per year) Response: To determine the rate of change between 2016 and 2017, we use the formula for rate of change: $\text{Rate of Change}=\frac{\text{Weight in 2017}-\text{Weight in 2016}}{\text{Year 2017}-\text{Year 2016}}$ From the table: - Weight in 2016 = 29 kg - Weight in 2017 = 40 kg Substitute these values into the formula: $\text{Rate of Change}=\frac{40\,\text{kg}-29\,\text{kg}}{2017-2016}=\frac{11\,\text{kg}}{1\,\text{year}}=11\,\text{kg/year}$ Thus, the rate of change between 2016 and 2017 is $\boxed{11\,\text{kg/year}}$ .
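The same difference quotient can be checked in a few lines of Python; the data dictionary and function name are my own:

```python
# Winning-dog weights from the table (year -> kg)
weights = {2013: 27, 2014: 28, 2015: 34, 2016: 29, 2017: 40}

def rate_of_change(data, y0, y1):
    """(value at y1 - value at y0) / (y1 - y0)."""
    return (data[y1] - data[y0]) / (y1 - y0)

print(rate_of_change(weights, 2016, 2017))  # 11.0
```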