# NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
**Code**: https://github.com/OpenGVLab/NaViL
## Abstract
Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained visual encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraints. Through a careful study of various choices in MLLMs, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of native MLLMs and reveal the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond that, our findings and results provide in-depth insights for the future study of native MLLMs.

\* Equal contribution. 🖂 Corresponding to Jifeng Dai <daijifeng@tsinghua.edu.cn>. † Work was done when Changyao Tian, Hao Li, and Jie Shao were interns at Shanghai AI Laboratory.
## 1 Introduction
Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in computer vision InternVL-2.5 ; mono_internvl ; Qwen2vl ; gpt4v ; reid2024gemini1_5 , continuously breaking through the upper limits of various multimodal tasks mathew2021docvqa ; yu2023mmvet ; liu2023mmbench ; Datasets:ChartQA . The great success of MLLMs is inseparable from the compositional training paradigm, which independently pre-trains visual encoders openclip and LLMs touvron2023llama , and then integrates them through additional multimodal training. Due to its engineering simplicity and effectiveness, this paradigm has dominated the MLLM field over the past few years. However, the shortcomings of compositional training have recently been recognized by the community, e.g., its unclear multimodal scaling property diao2024EVE ; shukor2025scaling .
Therefore, increasing attention has been directed toward the development of more native MLLMs. As illustrated in Fig. 1, native MLLMs aim to jointly optimize both visual and language spaces in an end-to-end manner, thereby maximizing vision-language alignment. Compared to the compositional paradigm, existing native MLLM methods demonstrate a promising scaling law and a significantly simplified training process team2024chameleon ; shukor2025scaling . Despite these advancements, the primary benefits of native MLLMs are often evaluated under the assumption of infinite training resources, overlooking the substantial challenges posed by limited data and large-scale training. Consequently, a critical practical question remains: whether and how native MLLMs can feasibly achieve or even surpass the performance upper bound of top-tier MLLMs at an acceptable cost.
*(Figure panels: left, validation loss vs. training data size for the base model, w/ MoE, and w/ the optimally-sized visual encoder $\mathcal{V}^{*}_{d,w}(\cdot)$; middle, validation loss for 0.5B, 2B, and 7B models; right, a radar chart over ChartQA, MMVet, MMMU, OCRBench, MathVista, CCBench, TextVQA, InfoVQA, AI2D, GQA, and SQA-I comparing NaViL-2B against Emu3, Mono-InternVL, InternVL-2.5-2B, Chameleon-7B, and others.)*
Figure 1: Comparison of design choices, scaling properties, and performance of our native MLLMs. We systematically investigate the designs and the scaling properties of native MLLMs under data constraints and yield valuable findings for building native MLLMs. After adopting these findings, our native MLLMs achieve competitive performance with top-tier MLLMs. $\mathcal{V}^{*}_{d,w}(\cdot)$ denotes the visual encoder with optimal parameter size.
To answer this question, in this paper, we systematically investigate the designs and the scaling properties of native MLLMs under data constraints. Specifically, we first explore the choices of key components in the native architecture, including the mixture-of-experts, the visual encoder, and the initialization of the LLM. Our findings are twofold. First, an appropriate pre-training initialization of the LLM (e.g., from the base LLM) greatly benefits training convergence on multimodal data. Second, combining a visual encoder with MoEs yields clear gains over the vanilla decoder-only LLM. Following these findings, we build a meta architecture that optimally balances performance and training cost.
Based on the optimal meta architecture, we further explore the scaling properties of the visual encoder, the LLM, and the entire native MLLM. Specifically, we first scale up the LLM and the visual encoder independently and observe different scaling properties: while scaling the LLM exhibits patterns similar to conventional language scaling laws, scaling the visual encoder shows an upper bound in return due to the limits of the LLM's capacity, suggesting that the optimal encoder size varies with the LLM size. Further analysis reveals that the optimal encoder size increases approximately proportionally with the LLM size in log scale. This observation contrasts with the practice of the compositional paradigm, which employs a visual encoder of a single size across all LLM scales.
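The log-scale proportionality amounts to a power law between the LLM size and the optimal encoder size. The sketch below shows how such a relationship could be fit; the size pairs used here are hypothetical placeholders for illustration only, not measurements from the paper.

```python
import numpy as np

# Hypothetical (LLM size, optimal encoder size) pairs in billions of
# parameters -- placeholder values for illustration, NOT results from
# the paper. "Proportional in log scale" means a power law:
#   enc_size ≈ a * llm_size ** k  for some exponent k > 0.
llm_sizes = np.array([0.5, 2.0, 7.0])
opt_enc_sizes = np.array([0.15, 0.45, 1.2])

# A linear fit in log-log space recovers the exponent k and prefactor a.
k, log_a = np.polyfit(np.log(llm_sizes), np.log(opt_enc_sizes), 1)

def predict_encoder_size(llm_size):
    """Power-law prediction of the optimal encoder size."""
    return float(np.exp(log_a) * llm_size ** k)

print(k > 0)  # True: larger LLMs call for larger visual encoders
```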
Based on the above principles, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. To validate our approach, we conduct extensive experiments across diverse benchmarks to evaluate its multimodal capabilities, including image captioning chen2015cococaption ; Datasets:Flickr30k ; agrawal2019nocaps , optical character recognition (OCR) Datasets:TextVQA ; Datasets:DocVQA ; liu2023ocrbench , etc. Experimental results reveal that with ~600M pre-training image-text pairs, NaViL achieves competitive performance compared to current top-tier compositional MLLMs, highlighting its practicality and capability. In summary, our contributions are as follows:
- We systematically explore the design space and the optimal choices in native MLLMs under data constraints, including the LLM initialization, the visual encoder, and the MoEs, and draw three critical findings that greatly benefit the training of native MLLMs.
- Based on the above findings, we construct a novel native MLLM called NaViL. In NaViL, we explore the scaling properties of the visual encoder and the LLM and reveal their positively correlated scaling relationship.
- We conduct large-scale pre-training and fine-tuning experiments on NaViL. Experimental results show that NaViL can achieve top-tier performance with nearly 600M pre-training data. Our findings and results will encourage future work for native MLLMs in the community.
## 2 Related Work
Multimodal Large Language Models. Recent years have witnessed significant progress in Multimodal Large Language Models (MLLMs) llava-hr ; VLM:LLaVA ; VLM:LLaVA-1.5 ; Qwen2vl ; InternVL-2.5 , which have dominated various downstream tasks goyal2017vqav2 ; hudson2019gqa ; Datasets:TextVQA ; Datasets:AI2D . Starting from LLaVA VLM:LLaVA , most existing MLLMs adopt the compositional paradigm, which connects a pre-trained visual encoder VLP:CLIP and LLM qwen through a projector, and then fine-tunes the whole model on multimodal data for alignment. Based on this paradigm, existing works mainly focus on the improvement of visual encoders Qwen2vl ; wang2023internimage ; llava-hr and the design of connectors li2022blip ; VLM:LLaVA . Despite this progress, such a paradigm struggles to explore the joint scaling properties of vision and language. Its potential limitations in the training pipeline shukor2025scaling and vision-language alignment diao2024EVE have also been gradually recognized by the community.
Native Multimodal Large Language Models. To overcome the limitations of the compositional paradigm, native MLLMs have emerged as another candidate solution diao2025evev2 ; diao2024EVE ; mono_internvl ; lei2025sail ; vora ; shukor2025scaling ; team2024chameleon . Compared to the compositional paradigm, native MLLMs aim to pre-train both vision and language parameters in an end-to-end manner, thus achieving better alignment. The most representative methodology shukor2025scaling ; team2024chameleon is to directly pre-train the LLM from scratch on large-scale multimodal corpora, which typically requires expensive training costs. To address this issue, recent attempts initialize the LLM with a pre-trained checkpoint to facilitate training convergence diao2025evev2 ; diao2024EVE ; mono_internvl ; lei2025sail ; vora . Nevertheless, current research still lacks systematic investigation into the architectural design and scaling characteristics of native MLLMs, limiting their performance.
## 3 Design Principles for Native MLLMs
### 3.1 Problem Setup
We define native MLLMs as models that jointly optimize vision and language capabilities in an end-to-end manner. Despite recent progress showing a promising scaling law and potentially better performance compared with their compositional counterparts, how to build native MLLMs that are competitive with state-of-the-art MLLMs at a practical data scale remains underexplored. In particular, two problems require investigation:
- (Sec. 3.2) How to choose the optimal architectures of the visual and linguistic components?
- (Sec. 3.3) How to optimally scale up the visual and linguistic components?
Meta Architecture. To study these two questions, we first define a general meta architecture of native MLLMs consisting of a visual encoder, an LLM, and a mixture-of-experts architecture injected into the LLM. The visual encoder $\mathcal{V}$ consists of a series of transformer layers and can be defined as
$$
\mathcal{V}_{d,w}(I)=\mathcal{C}\odot\mathcal{F}_{d}^{w}\odot\cdots\odot\mathcal{F}_{2}^{w}\odot\mathcal{F}_{1}^{w}\odot\mathcal{P}(I)=\mathcal{C}\bigodot_{i=1...d}\mathcal{F}_{i}^{w}\odot\mathcal{P}(I), \tag{1}
$$
where $\mathcal{F}_{i}^{w}$ denotes the $i$-th transformer layer (out of $d$ layers) with hidden dimension $w$, $\mathcal{P}$ denotes the patch embedding layer, and $I\in\mathbb{R}^{H\times W\times 3}$ denotes the input image. Note that the visual encoder degenerates into a simple patch embedding layer when $d=0$. For simplicity, we use the same architecture as the LLM for the visual encoder layers $\mathcal{F}$, but with bi-directional attention, and vary the hyperparameters $d$ and $w$. Here $\mathcal{C}$ is the connector, which downsamples the encoded image embeddings through pixel shuffle VLM:InternVL and projects them into the LLM's feature space with an MLP.
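As a shape-level sketch of the connector $\mathcal{C}$: pixel shuffle merges each $r \times r$ neighbourhood of visual tokens into the channel dimension, cutting the token count by $r^2$, after which a projection maps the result into the LLM's feature space. The grid size, channel widths, and downsampling ratio below are illustrative assumptions, and the single random linear layer stands in for a trained MLP.

```python
import numpy as np

def pixel_shuffle_connector(x, r=2, llm_dim=2048, rng=None):
    """Connector C from Eq. (1), sketched at the shape level: merge each
    r x r neighbourhood of visual tokens into channels (tokens / r^2,
    channels * r^2), then linearly project to the LLM hidden size.
    The random linear layer is a stand-in for a trained MLP."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0
    # Pixel shuffle: (h, w, c) -> (h/r, w/r, c*r*r) by regrouping patches.
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, c * r * r)
    rng = np.random.default_rng(0) if rng is None else rng
    W = rng.standard_normal((c * r * r, llm_dim)) / np.sqrt(c * r * r)
    return x @ W

tokens = np.zeros((16, 16, 1024))   # a 16x16 grid of visual tokens
out = pixel_shuffle_connector(tokens)
print(out.shape)  # (64, 2048): 4x fewer tokens, projected to LLM width
```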
Experiment Settings. All models are trained on web-scale, noisy image-caption pairs Datasets:Laion-5b with next-token prediction (NTP) and an image captioning task. We use a held-out subset of the multimodal dataset to compute the validation teacher-forcing loss for measuring and comparing different design choices. Models with LLM initialization are initialized from InternLM2-Base cai2024internlm2 .
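A teacher-forcing validation loss of this kind is the mean next-token cross-entropy over the supervised caption tokens, with image and prompt positions masked out. The following is a minimal sketch of such a computation, not the paper's actual evaluation code; shapes and the masking convention are assumptions.

```python
import numpy as np

def teacher_forcing_val_loss(logits, targets, loss_mask):
    """Mean next-token cross-entropy over supervised (caption) tokens.
    logits: (T, V) scores where logits[t] predicts targets[t];
    loss_mask: (T,) with 1.0 on caption tokens and 0.0 on image/prompt
    positions that receive no supervision."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * loss_mask).sum() / loss_mask.sum())

# Uniform logits over a 4-token vocabulary give a loss of ln(4).
logits = np.zeros((5, 4))
targets = np.array([0, 1, 2, 3, 0])
mask = np.array([0.0, 1.0, 1.0, 1.0, 1.0])  # first position unsupervised
print(teacher_forcing_val_loss(logits, targets, mask))  # ≈ 1.3863
```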
### 3.2 Exploring the Optimal Design of Architecture Components
In this section, we explore the design choices of three key components: 1) the initialization of the LLM; 2) the effectiveness of MoEs; 3) the optimal architecture of the visual encoder.
#### 3.2.1 Initialization of LLM
A straightforward way to construct native MLLMs is to train all modalities from scratch with mixed corpora, as shown in prior work shukor2025scaling . While this approach theoretically offers the highest performance ceiling given ample data and computational resources, practical limitations such as data scarcity and large-scale optimization challenges hinder its feasibility. Alternatively, initializing the model from a pre-trained LLM effectively leverages linguistic prior knowledge, significantly reducing data and computational demands.
Figure 2: Effectiveness of LLM initialization. Left: The validation loss. The LLM initialized one converges much faster. Right: The zero-shot caption performance. Due to the lack of textual knowledge, the uninitialized model continues to lag behind.
To evaluate the effectiveness of LLM initialization, we compare model performance in terms of loss and image captioning. As shown in Fig. 2 (left), the model trained from scratch performs significantly worse than the initialized model, requiring over 10x more data to reach comparable loss.
Further analysis of zero-shot image captioning (Fig. 2 (right)) reveals a substantial performance gap favoring the initialized model, even with significantly more data for the non-initialized model. This is likely due to the lower textual quality and diversity of multimodal training data compared to the LLM pre-training corpus, limiting the textual capability of models trained from scratch. These findings highlight the practical advantage of using LLM initialization in multimodal pre-training.
Observation 1:
Initializing from pre-trained LLM greatly benefits the convergence on multimodal data, and in most cases delivers better performance even with a large amount of multimodal data.
#### 3.2.2 Effectiveness of MoEs
Figure 3: The validation loss with and without MoE. With the MoE extension, the loss decreases more quickly.
Mixture-of-Experts (MoEs) are effective for handling heterogeneous data and are widely used in native MLLMs. We evaluate the MoE architecture within our meta architecture by comparing two configurations: one with a visual encoder and a vanilla LLM, and another with a visual encoder and an MoE-extended LLM. We follow Mono-InternVL mono_internvl to adopt modality-specific MoEs and training settings. However, we empirically found that using only the feed-forward network (FFN) experts leads to a significant difference in feature scale between the visual and language modalities. To mitigate this issue, we further introduce modality-specific attention experts, i.e., different projection layers (query, key, value, and output) in the self-attention layer for visual and textual features respectively, followed by a unified global attention computation. Specifically, the output $x_{i,m}^{l}\in\mathbb{R}^{d}$ of the $i$-th token with modality $m\in\{\text{visual},\text{linguistic}\}$ at the $l$-th layer of the MoE-extended LLM can be defined as
$$
\begin{aligned}
x_{i,m}^{l^{\prime}} &= x_{i,m}^{l-1}+\text{MHA-MMoE}\left(\text{RMSNorm}(x_{i,m}^{l-1})\right), \\
x_{i,m}^{l} &= x_{i,m}^{l^{\prime}}+\text{FFN-MMoE}\left(\text{RMSNorm}(x_{i,m}^{l^{\prime}})\right),
\end{aligned} \tag{2}
$$
where $\text{RMSNorm}(\cdot)$ is the root-mean-square normalization operation, and $\text{MHA-MMoE}(\cdot)$ and $\text{FFN-MMoE}(\cdot)$ are the modality-specific attention and FFN experts, respectively, formulated as
$$
\begin{aligned}
\text{MHA-MMoE}(x_{i,m}) &= \left(\text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V\right)W_{O}^{m}, \\
Q_{i,m} &= x_{i,m}W_{Q}^{m},\quad K_{i,m}=x_{i,m}W_{K}^{m},\quad V_{i,m}=x_{i,m}W_{V}^{m}, \\
\text{FFN-MMoE}(x_{i,m}) &= \left(\text{SiLU}(x_{i,m}W_{\text{gate}}^{m})\odot x_{i,m}W_{\text{up}}^{m}\right)W_{\text{down}}^{m}.
\end{aligned} \tag{3}
$$
Here $W_{Q}^{m},W_{K}^{m},W_{V}^{m},W_{O}^{m}$ and $W_{\text{gate}}^{m},W_{\text{up}}^{m},W_{\text{down}}^{m}$ are all modality-specific projection matrices, and $\text{SiLU}(\cdot)$ denotes the activation function, $\odot$ denotes the element-wise product operation. The number of activated experts is set to one to maintain consistent inference costs.
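The FFN expert in Eq. (3) can be sketched as follows: each token is routed to the SwiGLU FFN whose weights match its modality, so exactly one expert is active per token. This is an illustrative reimplementation with toy dimensions and random weights, not the paper's code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn_mmoe(x, modality, experts):
    """Modality-routed SwiGLU FFN (cf. FFN-MMoE in Eq. 3): every token is
    processed by the (W_gate, W_up, W_down) triple of its own modality,
    so one expert is active per token. Shapes here are toy values."""
    out = np.empty_like(x)
    for m in ("visual", "linguistic"):
        idx = modality == m
        W_gate, W_up, W_down = experts[m]
        out[idx] = (silu(x[idx] @ W_gate) * (x[idx] @ W_up)) @ W_down
    return out

rng = np.random.default_rng(0)
d, d_ff = 8, 16
experts = {m: (rng.standard_normal((d, d_ff)),
               rng.standard_normal((d, d_ff)),
               rng.standard_normal((d_ff, d)))
           for m in ("visual", "linguistic")}
x = rng.standard_normal((6, d))
modality = np.array(["visual"] * 3 + ["linguistic"] * 3)
y = ffn_mmoe(x, modality, experts)
print(y.shape)  # (6, 8)
```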
As shown in Fig. 3, the MoE architecture significantly accelerates model convergence compared to the vanilla LLM, achieving the same validation loss with only 1/10 of the data without increasing training or inference cost. This demonstrates that MoE enhances model capacity and effectively handles heterogeneous data, making it suitable for native MLLMs.
Observation 2:
MoEs significantly improve model performance without increasing the number of activated parameters.
#### 3.2.3 Optimizing the Visual Encoder Architecture
*(Figure panels: validation loss and zero-shot CIDEr on COCO, Flickr, and NoCaps captioning for visual-encoder depth/width configurations ranging from $d=3, w=4096$ to $d=48, w=1024$, trained with 15M to 120M samples.)*
Figure 4: The validation loss and zero-shot caption performance of different visual encoders. The loss and performance only differ when the visual encoder is extremely wide or shallow.
The visual encoder precedes the LLM to perform preliminary extraction of visual information, converting raw pixels into semantic visual features aligned with the textual embedding space. Due to its bidirectional attention mechanism and the increased capacity introduced by additional parameters, the visual encoder has the potential to enhance the model’s ability to represent visual information.
In this section, we investigate the optimal architecture of the visual encoder under a given parameter budget. The total parameter count $\mathcal{N}$ can be approximated openai2020scaling as $\mathcal{N}=12\times d\times w^{2}$ . Given a fixed $\mathcal{N}$ , the structure of the visual encoder is mainly determined by its width $w$ and depth $d$ .
Depth ( $d$ ): Typically, deeper models can capture richer and more complex features, while also being more prone to gradient vanishing problems tan2019efficientnet . When it comes to MLLM, a visual encoder that is too shallow may not be able to extract enough high-level semantics, while a visual encoder that is too deep may cause low-level features to be lost, thus limiting the capture of fine-grained details.
Width ( $w$ ): Compared to depth, width has relatively little impact on vision transformer performance dosovitskiy2020image , as long as it does not introduce an information bottleneck, i.e., it should not be lower than the total number of channels within a single image patch. Under this premise, the width of the visual encoder does not have to match the hidden size of the LLM.
We train various MLLMs with different $\mathcal{V}_{d,w}$ configurations (combinations of depth and width), keeping the pre-trained LLM fixed and the visual encoder parameter count at 600M. The depth $d$ ranges over $\{3,6,12,24,48\}$ , with the width $w$ set accordingly to $\{4096,2880,2048,1472,1024\}$ to maintain a consistent parameter count. Fig. 4 shows the validation loss for different depth and width combinations as the training data size varies. Models with extremely high or low depths perform worse than those with moderate configurations. Among reasonably configured models, shallower ones converge faster in the early phase (less than 30M samples), but this advantage diminishes with more data. On zero-shot image captioning benchmarks, deeper visual encoders show slightly better performance. This is consistent with prior research on compute-optimal LLM architectures openai2020scaling , which finds a wide range of near-optimal width and depth combinations.
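The depth/width pairs above follow directly from the parameter-budget formula $\mathcal{N}=12\times d\times w^{2}$. A minimal sketch of the computation (rounding widths to the nearest multiple of 64 is our assumption, chosen because it reproduces the values used in the sweep):

```python
import math

# For a ViT-style encoder, N ≈ 12 * d * w^2, so for a fixed budget N
# the width is w = sqrt(N / (12 * d)).
def width_for_budget(n_params: float, depth: int, multiple: int = 64) -> int:
    w = math.sqrt(n_params / (12 * depth))
    # Round to a hardware-friendly multiple (our assumption, not from the paper).
    return round(w / multiple) * multiple

budget = 600e6  # 600M-parameter visual encoder
for d in [3, 6, 12, 24, 48]:
    print(d, width_for_budget(budget, d))
```

Running this with the 600M budget recovers exactly the widths {4096, 2880, 2048, 1472, 1024} listed for depths {3, 6, 12, 24, 48}.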
Observation 3:
Visual encoders achieve near-optimal performance across a wide range of depth and width configurations. Shallower encoders converge faster in early training, while deeper encoders perform slightly better with larger datasets.
### 3.3 Scaling Up Native MLLMs
In this section, we consider the scaling properties of our meta architecture. Specifically, we investigate: 1) the impact of scaling up the visual encoder and the LLM independently; 2) the optimal way of scaling the visual encoder and the LLM simultaneously. All models follow the optimal architecture discovered in Sec. 3.2, i.e., with LLM initialization, MoEs, and optimal depth-to-width ratios of the visual encoders.
#### 3.3.1 Scaling up Visual Encoder and LLM Independently
We first investigate the scaling properties of the visual encoder and the LLM independently, i.e., scaling up one component while keeping the other fixed. Specifically, we evaluate a series of LLMs with parameter sizes $\{0.5B,1.8B,7B\}$ and visual encoders with sizes $\{75M,150M,300M,600M,1.2B,2.4B\}$ .
<details>
<summary>x5.png Details</summary>

### Visual Description
Line chart "Validation Loss vs LLM Size" for a fixed 600M visual encoder. X-axis: LLM Size (B), with markers at 0.5, 2, and 7; Y-axis: Validation Loss, ranging from about 0.6 to 1.2. A single solid purple line with square markers slopes downward: roughly 1.04 at 0.5B, 0.81 at 2B, and 0.68 at 7B (each ± 0.02). The loss drops sharply from 0.5B to 2B and more slowly from 2B to 7B, i.e., it decreases approximately linearly in the logarithm of the LLM size.
</details>
Figure 5: The validation loss when scaling up LLMs. With the same visual encoder (i.e. 600M), the validation loss decreases log-linearly with the LLM size.
Scaling up LLMs. The results are shown in Fig. 5. Scaling up the LLM parameters in native MLLMs exhibits a pattern consistent with the conventional LLM scaling law: the validation loss decreases approximately linearly as the parameter count increases exponentially (i.e., log-linearly).
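This log-linear trend can be sketched with an ordinary least-squares fit of $L(N)=a+b\ln N$. The three (LLM size, loss) pairs below are approximate values read off Fig. 5, not exact measurements:

```python
import math

# Approximate (LLM size in billions, validation loss) pairs from Fig. 5,
# all with a fixed 600M visual encoder.
sizes = [0.5, 2.0, 7.0]
losses = [1.04, 0.81, 0.68]

# Ordinary least-squares fit of L(N) = a + b * ln(N).
xs = [math.log(n) for n in sizes]
n = len(xs)
mx, my = sum(xs) / n, sum(losses) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, losses)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
print(f"L(N) ≈ {a:.3f} + {b:.3f} * ln(N)")  # b < 0: loss falls log-linearly
```

The negative slope `b` quantifies how quickly the loss decreases per e-fold increase in LLM size; with only three points it should be read as a trend rather than a precise law.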
Scaling up Visual Encoder. The results are shown in Fig. 6. In contrast to the LLM scaling law, increasing the visual encoder size does not consistently enhance multimodal performance. Instead, with a fixed LLM, the performance gains achieved by enlarging the visual encoder diminish progressively. Beyond a certain encoder size, further scaling results in only marginal loss reduction, indicating that the performance upper limit of the MLLM is constrained by the LLM’s capacity.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Validation Loss vs. Visual Encoder Size for Different LLM Sizes
### Overview
Three line charts plot validation loss (log scale) against visual encoder size, one chart per LLM size (0.5B, 1.8B, 7B). Within each chart, lines correspond to training data sizes of 15M, 30M, 60M, and 120M samples.
### Components/Axes
* X-axis (visual encoder size, M): 75–600 for LLM-0.5B, 150–1200 for LLM-1.8B, 300–2400 for LLM-7B.
* Y-axis (validation loss, log scale): 1.3–2.0, 0.7–1.4, and 0.6–1.1, respectively.
* Legend: 15M (light blue), 30M (gray), 60M (dark gray), 120M (black).
### Key Observations
* Larger LLMs reach lower validation loss overall, and the 120M data size is consistently best across all LLM sizes.
* Enlarging the visual encoder lowers the loss, but the curves flatten: for LLM-0.5B the 60M and 120M lines are nearly flat, while for LLM-1.8B and LLM-7B the gains also shrink at the largest encoder sizes.
### Interpretation
The benefit of a larger visual encoder diminishes once the encoder passes a size that depends on the LLM: beyond that point, further scaling yields only marginal loss reduction, and the gap between curves narrows toward zero as the training data grows.
</details>
Figure 6: The validation loss curves of different LLMs with different training data sizes. As the training data size increases, the loss gap narrows to near zero when the visual encoder size reaches a certain threshold.
Observation 4:
Scaling the LLM consistently improves multimodal performance, following the typical LLM scaling law. However, increasing the visual encoder size shows diminishing returns, suggesting that the MLLM’s performance is limited by the LLM’s capacity.
#### 3.3.2 Scaling up Visual Encoder and LLM Together
<details>
<summary>x7.png Details</summary>

### Visual Description
## Chart: Visual Encoder Size vs LLM Size
### Overview
A line chart relating the optimal visual encoder size to the LLM size, both in billions and both on logarithmic axes. X-axis markers: 0.5, 2, and 7 (LLM size); Y-axis markers: 0.30, 0.60, and 1.20 (visual encoder size).
### Detailed Analysis
The gray line rises monotonically: the optimal encoder size is roughly 0.30B for a 0.5B LLM, 0.60B for a 2B LLM, and 1.20B for a 7B LLM. On the log-log axes these points fall close to a straight line.
### Interpretation
Larger LLMs call for proportionally larger visual encoders: the logarithm of the optimal encoder size grows linearly with the logarithm of the LLM size. The data points are sparse, so the exact exponent should be read as a trend rather than a precise law.
</details>
Figure 7: Relationship of visual encoder size and LLM size. The optimal visual encoder size increases log-linearly with the LLM size.
The diminishing returns from increasing the visual encoder size suggest the existence of an optimal encoder size for a given LLM. We define this optimal size as the smallest encoder whose loss difference compared to an encoder of twice its size is less than $\lambda=1\%$ of the loss of the 75M encoder (the smallest used in our experiments). Fig. 7 shows the relationship between visual encoder size and LLM size.
The logarithm of the optimal visual encoder size scales linearly with the logarithm of the LLM size, indicating that both components should be scaled jointly for balanced performance. This highlights the suboptimality of compositional MLLMs, which typically use a fixed visual encoder size across varying LLM scales.
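The selection rule can be sketched as follows; the loss values in `example` are illustrative placeholders, not measured numbers from the paper:

```python
# Selection rule from the text: the optimal encoder is the smallest size S
# such that loss(S) - loss(2*S) < lambda * loss(S_min), with lambda = 1%
# and S_min = 75M (the smallest encoder in the experiments).
LAMBDA = 0.01

def optimal_encoder_size(losses: dict) -> int:
    """losses maps encoder size (in millions) -> validation loss."""
    s_min = min(losses)
    threshold = LAMBDA * losses[s_min]
    for size in sorted(losses):
        if 2 * size in losses and losses[size] - losses[2 * size] < threshold:
            return size
    return max(losses)  # no size satisfied the rule; fall back to the largest

# Illustrative loss curve (hypothetical values):
example = {75: 1.00, 150: 0.95, 300: 0.92, 600: 0.913, 1200: 0.912}
print(optimal_encoder_size(example))  # 300: doubling to 600 gains < 1% of loss(75)
```

With the placeholder curve, doubling from 300M to 600M improves the loss by only 0.007, below the 0.01 threshold, so 300M is selected.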
Observation 5:
The optimal size of the visual encoder scales proportionally with the LLM size in log scale, indicating that both components should be scaled jointly. This further implies that using a single pre-trained visual encoder across a wide range of LLM scales, as existing compositional MLLMs do, is suboptimal.
## 4 NaViL: A Novel Native MLLM with Strong Capabilities
### 4.1 Architecture
<details>
<summary>x8.png Details</summary>

### Visual Description
## Diagram: NaViL Architecture
### Overview
The diagram shows the NaViL architecture: a visual encoder feeding a Mixture-of-Experts-extended LLM, with text and image tokens processed in a single multimodal sequence.
### Components
* **Visual Encoder (right):** a Patch Embed layer followed by stacked Transformer layers ("Transformer Layer 1" … "Transformer Layer d"), whose output passes through an MLP Connector and enters the LLM as image tokens (green). "Visual Multi-scale Packing" feeds the encoder copies of the image at multiple scales.
* **Multimodal Large Language Model (center):** N repeated blocks of RMSNorm → MHA-MMoE (multi-head attention with multimodal MoE) → RMSNorm → FFN-MMoE, operating jointly on text tokens (blue) and image tokens (green).
* **Example annotations:** the prompt "Please provide a more detailed description of the cat in the picture." and the response "The cat wears a yellow flower on its head, a golden necklace around its neck, and pink blushes on its cheeks."
### Interpretation
Text and image tokens share the same transformer stack, with modality-specific MoE experts in the attention and feed-forward layers; the visual encoder tokenizes images of arbitrary resolution so the whole model can be trained end-to-end.
</details>
Figure 8: Architecture of NaViL. As a native MoE-extended MLLM, NaViL can be trained end-to-end and supports input images of any resolution.
Based on the above studies, we construct NaViL with the optimal settings identified in Sec. 3.1. The architecture is shown in Fig. 8. NaViL inherently supports input images of any resolution. Images are first encoded into visual tokens by the visual encoder and the MLP projector, then concatenated with the textual tokens to form the multimodal token sequence fed into the LLM. Special tokens <begin_of_image> and <end_of_image> are inserted before and after each image token subsequence to mark the beginning and end of the image, and a special token <end_of_line> is inserted at the end of each row of image tokens to encode the corresponding spatial position information.
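The special-token layout described above can be sketched as follows; the token names match the text, while the helper function and the 2×3 token grid are illustrative assumptions:

```python
# Sketch of how the special tokens could delimit one image inside the
# multimodal sequence: <begin_of_image> ... <end_of_line> per row ... <end_of_image>.
def wrap_image_tokens(grid):
    seq = ["<begin_of_image>"]
    for row in grid:
        seq.extend(row)               # visual tokens of one row
        seq.append("<end_of_line>")   # marks the end of this row of image tokens
    seq.append("<end_of_image>")
    return seq

# A hypothetical 2x3 grid of visual tokens:
grid = [[f"v{r}{c}" for c in range(3)] for r in range(2)]
print(wrap_image_tokens(grid))
```

The `<end_of_line>` markers let the LLM recover the 2D layout of the patch grid from an otherwise flat 1D token sequence.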
Visual Multi-scale Packing is further introduced to improve model performance during inference. Specifically, given an input image $I_{0}\in\mathbb{R}^{H_{0}\times W_{0}\times 3}$ and a downsampling rate $\tau$ , a multi-scale image sequence $\{I_{i}\in\mathbb{R}^{H_{i}\times W_{i}\times 3}\}_{i=0}^{n}$ is obtained by repeatedly downsampling the original image (i.e., $H_{i}=\tau^{i}H_{0},W_{i}=\tau^{i}W_{0}$ ) until its area falls below a given threshold. The images in the sequence are processed separately by the visual encoder, and the resulting visual token embeddings $\{{x_{i,v}}\}_{i=0}^{n}$ are concatenated and fed to the LLM. A special token <end_of_scale> is inserted after the tokens of each scale to mark the boundary between scales.
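The scale pyramid above can be sketched as follows, assuming (as we do here, without confirmation from the paper) that only scales at or above the area threshold are kept:

```python
import math

# Sketch of the multi-scale image pyramid: starting from (H0, W0), each scale
# shrinks height and width by tau until the image area falls below min_area.
def multiscale_sizes(h0, w0, tau, min_area):
    sizes = []
    h, w = float(h0), float(w0)
    while h * w >= min_area:
        sizes.append((round(h), round(w)))
        h, w = tau * h, tau * w
    return sizes

# With tau = sqrt(2)/2 (the value used by NaViL), each step halves the area.
# The 1024x768 input and 256x256 threshold are illustrative choices.
print(multiscale_sizes(1024, 768, math.sqrt(2) / 2, 256 * 256))
```

With $\tau=\sqrt{2}/2$, each scale has exactly half the area of the previous one, so the total number of extra visual tokens added by packing is bounded by roughly the token count of the full-resolution image.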
### 4.2 Training
Stage 1: Multi-modal Generative Pre-training. In this stage, the model is initially trained on 500 million image-text pairs to develop comprehensive multimodal representations. Of these training samples, 300 million are directly sampled from web-scale datasets (i.e. Laion-2B Datasets:Laion-5b , Coyo-700M kakaobrain2022coyo-700m , Wukong gu2022wukong and SA-1B TransF:SAM ) while the remaining 200 million consist of images from these datasets paired with captions synthesized by existing MLLMs (i.e. InternVL-8B VLM:InternVL ). During this process, the textual parameters of the model remain frozen, with only the newly-added vision-specific parameters (i.e., the visual encoder, MLP projector, and MoE visual experts) being trainable.
To enhance the alignment between visual and textual features in more complex multimodal contexts, the model is subsequently trained on 185 million high-quality samples consisting of both multimodal alignment data and pure language data. In this phase, the textual parameters within the self-attention layers are also unfrozen, enabling more refined cross-modal integration.
Stage 2: Supervised Fine-tuning. Following common practice in developing MLLMs, an additional supervised fine-tuning stage is adopted. In this stage, all parameters are unfrozen and trained on a smaller (68 million samples) but higher-quality multimodal dataset.
## 5 Experiment
### 5.1 Experimental Setups
Table 1: Comparison with existing MLLMs on general MLLM benchmarks. “#A-Param” denotes the number of activated parameters. † InternVL-2.5-2B adopts the same LLM and high-quality data with NaViL, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from another 6B large encoder. Bold and underline indicate the best and the second-best performance among native MLLMs, respectively. * denotes our reproduced results. For MME, we sum the perception and cognition scores. Average scores are computed by normalizing each metric to a range between 0 and 100.
| Model | #A-Param | Avg | MMVet | MMMU | MMB | MME | MathVista | OCRBench | CCB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Compositional MLLMs: | | | | | | | | | |
| MobileVLM-V2-1.7B chu2024mobilevlm | 1.7B | $-$ | $-$ | $-$ | 57.7 | $-$ | $-$ | $-$ | $-$ |
| MobileVLM-V2-3B chu2024mobilevlm | 3.0B | $-$ | $-$ | $-$ | 63.2 | $-$ | $-$ | $-$ | $-$ |
| Mini-Gemini-2B VLM:MiniGemini | 3.5B | $-$ | 31.1 | 31.7 | 59.8 | 1653 | 29.4 | $-$ | $-$ |
| MM1-3B-MoE-Chat VLM:MM1 | 3.5B | $-$ | 42.2 | 38.6 | 70.8 | 1772 | 32.6 | $-$ | $-$ |
| DeepSeek-VL-1.3B lu2024deepseekvl | 2.0B | 42.3 | 34.8 | 32.2 | 64.6 | 1532 | 31.1 | 409 | 37.6 |
| PaliGemma-3B beyer2024paligemma | 2.9B | 45.6 | 33.1 | 34.9 | 71.0 | 1686 | 28.7 | 614 | 29.6 |
| MiniCPM-V-2 yao2024minicpm | 2.8B | 51.1 | 41.0 | 38.2 | 69.1 | 1809 | 38.7 | 605 | 45.3 |
| InternVL-1.5-2B VLM:InternVL-1.5 | 2.2B | 54.7 | 39.3 | 34.6 | 70.9 | 1902 | 41.1 | 654 | 63.5 |
| Qwen2VL-2B Qwen2vl | 2.1B | 58.6 | 49.5 | 41.1 | 74.9 | 1872 | 43.0 | 809 | 53.7 |
| † InternVL-2.5-2B chen2024expanding | 2.2B | 67.0 | 60.8 | 43.6 | 74.7 | 2138 | 51.3 | 804 | 81.7 |
| Native MLLMs: | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | 21.4 | $-$ | 10.7 | $-$ | $-$ | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | $-$ | $-$ | 1260 | 34.4 | $-$ | $-$ |
| Chameleon-7B The performance of Chameleon-7B is from mono_internvl . team2024chameleon | 7B | 13.9 | 8.3 | 25.4 | 31.1 | 170 | 22.3 | 7 | 3.5 |
| EVE-7B diao2024EVE | 7B | 33.0 | 25.6 | 32.3 | 49.5 | 1483 | 25.2 | 327 | 12.4 |
| EVE-7B (HD) diao2024EVE | 7B | 37.0 | 25.7 | 32.6 | 52.3 | 1628 | 34.2 | 398 | 16.3 |
| Emu3 emu3 | 8B | $-$ | 37.2 | 31.6 | 58.5 | $-$ | $-$ | 687 | $-$ |
| VoRA vora | 7B | $-$ | 33.7 | 32.2 | 64.2 | 1674 | $-$ | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 33.7 | 32.0 | 61.3 | 1655 | $-$ | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 53.2 | 45.0 | 39.3 | 66.3 | 1709 | 60.0 * | 702 | 30.8* |
| SAIL lei2025sail | 7B | 53.7 | 46.3 | 38.6* | 70.1 | 1719 | 57.0 | 783 | 24.3* |
| Mono-InternVL mono_internvl | 1.8B | 56.4 | 40.1 | 33.7 | 65.5 | 1875 | 45.7 | 767 | 66.3 |
| NaViL-2B (ours) | 2.4B | 67.1 | 78.3 | 41.8 | 71.2 | 1822 | 50.0 | 796 | 83.9 |
Evaluation Benchmarks. We evaluate NaViL and existing MLLMs on a broad range of multimodal benchmarks. Specifically, MLLM benchmarks encompass MMVet Datasets:MM-vet , MMMU val Datasets:MMMU , MMBench-EN test Datasets:MMBench , MME Datasets:MME , MathVista MINI Datasets:Mathvista , OCRBench liu2023ocrbench , and CCBench Datasets:MMBench . Visual question answering benchmarks include TextVQA val Datasets:TextVQA , ScienceQA-IMG test Datasets:ScienceQA , GQA test dev Datasets:GQA , DocVQA test mathew2021docvqa , AI2D test Datasets:AI2D , ChartQA test Datasets:ChartQA , and InfographicVQA test mathew2022infographicvqa . These benchmarks cover various domains, such as optical character recognition (OCR), chart and document understanding, multi-image understanding, real-world comprehension, etc.
Implementation Details. By default, NaViL-2B is built upon InternLM2-1.8B 2023internlm , using its weights to initialize the text-side parameters; the text tokenizer and conversation format are also inherited. The total parameter count is 4.2B, of which 2.4B are activated (including the 0.6B visual encoder). Input images are first padded so that their height and width are multiples of 32, and the stride of the Patch Embedding layer is set to 16. The visual encoder adopts bidirectional attention and 2D-RoPE to capture global spatial relationships, while the LLM adopts causal attention and 1D-RoPE to better inherit its capabilities. In the pre-training phase, the global batch size is 7000 for stage 1 and 4614 for stage 2, respectively. The downsampling rate $\tau$ of visual multi-scale packing is set to $\sqrt{2}/{2}$ . To demonstrate the scaling capability of our approach, we also train NaViL-9B based on Qwen3-8B TransF:Qwen3 . More details are given in the appendix.
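A back-of-the-envelope sketch of the resulting base-scale token grid, inferred from the padding-to-32 and stride-16 numbers stated above (the formula is our inference, not code from the paper):

```python
import math

# With padding to multiples of 32 and a patch-embedding stride of 16, an
# H x W image yields a (H_pad/16) x (W_pad/16) grid of visual tokens at the
# base scale (before <end_of_line> and other special tokens are added).
def base_token_grid(h, w, pad_to=32, stride=16):
    h_pad = math.ceil(h / pad_to) * pad_to
    w_pad = math.ceil(w / pad_to) * pad_to
    return h_pad // stride, w_pad // stride

print(base_token_grid(500, 700))  # a 500x700 image pads to 512x704
```

Because `pad_to` is a multiple of `stride`, the padded dimensions always divide evenly into patches, so the grid dimensions are exact integers.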
### 5.2 Main Results
Table 2: Comparison with existing MLLMs on visual question answering benchmarks. † InternVL-2.5-2B adopts the same LLM and high-quality data with NaViL, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from another 6B large encoder. * denotes our reproduced results. Bold and underline indicate the best and the second-best performance among native MLLMs, respectively.
| Model Compositional MLLMs: MobileVLM-V2-3B chu2024mobilevlm | #A-Param 3.0B | Avg $-$ | TextVQA 57.5 | SQA-I 70.0 | GQA 66.1 | DocVQA $-$ | AI2D $-$ | ChartQA $-$ | InfoVQA $-$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mini-Gemini-2B VLM:MiniGemini | 3.5B | $-$ | 56.2 | $-$ | $-$ | 34.2 | $-$ | $-$ | $-$ |
| MM1-3B-MoE-Chat VLM:MM1 | 3.5B | $-$ | 72.9 | 76.1 | $-$ | $-$ | $-$ | $-$ | $-$ |
| DeepSeek-VL-1.3B lu2024deepseekvl | 2.0B | $-$ | 57.8 | $-$ | $-$ | $-$ | 51.5 | $-$ | $-$ |
| PaliGemma-3B beyer2024paligemma | 2.9B | $-$ | 68.1 | $-$ | $-$ | $-$ | 68.3 | $-$ | $-$ |
| MiniCPM-V-2 yao2024minicpm | 2.8B | $-$ | 74.1 | $-$ | $-$ | 71.9 | 62.9 | $-$ | $-$ |
| InternVL-1.5-2B VLM:InternVL-1.5 | 2.2B | 71.7 | 70.5 | 84.9 | 61.6 | 85.0 | 69.8 | 74.8 | 55.4 |
| Qwen2VL-2B Qwen2vl | 2.1B | 73.1 | 79.7 | 78.2* | 60.3* | 90.1 | 74.7 | 73.5 | 65.5 |
| † InternVL-2.5-2B chen2024expanding | 2.2B | 76.5 | 74.3 | 96.2 | 61.2 | 88.7 | 74.9 | 79.2 | 60.9 |
| Native MLLMs: | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | $-$ | $-$ | $-$ | $-$ | 64.5 | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | 73.3 | $-$ | $-$ | 61.4 | $-$ | $-$ |
| Chameleon-7B The performance of Chameleon-7B is from mono_internvl . team2024chameleon | 7B | 17.9 | 4.8 | 47.2 | $-$ | 1.5 | 46.0 | 2.9 | 5.0 |
| EVE-7B diao2024EVE | 7B | 40.8 | 51.9 | 63.0 | 60.8 | 22.0 | 48.5 | 19.5 | 20.0 |
| EVE-7B (HD) diao2024EVE | 7B | 54.6 | 56.8 | 64.9 | 62.6 | 53.0 | 61.0 | 59.1 | 25.0 |
| Emu3 emu3 | 8B | 67.6 | 64.7 | 89.2 | 60.3 | 76.3 | 70.0 | 68.6 | 43.8 |
| VoRA vora | 7B | $-$ | 56.3 | 75.9 | $-$ | $-$ | 65.6 | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 58.7 | 72.0 | $-$ | $-$ | 61.1 | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 71.7 | 71.1 | 96.2 | 62.9 | 77.4* | 74.8 | 73.9 | 45.8* |
| SAIL lei2025sail | 7B | 71.5 | 77.1 | 93.3 | 58.0* | 78.4* | 76.7 | 69.7* | 47.3* |
| Mono-InternVL mono_internvl | 1.8B | 70.1 | 72.6 | 93.6 | 59.5 | 80.0 | 68.6 | 73.7 | 43.0 |
| NaViL-2B (ours) | 2.4B | 75.1 | 76.9 | 95.0 | 59.8 | 85.4 | 74.6 | 78.0 | 56.0 |
In Tab. 1, we compare our model with existing MLLMs across 7 multimodal benchmarks. Compositional MLLMs demonstrate superior overall performance to native MLLMs. For example, InternVL-2.5-2B outperforms existing native MLLMs on most benchmarks, indicating that current native MLLMs still have significant room for improvement. In contrast, our proposed NaViL exceeds all existing native MLLMs in overall performance with a relatively small parameter size. Compared to the compositional baseline InternVL-2.5-2B, which uses the same LLM, NaViL also achieves comparable performance on most benchmarks. It is worth noting that the 300M visual encoder used by InternVL-2.5-2B is distilled from InternViT-6B VLM:InternVL , a pre-trained encoder with a significantly larger parameter size. This demonstrates the superiority of our visual architecture design and visual parameter scaling strategy.
In Tab. 2, we further compare our model with existing MLLMs on mainstream visual question answering tasks. NaViL's average performance still leads previous state-of-the-art native MLLMs and is roughly on par with compositional baselines that require pre-trained encoders. Specifically, on benchmarks such as DocVQA Datasets:OCRVQA , ChartQA Datasets:ChartQA and InfoVQA mathew2022infographicvqa , NaViL significantly outperforms the previous state-of-the-art native MLLM, demonstrating the benefit of an optimally sized visual encoder for processing high-resolution images. However, NaViL still lags behind the best compositional MLLMs on some benchmarks. We believe that higher-quality instruction data and more powerful LLMs will further narrow this gap.
### 5.3 Qualitative Experiments
To further analyze the characteristics of native MLLMs, we visualize the attention maps of different LLM layers when using encoders of 150M and 1.2B parameters, as shown in Fig. 9. Two findings can be drawn from the figure. First, similar to previous native MLLMs mono_internvl , the attention patterns in shallow layers still exhibit obvious locality despite the presence of an encoder, gradually shifting toward global information as depth increases. For example, with a 150M encoder, image tokens in the first layer tend to attend to spatially adjacent tokens. However, when the visual encoder is scaled up to 1.2B, visual tokens in shallow layers already begin to attend more to global information. This indicates that a sufficiently large visual encoder can better pre-extract high-level semantic information from the entire image.
Second, from a cross-modal interaction perspective, a larger visual encoder also facilitates earlier interaction between visual and language features. With a 1.2B visual encoder, the attention weights between visual and text tokens in the first layer are significantly higher than with the 150M counterpart. Earlier interaction benefits feature alignment between modalities, offering one explanation for the improved performance achieved with larger encoders. We believe these findings will provide beneficial insights for developing native MLLMs. More visualizations can be found in the supplementary materials.
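The locality and cross-modal statistics discussed above can be quantified directly from a model's attention maps. The following sketch is our own illustration, not NaViL's released code; the function name `attention_profile` and the windowed definition of "local" attention are assumptions. Given one row-normalized attention map (averaged over heads), it measures how much attention mass image-token queries place on nearby image tokens versus on text tokens:

```python
import numpy as np

def attention_profile(attn, is_image, window=2):
    """Summarize one attention map (rows = queries, cols = keys).

    attn     : (T, T) array, each row sums to 1 (post-softmax, head-averaged).
    is_image : (T,) boolean array marking image tokens.
    window   : neighborhood radius (in token positions) defining "local".

    Returns (local_ratio, cross_modal_ratio):
      local_ratio       -- share of image-query attention on nearby image tokens,
      cross_modal_ratio -- share of image-query attention on text tokens.
    """
    T = attn.shape[0]
    local = cross = total = 0.0
    for q in np.flatnonzero(is_image):
        row = attn[q]
        total += row.sum()
        # attention from this image token to text tokens (cross-modal)
        cross += row[~is_image].sum()
        # attention to image tokens within +-window positions (locality)
        neigh = np.zeros(T, bool)
        neigh[max(0, q - window):min(T, q + window + 1)] = True
        local += row[neigh & is_image].sum()
    return local / total, cross / total
```

Applied to maps like those in Fig. 9, a high local ratio in shallow layers would reflect the locality seen with the 150M encoder, while a higher cross-modal ratio at layer 1 would reflect the earlier vision-language interaction observed with the 1.2B encoder.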
Figure 9: Visualization of attention maps in LLM-1.8B with different encoder sizes (i.e., 150M and 1.2B). Text and image tokens are in blue and green, respectively. A larger encoder allows the LLM to attend to global patterns at shallow layers while maintaining higher attention to textual tokens.
## 6 Conclusion
This paper systematically investigates native end-to-end training for MLLMs, examining its design space and scaling properties under data constraints. Our study reveals three key insights: 1) initialization from pre-trained LLMs, combined with visual encoders and an MoE architecture, significantly improves performance; 2) unlike traditional LLM scaling, visual encoder scaling is limited by the LLM's capacity; 3) the optimal encoder size scales log-proportionally with the LLM size. Based on these findings, we propose NaViL, a native MLLM that achieves competitive performance on diverse multimodal benchmarks, outperforming all existing native MLLMs. We hope these insights will inspire future research on next-generation MLLMs.
Limitations and Broader Impacts. Due to limited computational resources, this paper only investigates the scaling properties of native MLLMs up to 9B parameters. Experiments at larger scales (e.g., 30B, 70B, or 100B parameters) could further validate the observed scaling trend. In addition, this paper focuses only on the visual and linguistic modalities. Future research may explore broader modalities and provide deeper insights beyond the current visual-linguistic paradigm.
## Acknowledgments
The work is supported by the National Key R&D Program of China (NO. 2022ZD0161300, and NO. 2022ZD0160102), by the National Natural Science Foundation of China (U24A20325, 62321005, 62376134), and by the China Postdoctoral Science Foundation (No. BX20250384).
## References
- [1] Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265–279. PMLR, 2023.
- [2] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In ICCV, pages 8948–8957, 2019.
- [3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [5] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
- [6] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.
- [7] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
- [8] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- [9] ChameleonTeam. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [11] Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling. arXiv preprint arXiv:2407.06438, 2024.
- [12] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [14] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv:2404.16821, 2024.
- [15] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv: 2312.14238, 2023.
- [16] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
- [17] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pages 845–855, 2018.
- [18] Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- [19] Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.
- [20] Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. arXiv preprint arXiv:2502.06788, 2025.
- [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
- [22] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv: 2306.13394, 2023.
- [23] Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.
- [24] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- [25] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. NeurIPS, 35:26418–26431, 2022.
- [26] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
- [28] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo. Version 0.1. https://doi.org/10.5281/zenodo.5143773, 2021. DOI: 10.5281/zenodo.5143773.
- [29] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [30] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016.
- [31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. arXiv: 2304.02643, 2023.
- [32] Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. arXiv preprint arXiv:2504.10462, 2025.
- [33] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022.
- [34] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv: 2403.18814, 2024.
- [35] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv: 2310.03744, 2023.
- [36] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv: 2307.06281, 2023.
- [39] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
- [40] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
- [41] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv: 2310.02255, 2023.
- [42] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
- [43] Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In CVPR, 2025.
- [44] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024.
- [45] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022.
- [46] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022.
- [47] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021.
- [48] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. MM1: methods, analysis & insights from multimodal LLM pre-training. arXiv: 2403.09611, 2024.
- [49] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019.
- [50] OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
- [51] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2023.
- [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
- [54] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [55] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022.
- [56] Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, and Alaaeldin El-Nouby. Scaling laws for native multimodal models. arXiv preprint arXiv:2504.07951, 2025.
- [57] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
- [58] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
- [59] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- [60] Qwen Team. Qwen3 blog. https://qwenlm.github.io/blog/qwen3/, 2025.
- [61] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [62] Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora. arXiv preprint arXiv:2503.20680, 2025.
- [63] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [64] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, pages 14408–14419, 2023.
- [65] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need. arXiv: 2409.18869, 2024.
- [66] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- [67] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
- [68] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- [70] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv: 2311.16502, 2023.
- [71] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, pages 12104–12113, 2022.
- [72] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
## Technical Appendices and Supplementary Material
## Appendix A NaViL-9B: Scaling up to 9B parameters
To further demonstrate the scaling capability of our method, we trained NaViL-9B based on Qwen3-8B TransF:Qwen3 . The total number of activated parameters is 9.2B, of which 1.2B belong to the visual encoder. The training recipe is similar to that of NaViL-2B, as shown in Tab. 8, except that visual multi-scaling packing is disabled in the first sub-stage of pre-training for acceleration.
Tab. 3 presents a comparison of the total training tokens required by our method versus two compositional counterparts. Notably, our approach achieves comparable performance while using substantially fewer training tokens, demonstrating improved training efficiency.
Table 3: Comparison between NaViL and existing MLLMs on the number of training tokens.
| Model | Visual encoder pre-training | Multimodal training | Total |
| --- | --- | --- | --- |
| Qwen2.5VL bai2025qwen2 | unknown | 4.1T | >4.1T |
| InternVL2.5-8B InternVL-2.5 | >3.3T | 140B | >3.5T |
| NaViL-2B (ours) | 0 | 800B | 800B |
| NaViL-9B (ours) | 0 | 450B† | 450B |

† Due to limited computational resources and time, the current version of NaViL-9B in this paper is trained with only 450B tokens.
The performance on multimodal and visual question answering benchmarks is shown in Tab. 4. With a similar parameter size, our NaViL-9B outperforms all existing native MLLMs by a large margin on almost all benchmarks. Moreover, compared to the compositional baseline InternVL-2.5-8B of similar parameter size, NaViL-9B achieves competitive performance. These results show that our proposed native MLLM can be scaled up to larger parameter sizes with consistent performance gains.
## Appendix B More discussions on Compositional MLLMs and Native MLLMs
Figure 10: Paradigm Comparison between Compositional MLLMs and Native MLLMs. Compositional MLLMs adopt different training objectives and strategies (e.g. Contrastive Loss or Next-Token-Prediction) to pre-train the visual encoder and LLM separately, while native MLLMs optimize both image and text components in an end-to-end manner using a unified training objective (i.e. Next-Token-Prediction).
Fig. 10 further illustrates the difference between compositional MLLMs and native MLLMs. Compositional MLLMs typically have their components initialized by separate unimodal pre-training, where different training objectives and strategies are employed for the LLM and the visual encoder. For example, the visual encoder can be trained with an image-text contrastive objective (e.g., CLIP radford2021clip , SigLIP zhai2023siglip ) or a self-supervised objective (e.g., DINOv2 oquab2023dinov2 ). The complexity of such a training process makes scaling more difficult. In contrast, as discussed in shukor2025scaling , a native MLLM optimizes both image and text modalities end-to-end with a unified training objective, i.e., next-token prediction (NTP). This avoids introducing additional bias and significantly simplifies the scaling effort.
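The unified NTP objective applied to the interleaved image-text sequence can be written in a few lines. The following is a minimal NumPy sketch under our own assumptions (the function name `ntp_loss` and the convention of masking the loss to supervised positions are ours), not the actual training code:

```python
import numpy as np

def ntp_loss(logits, targets, loss_mask):
    """Single next-token-prediction objective over an interleaved
    image/text sequence, as used by native MLLMs.

    logits    : (T, V) unnormalized next-token scores at each position.
    targets   : (T,) index of the actual next token.
    loss_mask : (T,) 1.0 where the loss applies (e.g. supervised text
                positions), 0.0 elsewhere.
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -logp[np.arange(len(targets)), targets]
    # average only over positions selected by the mask
    return (nll * loss_mask).sum() / loss_mask.sum()
```

Because the same cross-entropy is applied regardless of whether a position holds an image or a text token, no modality-specific loss (e.g., contrastive alignment) is needed, which is the simplification the paradigm comparison above highlights.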
## Appendix C More Related Works
Research on Neural Scaling Laws.
The foundational work on neural scaling laws began in the Natural Language Processing (NLP) domain, where openai2020scaling established predictable power-law relationships demonstrating that the loss ($L$) scales reliably with model size ($N$) and data size ($D$), and that larger, decoder-only Transformer models are more compute-efficient. Subsequent work ghorbani2021scaling extended this research to encoder-decoder architectures, observing consistent scaling exponents on Neural Machine Translation (NMT) tasks. Driven by these successes, zhai2022scaling confirmed the applicability of scaling laws to Vision Transformers (ViT) in the vision domain, systematically demonstrating continuous performance improvement when scaling both model size (up to 2 billion parameters) and training data. Most recently, these principles have been generalized to large multimodal models: aghajanyan2023scaling developed scaling laws that unify the contributions of text, image, and speech modalities by explicitly modeling synergy and competition as an additive term. Building on this, shukor2025scaling explored Native Multimodal Models (NMMs) built with Mixture of Experts (MoE), finding an unbalanced scaling law suggesting that scaling training tokens ($D$) is more critical than scaling active parameters ($N$) as the compute budget grows.
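The power-law relationship referenced above is commonly written in the following parametric form (a Chinchilla-style sketch for orientation; the constants $E$, $A$, $B$, $\alpha$, $\beta$ are fitted per setting and are not taken from the works cited here, and the multimodal synergy term of aghajanyan2023scaling is omitted):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $L$ is the loss, $N$ the number of (active) parameters, and $D$ the number of training tokens; the unbalanced scaling observed by shukor2025scaling roughly corresponds to the data term dominating the parameter term as compute grows.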
## Appendix D Implementation Details
The hyperparameters of the model architecture for NaViL-2B and NaViL-9B are listed in Tab. 6, while the training recipes for NaViL-2B and NaViL-9B are provided in Tab. 7 and Tab. 8, respectively. The high-quality multimodal data used in pre-training and supervised fine-tuning is from InternVL-2.5 InternVL-2.5 and covers various domains such as image captioning, general question answering, multi-turn dialogue, charts, OCR, documents, and knowledge, while the pure-language data is primarily from InternLM2.5 cai2024internlm2 .
Table 4: Comparison between NaViL-9B and existing MLLMs on multimodal benchmarks. “#A-Param” denotes the number of activated parameters. † InternVL-2.5-8B adopts the same high-quality data as NaViL-9B, so we mark it as the compositional counterpart. Note that its 300M visual encoder is distilled from another 6B large encoder. * denotes our reproduced results. Bold and underline indicate the best and the second-best performance among native MLLMs, respectively. For MME, we sum the perception and cognition scores. Average scores are computed by normalizing each metric to a range between 0 and 100.
| Compositional MLLMs: | | | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileVLM-V2 chu2024mobilevlm | 1.7B | $-$ | $-$ | $-$ | 57.7 | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ | $-$ |
| MobileVLM-V2 chu2024mobilevlm | 3.0B | $-$ | $-$ | $-$ | 63.2 | $-$ | $-$ | $-$ | 57.5 | $-$ | $-$ | $-$ | $-$ |
| Mini-Gemini VLM:MiniGemini | 3.5B | $-$ | 31.1 | 31.7 | 59.8 | 1653 | 29.4 | $-$ | 56.2 | 34.2 | $-$ | $-$ | $-$ |
| MM1-MoE-Chat VLM:MM1 | 3.5B | $-$ | 42.2 | 38.6 | 70.8 | 1772 | 32.6 | $-$ | 72.9 | $-$ | $-$ | $-$ | $-$ |
| DeepSeek-VL lu2024deepseekvl | 2.0B | $-$ | 34.8 | 32.2 | 64.6 | 1532 | 31.1 | 409 | 57.8 | $-$ | 51.5 | $-$ | $-$ |
| PaliGemma beyer2024paligemma | 2.9B | $-$ | 33.1 | 34.9 | 71.0 | 1686 | 28.7 | 614 | 68.1 | $-$ | 68.3 | $-$ | $-$ |
| MiniCPM-V-2 yao2024minicpm | 2.8B | $-$ | 41.0 | 38.2 | 69.1 | 1809 | 38.7 | 605 | 74.1 | 71.9 | 62.9 | $-$ | $-$ |
| InternVL-1.5 VLM:InternVL-1.5 | 2.2B | 61.3 | 39.3 | 34.6 | 70.9 | 1902 | 41.1 | 654 | 70.5 | 85.0 | 69.8 | 74.8 | 55.4 |
| Qwen2VL Qwen2vl | 2.1B | 67.3 | 49.5 | 41.1 | 74.9 | 1872 | 43.0 | 809 | 79.7 | 90.1 | 74.7 | 73.5 | 65.5 |
| InternVL-2.5 chen2024expanding | 2.2B | 69.6 | 60.8 | 43.6 | 74.7 | 2138 | 51.3 | 804 | 74.3 | 88.7 | 74.9 | 79.2 | 60.9 |
| Qwen2VL Qwen2vl | 8.2B | 77.1 | 62.0 | 54.1 | 83.0 | 2327 | 58.2 | 866 | 84.3 | 94.5 | 83.0 | 83.0 | 76.5 |
| Qwen2.5-VL bai2025qwen2 | 8.2B | 80.2 | 67.1 | 58.6 | 83.5 | 2347 | 68.2 | 864 | 84.9 | 95.7 | 83.9 | 87.3 | 82.6 |
| † InternVL-2.5 chen2024expanding | 8.1B | 77.3 | 62.8 | 56.0 | 84.6 | 2344 | 64.4 | 822 | 79.1 | 91.9 | 84.5 | 84.8 | 75.7 |
| Native MLLMs: | | | | | | | | | | | | | |
| Fuyu-8B (HD) VLM:Fuyu-8b | 8B | $-$ | 21.4 | $-$ | 10.7 | $-$ | $-$ | $-$ | $-$ | $-$ | 64.5 | $-$ | $-$ |
| SOLO solo | 7B | $-$ | $-$ | $-$ | $-$ | 1260 | 34.4 | $-$ | $-$ | $-$ | 61.4 | $-$ | $-$ |
| Chameleon-7B team2024chameleon (performance from mono_internvl ) | 7B | 14.0 | 8.3 | 25.4 | 31.1 | 170 | 22.3 | 7 | 4.8 | 1.5 | 46.0 | 2.9 | 5.0 |
| EVE-7B diao2024EVE | 7B | 34.6 | 25.6 | 32.3 | 49.5 | 1483 | 25.2 | 327 | 51.9 | 22.0 | 48.5 | 19.5 | 20.0 |
| EVE-7B (HD) diao2024EVE | 7B | 45.2 | 25.7 | 32.6 | 52.3 | 1628 | 34.2 | 398 | 56.8 | 53.0 | 61.0 | 59.1 | 25.0 |
| Emu3 emu3 | 8B | $-$ | 37.2 | 31.6 | 58.5 | $-$ | $-$ | 687 | 64.7 | 76.3 | 70.0 | 68.6 | 43.8 |
| VoRA vora | 7B | $-$ | 33.7 | 32.2 | 64.2 | 1674 | $-$ | $-$ | 56.3 | $-$ | 65.6 | $-$ | $-$ |
| VoRA-AnyRes vora | 7B | $-$ | 33.7 | 32.0 | 61.3 | 1655 | $-$ | $-$ | 58.7 | $-$ | 61.1 | $-$ | $-$ |
| EVEv2 diao2025evev2 | 7B | 62.3 | 45.0 | 39.3 | 66.3 | 1709 | 60.0* | 702 | 71.1 | 77.4* | 74.8 | 73.9 | 45.8* |
| SAIL lei2025sail | 7B | 63.7 | 46.3 | 38.6* | 70.1 | 1719 | 57.0 | 783 | 77.1 | 78.4* | 76.7 | 69.7* | 47.3* |
| Mono-InternVL mono_internvl | 1.8B | 60.6 | 40.1 | 33.7 | 65.5 | 1875 | 45.7 | 767 | 72.6 | 80.0 | 68.6 | 73.7 | 43.0 |
| NaViL-2B (ours) | 2.4B | 68.8 | 78.3 | 41.8 | 71.2 | 1822 | 50.0 | 796 | 76.9 | 85.4 | 74.6 | 78.0 | 56.0 |
| NaViL-9B (ours) | 9.2B | 77.0 | 79.6 | 54.7 | 76.5 | 2225 | 66.7 | 837 | 77.2 | 90.6 | 82.4 | 85.4 | 70.2 |
Table 5: Comparison of NaViL and existing native MLLMs on three common NLP tasks. Except for Chameleon, models are evaluated using OpenCompass toolkit opencompass2023 .
| Model | # Params | | | |
| --- | --- | --- | --- | --- |
| InternLM2-Chat 2023internlm | 1.8B | 47.1 | 46.1 | 13.9 |
| Qwen3-8B (non-thinking) TransF:Qwen3 | 8B | 76.5 | 76.8 | 71.1 |
| EVE diao2024EVE | 7B | 43.9 | 33.4 | 0.7 |
| Chameleon team2024chameleon | 7B | 52.1 | - | 11.5 |
| Mono-InternVL mono_internvl | 2B | 45.1 | 44.0 | 12.3 |
| NaViL-9B (ours) | 9.2B | 74.9 | 75.1 | 66.2 |
Table 6: Hyper-Parameters of Model Architecture.
| Component | Hyper-Parameter | NaViL-2B | NaViL-9B |
| --- | --- | --- | --- |
| Visual encoder | # Params | 0.6B | 1.2B |
| | depth | 24 | 32 |
| | width | 1472 | 1792 |
| | MLP width | 5888 | 7168 |
| | # attention heads | 23 | 28 |
| LLM (w/ MoE) | # experts | 2 | 2 |
| | # A-Params | 1.8B | 8.0B |
| | depth | 24 | 36 |
| | width | 2048 | 4096 |
| | MLP width | 8192 | 12288 |
| | # attention heads | 16 | 32 |
Table 7: Hyper-parameters for training NaViL-2B.
| Configuration | Pre-training (S1.1) | Pre-training (S1.2) | Fine-tuning (S2) |
| --- | --- | --- | --- |
| Maximum number of image patches | $4096$ | $12188$ | $24576$ |
| Training steps | $70$ k | $40$ k | $30$ k |
| Global batch size | $7,000$ | $4,614$ | $2,234$ |
| Weight decay | $0.05$ | $0.1$ | $0.01$ |
| Learning rate schedule | constant with warm-up | constant with warm-up | cosine decay |
| Peak learning rate | $5e^{-5}$ | $5e^{-5}$ | $2e^{-5}$ |
| Visual Multi-scale Packing | ✓ | ✓ | ✓ |
| LLM max sequence length | $16,384$ | $16,384$ | $16,384$ |
| Warm-up steps | $200$ | $200$ | $200$ |
| Optimizer | AdamW | AdamW | AdamW |
| Optimizer hyperparameters | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ |
| Gradient accumulation | $1$ | $1$ | $1$ |
| Numerical precision | $\mathtt{bfloat16}$ | $\mathtt{bfloat16}$ | $\mathtt{bfloat16}$ |
Table 8: Hyper-parameters for training NaViL-9B.
| Configuration | Pre-training (S1.1) | Pre-training (S1.2) | Fine-tuning (S2) |
| --- | --- | --- | --- |
| Maximum number of image patches | $4096$ | $12188$ | $24576$ |
| Training steps | $50$ k | $33$ k | $6$ k |
| Weight decay | $0.05$ | $0.1$ | $0.01$ |
| Global batch size | $10,300$ | $1,792$ | $3,520$ |
| Visual Multi-scale Packing | ✗ | ✓ | ✓ |
| Learning rate schedule | constant with warm-up | constant with warm-up | cosine decay |
| Peak learning rate | $5e^{-5}$ | $5e^{-5}$ | $2e^{-5}$ |
| LLM max sequence length | $16,384$ | $16,384$ | $16,384$ |
| Warm-up steps | $200$ | $200$ | $200$ |
| Optimizer | AdamW | AdamW | AdamW |
| Optimizer hyperparameters | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ | $\beta_{1}=0.9,\beta_{2}=0.95,\epsilon=1e^{-8}$ |
| Gradient accumulation | $1$ | $1$ | $1$ |
| Numerical precision | $\mathtt{bfloat16}$ | $\mathtt{bfloat16}$ | $\mathtt{bfloat16}$ |
## Appendix E The NLP capability
We also evaluate the NLP capability of our model on three popular NLP tasks, as shown in Tab. 5. Thanks to the modality-specific MoE architecture, NaViL maintains the NLP capabilities of the LLM it is initialized from (Qwen3-8B). Despite not using a large amount of high-quality text data, NaViL performs well on these common NLP tasks and shows much stronger NLP capabilities than other native MLLMs, demonstrating its data efficiency.
## Appendix F More Qualitative Results
More visualization results of multimodal understanding are provided below.
Image Captioning and Visual Question Answering
<details>
<summary>figures/vis/vis-understanding-conventional-center.jpg Details</summary>

### Visual Description
## Photograph: San Diego Convention Center Aerial View
### Overview
This is an aerial photograph of the San Diego Convention Center, situated along the waterfront. The image showcases the building's distinctive architectural design, its surrounding landscape, and the harbor area. There is no data or chart present in the image. It is a visual representation of a location.
### Components/Axes
There are no axes or components in the traditional sense of a chart or diagram. The key elements are:
* **San Diego Convention Center:** The primary subject, a large, modern building with a unique roofline.
* **Waterfront/Harbor:** The body of water adjacent to the convention center, with boats and docks visible.
* **Cityscape:** The distant city skyline across the bay.
* **Landscaping:** Palm trees and green spaces surrounding the building.
* **Roads and Railway:** Roads and a railway line running alongside the convention center.
* **Sculpture:** A large, colorful sculpture is visible in the foreground, left of the image.
### Detailed Analysis or Content Details
The convention center is a long, rectangular structure with a series of arched roof sections. The building appears to be constructed primarily of concrete and glass. The waterfront is filled with various types of boats, including sailboats and larger vessels. The cityscape in the background features a mix of high-rise buildings. The landscaping includes numerous palm trees, creating a tropical aesthetic. Roads are visible with light traffic. A railway line runs parallel to the convention center. The sculpture in the foreground is abstract and colorful.
### Key Observations
The photograph highlights the scale of the convention center and its prominent location on the waterfront. The architectural design is striking and modern. The surrounding landscape is well-maintained and adds to the overall aesthetic appeal. The presence of the railway line suggests accessibility to the center via public transportation.
### Interpretation
The image serves as a visual representation of the San Diego Convention Center as a significant landmark and event venue. The photograph emphasizes the center's integration with the surrounding environment, including the waterfront and cityscape. The overall impression is one of modernity, accessibility, and a vibrant urban setting. The image does not contain any quantifiable data or trends, but rather provides a contextual understanding of the location and its features. The photograph is likely intended to showcase the convention center's appeal as a destination for events and conferences. The inclusion of the sculpture suggests a commitment to public art and cultural enrichment.
</details>
Prompt: Describe this image in detail. Response: The image shows an aerial view of a modern architectural complex situated near a coastal area. The building has a distinctive design with a series of curved, wave-like structures on its roof, giving it a futuristic appearance. The exterior is primarily composed of light-colored materials, possibly concrete or stone, with large glass windows. In front of the building, there is a wide, multi-lane road lined with palm trees, suggesting a warm, coastal climate. The road appears to be a major thoroughfare, with designated bike lanes and pedestrian pathways. There are also train tracks running parallel to the road, indicating the presence of a public transportation system. The background features a large body of water, likely a bay or ocean, with several boats and yachts docked at a marina. The coastline is lined with more buildings, and there are hills or mountains visible in the distance, adding to the scenic view. The overall setting suggests a bustling urban area with a focus on modern infrastructure and a strong connection to the waterfront. Prompt: Which city is this building most likely in? Response: This building is most likely in San Diego, California. The architectural style and the surrounding landscape, including the water and palm trees, are characteristic of San Diego.
<details>
<summary>figures/vis/vis-understanding-worldcup.png Details</summary>

### Visual Description
## Photograph: Argentina World Cup Victory Celebration
### Overview
The image depicts the Argentinian national football team celebrating their victory at the 2022 FIFA World Cup. The team is gathered closely together, with Lionel Messi prominently holding the World Cup trophy aloft. Fireworks are visible in the background, adding to the celebratory atmosphere. The players are wearing their team jerseys and are adorned with gold medals.
### Components/Axes
This is a photograph and does not contain axes, charts, or legends in the traditional sense. However, key visual components include:
* **Players:** Approximately 20-25 players are visible, all wearing the Argentinian national team kit.
* **Lionel Messi:** The central figure, holding the World Cup trophy.
* **World Cup Trophy:** A golden, cup-shaped object held by Messi.
* **Medals:** Gold medals worn around the necks of the players.
* **Jerseys:** Light blue and white striped jerseys with the Argentinian flag and Adidas logo.
* **Fireworks:** Bright white sparks and trails in the background.
* **Numbers on Jerseys:** Visible numbers include 11, 7, 23, 1, and 9.
### Detailed Analysis or Content Details
The photograph captures a moment of intense joy and triumph. The players are exhibiting a range of emotional expressions, including shouting, smiling, and raising their arms in celebration. The composition is centered around Messi and the trophy, emphasizing their significance. The lighting is bright, highlighting the players and the trophy while creating a dramatic effect with the fireworks.
* **Player Uniforms:** The jerseys are predominantly light blue and white vertical stripes. The Adidas logo is visible on the upper right chest. The Argentinian flag is displayed on the upper left chest.
* **Messi's Attire:** Messi is wearing a black bisht (traditional Arabic cloak) over his jersey.
* **Trophy Details:** The trophy appears to be gold-colored and has a distinctive shape with figures holding it aloft.
* **Medal Details:** The medals are gold with a red ribbon.
* **Jersey Numbers:** The visible jersey numbers are: 11, 7, 23, 1, 9, and others partially obscured.
### Key Observations
* The central focus is clearly on Lionel Messi and the World Cup trophy.
* The players' expressions convey a strong sense of collective joy and accomplishment.
* The fireworks add a dynamic and celebratory element to the scene.
* The presence of the bisht on Messi is notable, potentially signifying a gesture of respect from the host nation, Qatar.
### Interpretation
The photograph encapsulates a historic moment for Argentinian football and its fans. It represents the culmination of years of effort and dedication, resulting in a World Cup victory. The image is a powerful symbol of national pride, teamwork, and individual achievement. The inclusion of the bisht on Messi suggests a cultural exchange and a gesture of hospitality from Qatar, the host country. The overall composition and emotional intensity of the photograph contribute to its impact as a visual record of a significant sporting event. The image is not presenting data, but rather a moment in time. It is a documentation of an event, and its interpretation relies on understanding the context of the 2022 FIFA World Cup and the significance of the victory for Argentina.
</details>
Prompt: Who is the person in black in the middle? Response: The person in black in the middle is Lionel Messi. Prompt: What are they celebrating about? Response: The image shows a group of soccer players celebrating with a trophy, wearing medals and jerseys. They are likely celebrating a victory in a soccer tournament or championship.
<details>
<summary>figures/vis/vis-understanding-gta5.png Details</summary>

### Visual Description
## Photograph: Street Scene - Grand Theft Auto V
### Overview
The image depicts a street scene from the video game Grand Theft Auto V. The perspective is from a third-person view, focusing on a character standing on a sidewalk. The scene includes buildings, a street, vehicles, and pedestrian infrastructure. There is no factual data or charts present in the image; it is a visual representation of a game environment.
### Components/Axes
The scene contains the following visible elements:
* **Buildings:** Several buildings line the street, including a store labeled "WINE" and a butcher shop labeled "ON THE BONE MEAT MARKET - GROCERIES".
* **Street:** A two-lane street runs through the center of the image.
* **Vehicles:** A black car is parked in front of the butcher shop. Other vehicles are visible further down the street.
* **Sidewalk:** A concrete sidewalk runs along the front of the buildings.
* **Pedestrian Infrastructure:** Includes a utility pole, a fire hydrant, and a manhole cover.
* **Character:** A male character is standing on the sidewalk, facing away from the camera.
* **Game UI:** A minimal game user interface is visible in the bottom-left corner, displaying a radar and health/armor indicators.
### Detailed Analysis or Content Details
The following text is visible on the buildings:
* **"WINE"** - Sign above a storefront.
* **"ON THE BONE MEAT MARKET - GROCERIES"** - Sign above a butcher shop. The sign also includes the text "OPEN 7 DAYS A WEEK" and a phone number, which appears to be "555-555-5555".
* A poster in the window of the butcher shop displays the text "ON THE BONE" in large letters, along with an image of a cow.
* A smaller sign on the butcher shop reads "Meat Market".
The character is wearing a blue long-sleeved shirt and dark pants. The street appears to be relatively empty, with minimal pedestrian traffic. The lighting suggests it is daytime.
### Key Observations
The scene appears to be set in a somewhat rundown urban area. The presence of the "WINE" store and the butcher shop suggests a residential or mixed-use neighborhood. The overall aesthetic is consistent with the visual style of Grand Theft Auto V.
### Interpretation
The image provides a snapshot of a typical urban environment within the game world of Grand Theft Auto V. It showcases the game's attention to detail in recreating a believable city setting. The scene doesn't convey any specific narrative or data points, but rather serves as a visual representation of the game's environment. The inclusion of the character suggests the player is exploring the area. The scene evokes a sense of everyday life within a fictionalized urban landscape. The lack of significant activity or points of interest suggests a relatively quiet or unremarkable location within the larger game world. The image is primarily a visual asset, designed to immerse the player in the game's environment.
</details>
Prompt: What is the content in this image? Response: The image appears to be a screenshot from a video game, likely Grand Theft Auto V (GTA V), given the visual style and interface elements. It shows a character standing on a sidewalk in a small town or suburban area. The environment includes a street with tram tracks, a few buildings with signs, and some cars parked along the road. The buildings have signs for businesses like "Wine" and "On The Bone." The scene has a somewhat deserted look, with no other characters or significant activity visible.
OCR
<details>
<summary>figures/vis/vis-understanding-nutritional.jpeg Details</summary>

### Visual Description
## Nutritional Information Panel
### Overview
This image displays a nutritional information panel for a food product, specifically honey. The panel provides details on serving size, servings per package, and the average quantity of various nutrients per serving and per 100g of the product. Additional information includes ingredients and storage instructions.
### Components/Axes
The panel is structured as a table with the following columns:
* **Nutrient:** Lists the nutritional components (Energy, Protein, Fat, Carbohydrates, Sugars, Sodium).
* **Average Quantity Per Serving:** Provides the amount of each nutrient in a 10g serving.
* **Average Quantity Per 100g:** Provides the amount of each nutrient in 100g of the product.
Additional text elements include:
* **Serving size:** 10g
* **Servings per package:** 25
* **Ingredients:** 100% Pure New Zealand Honey
* **Storage Instructions:** Keep Stored Below 20°C, BPA-Free Container.
* **Warning:** Not suitable for children under 1 year of age.
* A symbol of a snowflake inside a circle.
### Detailed Analysis or Content Details
| Nutrient | Average Quantity Per Serving | Average Quantity Per 100g |
|----------------|------------------------------|-----------------------------|
| Energy | 143kJ | 1430 kJ |
| Protein | less than 0.1g | 0.16g |
| Fat, total | less than 0.1g | 0.37g |
| – saturated | less than 0.1g | 0.26g |
| Carbohydrates | 8.31g | 83.1g |
| Sugars | 8.11g | 81.1g |
| Sodium | 0.41mg | 4.04mg |
### Key Observations
* The product is primarily composed of carbohydrates and sugars, constituting the vast majority of its nutritional content.
* The fat and protein content are very low.
* The sodium content is minimal.
* The nutritional values per 100g are approximately ten times the values per serving (10g).
### Interpretation
The nutritional information indicates that this honey product is a concentrated source of carbohydrates, specifically sugars. It provides a quick source of energy due to its high sugar content. The low levels of fat, protein, and sodium suggest it is a relatively "pure" carbohydrate source. The storage instructions indicate a need to keep the product cool, likely to maintain its viscosity and prevent crystallization. The warning about not being suitable for children under 1 year of age is standard for honey due to the risk of infant botulism. The ingredient list confirms the product is 100% honey sourced from New Zealand. The snowflake symbol is likely a quality mark or indication of origin.
</details>
Prompt: Extract all the text in this image. Response: NUTRITIONAL INFORMATION Serving size: 10 g Servings per package: 25 Average Quantity Per Serving | Average Quantity Per 100g Energy | 143kJ | 1430 kJ Protein | less than 0.1g | 0.16g Fat, total | less than 0.1g | 0.37g - saturated | less than 0.1g | 0.26g Carbohydrates | 8.31g | 83.1g Sugars | 8.11g | 81.1g Sodium | 0.41mg | 4.04mg INGREDIENTS: 100% Pure New Zealand Honey Keep Stored Below 20 deg°C, BPA-Free Container. Not suitable for children under 1 year of age.
<details>
<summary>figures/vis/vis-understanding-receipt.jpeg Details</summary>

### Visual Description
## Receipt: Chapelbank Hotel Lunch Bill
### Overview
This is a photograph of a printed receipt from the Chapelbank Hotel, detailing a lunch bill for 5 people. The receipt lists individual items, their prices, and the total amount due.
### Components/Axes
The receipt contains the following information:
* **Header:** Chapelbank Hotel address and contact details.
* **Table Number:** Table 5
* **Itemized List:** Description of food items, quantity (where applicable), unit price, and total price for each item.
* **Subtotal:** Total amount due.
* **Note:** "Food = Lunch x 5 people" written at the top.
### Detailed Analysis or Content Details
Here's a breakdown of the items and costs:
* **Chapelbank Hotel:**
* 69 East High Street
* Forfar DD8 2EP
* Tel No 01307 463151
* **Table:** 5
* **Restaurant:**
* 3x Soup: 3.95 each, Total 11.85
* Lunch Main Lasagne (V): 6.95
* Misc Food: 6.95
* 2x Chips / Onion Rings 1.75 each, Total 3.50
* -FOOD: 29.25 - (This appears to be a running total or a subtotal before further calculations)
* **Prev Bal:** 29.25
* **SUBTOTAL:** 29.25
### Key Observations
* The receipt appears to be for a single order, with a subtotal of £29.25.
* The "Prev Bal" is equal to the "SUBTOTAL", suggesting this is the final bill amount.
* The handwritten note at the top indicates the bill is for a lunch for 5 people.
* The item "Lasagne (V)" indicates a vegetarian option.
* The item "Misc Food" is vague and doesn't specify what was ordered.
### Interpretation
The receipt represents a straightforward transaction for a lunch at the Chapelbank Hotel. The total cost of £29.25 is likely the final amount paid for the meal for 5 people. The presence of a vegetarian option (Lasagne (V)) suggests the hotel caters to different dietary requirements. The "Misc Food" item could represent a combination of smaller items or a dish not specifically listed on the menu. The receipt provides a record of the purchase, including the date (implied by the print date of the receipt) and the location. The handwritten note clarifies the context of the bill – it's for a group lunch.
</details>
Prompt: Describe the image. Response: The image is a receipt from Chapelbank Hotel for a lunch order for five people. The items listed on the receipt include 3x soup at £3.95 each, totaling £11.85; lunch main course at £6.95; lasagne (V) at £6.95; misc food at £6.95; and 2x chips/onion rings at £1.75 each, totaling £3.50. The subtotal for the food is £29.25. The receipt also includes a previous balance of £29.25, making the total amount due £29.25. The hotel’s address is 69 East High Street, Forfar DD8 2EP, and the telephone number is 01307 463151. Prompt: How many people are attending this lunch? Response: There are 5 people attending this lunch. This is indicated at the top of the receipt where it says "Food = Lunch x 5 people".
<details>
<summary>figures/vis/vis-understanding-road.jpeg Details</summary>

### Visual Description
## Road Signage: Highway Exit Information
### Overview
The image depicts a highway scene with multiple overhead exit signs. The signs provide directional information for various destinations, including cities, roads, and airports. The image is taken from the perspective of a driver on the highway.
### Components/Axes
The image primarily consists of road signs. There are no axes or scales in the traditional sense, but the signs themselves function as information displays. The signs are color-coded: green for main destinations, blue for auxiliary information, and yellow for warnings.
### Detailed Analysis or Content Details
Here's a breakdown of the visible signs, reading from left to right:
* **Sign 1 (Leftmost):**
* Shape: Rectangular, green background.
* Text: "EXIT 45" (top-left, in white). Below that, a circular logo with "267 WEST" (white text on a black circle). Below that: "TOLL" (white text).
* Destinations: "Reston", "Herndon", "Dulles Airport" (white text, listed vertically).
* Distance: "NO TOLL 1 MILE" (white text, bottom).
* **Sign 2:**
* Shape: Rectangular, green background.
* Text: "EXIT 46 A" (top-left, in white).
* Route: "123 SOUTH" (white text on a black circle).
* Destinations: "Chain Br Rd", "Tysons Corner" (white text, listed vertically).
* Distance: "1/2 MILE" (white text, bottom).
* **Sign 3:**
* Shape: Rectangular, green background.
* Text: "EXIT 46 B" (top-left, in white).
* Route: "123 NORTH" (white text on a black circle).
* Destinations: "Chain Br Rd", "McLean" (white text, listed vertically).
* Text: "EXIT ONLY" (yellow background, black text, bottom-right).
* **Sign 4 (Rightmost, partially visible):**
* Shape: Rectangular, green background.
* Text: "EXIT 45" (top-left, in white).
* Destination: "Leesburg" (white text).
The road itself has multiple lanes of traffic. Visible vehicles include cars, SUVs, and a large commercial truck with the URL "www.knighthawk.com" visible on its side. A concrete sound barrier runs along the right side of the highway.
### Key Observations
The signs are arranged in a logical order, providing drivers with advance notice of upcoming exits. The use of color-coding helps to quickly identify the type of information being presented. The presence of "TOLL" on the first sign indicates a toll road. The "EXIT ONLY" designation on the third sign suggests that the exit is not a through route.
### Interpretation
The image demonstrates a typical highway exit information system designed to guide drivers to their desired destinations. The signs provide clear and concise information about upcoming exits, including route numbers, destinations, and distances. The arrangement of the signs and the use of color-coding contribute to the overall effectiveness of the system. The presence of a toll road and an exit-only route adds complexity to the navigation process, requiring drivers to pay attention to the signage. The truck's website suggests a logistics or transportation company. The overall scene suggests a busy highway corridor in a suburban or metropolitan area.
</details>
Prompt: Where is the left exit going to? Response: The left exit is going to Reston, Herndon, and Dulles Airport. Prompt: Which road leads to the EXIT 46 B? Response: The road that leads to EXIT 46 B is Chain Br Rd North.
Multi-lingual OCR
<details>
<summary>figures/vis/vis-understanding-zuowen.jpeg Details</summary>

### Visual Description
## Text Document: Handwritten Chinese Text
### Overview
The image presents a handwritten text document in Chinese characters. It appears to be a composition or essay, likely a student's work given the scoring indication at the top-left. The text is arranged in a grid-like format, filling the majority of the image.
### Components/Axes
The document has the following visible components:
* **Header:** Top-left corner contains "作文(60分)" (Zuowen (60 fen) - Composition (60 points)).
* **Title:** Top-right corner contains "题目 考查 目标, 享受 生命" (Timu kaocha mubiao, xiangshou shenming - Topic: Examine goals, enjoy life).
* **Body:** The main portion of the image consists of the handwritten Chinese text, arranged in rows and columns.
### Detailed Analysis or Content Details
The text is transcribed below, followed by an English translation. Due to the handwriting, some characters are difficult to discern with absolute certainty, and the translation reflects the most probable interpretation.
**Chinese Text (Transcription):**
目标之光,如月光清辉;生命之辉,如祥星闪耀。失去了祥星,月色便清冷孤寂;不见了明月,祥星也必黯然失色。只有专注月光,方有方向;只有欣赏祥星,方可享受生命。星月之辉,夜空之辉,考查人;考查目标,享受生命。为可体味味人间清欢。
**English Translation:**
The light of goals is like the clear glow of the moonlight; the brilliance of life is like the twinkling of auspicious stars. Losing the auspicious stars, the moonlight becomes cold and lonely; not seeing the bright moon, the auspicious stars will inevitably lose their color. Only by focusing on the moonlight can one have direction; only by appreciating the auspicious stars can one enjoy life. The brilliance of the stars and moon, the brilliance of the night sky, examines people; examines goals, enjoys life. It allows one to savor the joys of the human world.
### Key Observations
* The text uses a poetic and metaphorical style, comparing goals to moonlight and life to stars.
* The structure is balanced, with parallel phrases and contrasting ideas.
* The handwriting is neat but shows some variation in stroke thickness and character formation.
* The scoring indication suggests this is a graded assignment.
### Interpretation
The text explores the relationship between having goals and enjoying life. It suggests that goals (represented by the moonlight) provide direction, while appreciating life (represented by the stars) brings fulfillment. The loss of either leads to a sense of emptiness or meaninglessness. The final sentence implies that by pursuing goals and appreciating life, one can find joy and contentment in the world. The text is a philosophical reflection on the importance of purpose and enjoyment, presented in a lyrical and evocative manner. The use of celestial imagery emphasizes the timeless and universal nature of these themes. The text is a short essay, likely written as part of a composition assignment, and demonstrates a thoughtful and articulate understanding of the prompt.
</details>
Prompt: 请先提取里面的文本,再翻译为英文。 Response: 文本内容: 作文(60分) 题目:专注目标,享受生命。 目标之光,如月光清辉;生命之辉,如群星闪耀。失去了群星,月色便显得冷寂;不见了明月,群星也势必黯然失色。只有专注于月光,方有方向;只有欣赏群星,方可享受生命。星月交辉,夜空撩人;专注目标,享受生命,方可体味人间清欢。 翻译为英文: Essay (60 points) Title: Focus on your goal, enjoy life. The light of the goal is like the clear glow of moonlight; the brilliance of life is like the dazzling stars. Without the stars, the moonlight becomes cold and desolate; without the moon, the stars will inevitably become dull. Only by focusing on the moonlight can you find direction; only by appreciating the stars can you enjoy life. The stars and moon shine together, making the night sky enchanting; focusing on your goal, enjoying life, allows you to savor the quiet joy of life.
Document Understanding
<details>
<summary>figures/vis/vis-understanding-docvqa.png Details</summary>

### Visual Description
## Document: KBA Marketing Group – 2001 Venue Sponsorship Agreement
### Overview
This document is a legal agreement between KBA Marketing Group ("KBA") and Vitruce's, Inc. ("Vitruce's, Inc.") outlining the terms of a venue sponsorship for the Camel Club Program ("CCP") from January 1, 2001, to December 31, 2001. It details financial contributions, responsibilities, and legal considerations.
### Components/Axes
The document is structured into sections:
* **I. Venue Information:** Details about the venue, Vitruce's.
* **II. Sponsorship Funds and Extra Value Items:** Breakdown of financial contributions and associated benefits.
* **III. KBA will provide the following Extra Value Items:** Lists additional benefits provided by KBA.
* **IV. Tobacco License Info:** Information regarding tobacco license requirements.
* **V. Bar Essentials:** Lists essential bar supplies.
* **VI. Responsibilities:** Outlines the responsibilities of both KBA and Vitruce's, Inc.
* **VII. Legal Disclaimer:** Contains legal disclaimers and contact information.
### Content Details
**I. Venue Information**
* **Date of Agreement:** 3/1/01
* **Term of Agreement:** January 1, 2001 – December 31, 2001
* **Sponsorship Program:** Camel Club Program (“CCP”)
* **Venue Name:** Vitruce’s
* **Venue Code:** 4388
* **Venue Address:** 1832 East North Avenue, Milwaukee WI 53202
* **FEIN or SSN:** PERSONAL/CONFIDENTIAL MATERIAL REDACTED
* **Venue Owner Name:** Vitruce’s, Inc.
* **Venue Owner Category:** Corporation
**II. Sponsorship Funds and Extra Value Items**
* **Special Events Fund:** $1,595.00 to offset costs for a minimum of 1 Special Events from Jan. 1 – June 30, 2001. $1,595.00 to offset costs for a minimum of 1 Special Events from July 1 – Dec. 31, 2001.
* **Printed Materials Fund:** $810.00 to offset costs for the printing and shipping of custom flyers, as well as the purchases of local media, including quarter-page advertisements, and additional printed material and display advertising, as available, at the discretion of KBA, from Jan. 1 – Dec. 31, 2001.
* **Total Direct Payments:** $2,000.00, payable on or before the following payment dates in the following amounts:
* March 31, 2001: $500.00
* June 30, 2001: $500.00
* September 30, 2001: $500.00
* December 31, 2001: $500.00
* **TOTAL SPONSORSHIP FUNDS:** $6,000.00 (This amount includes all sub-totals set forth above)
**III. KBA will provide the following Extra Value Items**
* KBA will also provide the following Extra Value Items at no cost to venue:
* All state and local retail tobacco licensing procedures for Venue (if Venue is currently without a license), pursuant with tobacco laws current as of June 30, 2001.
* Branded ash/dusty beer coasters, napkins, generic branded matchbooks, generic branded liability & request forms.
* Approximately 500 branded bar napkins, generic branded liability & request forms.
**IV. Tobacco License Info**
* Vitruce’s, Inc. represents and warrants to KBA that Venue currently possesses all licenses and permits necessary to sell and serve tobacco products and alcoholic beverages, and that said licenses are current as of date hereof and will remain current throughout the Term.
**V. Bar Essentials**
* KBA will provide, at no charge to the Venue, approximately 500 branded bar napkins, generic branded liability & request forms, generic branded matchbooks, generic branded ash/dusty beer coasters, pursuant to the Extra Value Items as set forth above.
**VI. Responsibilities**
* **KBA Responsibilities:** KBA will provide funds as outlined above and will provide assistance with tobacco licensing.
* **Vitruce’s, Inc. Responsibilities:** Vitruce’s, Inc. will comply with all applicable laws and regulations, maintain necessary licenses, and properly display promotional materials.
**VII. Legal Disclaimer**
* The document includes a standard legal disclaimer regarding the agreement's terms and conditions.
* **Contact Information:**
* KBA Marketing Group: 1-800-221-3383
* Kevin G. & Associates, Inc.: 1-800-221-3383
### Key Observations
* The sponsorship agreement is for a fixed amount of $6,000.
* The funds are allocated across special events, printed materials, and direct payments.
* KBA provides additional value items beyond the financial contribution, including assistance with licensing and branded materials.
* The agreement emphasizes the importance of maintaining valid licenses for tobacco and alcohol sales.
* There is a section with redacted personal/confidential material.
### Interpretation
This document represents a standard sponsorship agreement between a marketing group (KBA) and a venue (Vitruce’s, Inc.). KBA is investing in Vitruce’s, Inc. to promote the Camel Club Program, likely through events and advertising. The agreement is structured to provide financial support and additional resources to the venue, while also ensuring compliance with relevant regulations. The inclusion of a legal disclaimer and specific contact information highlights the formal and legally binding nature of the agreement. The redacted information suggests sensitivity around personal or financial details. The breakdown of payments into quarterly installments indicates a structured approach to funding and accountability. The provision of branded materials suggests a focus on brand visibility and promotion within the venue.
</details>
Prompt: What is the duration of this venue sponsorship agreement? Response: January 1, 2001 - December 31, 2001 Prompt: Where is Vitucci’s located (complete address)? Response: 1832 East North Avenue, Milwaukee WI 53202 Prompt: What cigarette brand is mentioned in the agreement for delivery and distribution? Response: Camel cigarette brand styles purchased by Venue Owner for resale in Venue Prompt: Who has discretion over the distribution of flyers for special events? Response: KBA Marketing Group ("KBA") has discretion over the distribution of flyers for special events.
LaTeX and Math
<details>
<summary>figures/vis/vis-understanding-math.png Details</summary>

### Visual Description
## Mathematical Equation: GRPO Objective
### Overview
The image presents a complex mathematical equation: the GRPO (Group Relative Policy Optimization) objective. It is a formula used in reinforcement learning, likely for training a policy to favor high-advantage responses while staying close to a reference policy.
### Components/Axes
The equation is presented as a single block of text with several mathematical symbols and notations. Key components include:
* **Variables:** θ (theta), G, i, t, π (pi), Â (advantage estimate), ε (epsilon), β (beta), q, o<sub>i</sub>, o<sub>i,t</sub>
* **Functions/Operators:** 𝔼[·], P(·), min[·], clip(·), Σ (summation)
* **Notations:** 𝒥<sub>GRPO</sub>(θ), π<sub>θ</sub>, π<sub>θ_old</sub>, π<sub>ref</sub>, D<sub>KL</sub>
* **Equation Number:** (3) located in the top-right corner.
### Detailed Analysis / Content Details
The equation can be transcribed as follows:
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip}\left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right] - \beta \mathcal{D}_{KL}\left[ \pi_{\theta} \| \pi_{ref} \right] \right\}$$
### Key Observations
The equation involves nested summations and a minimization operation. The KL divergence term (D<sub>KL</sub>) acts as a regularizer that keeps the policy from deviating too far from a reference policy (π<sub>ref</sub>). The clipping function limits the probability ratio to the range (1 - ε, 1 + ε), a common technique in policy-gradient methods to improve stability. The ratio is taken with respect to π<sub>θ_old</sub>, the old policy from which the responses were sampled.
### Interpretation
This equation defines an objective for group-based policy optimization. The goal is to update the policy (parameterized by θ) so as to increase the likelihood of high-advantage responses while remaining close to a reference policy, as enforced by the KL divergence term. The objective balances improvement (through the advantage estimates Â<sub>i,t</sub>) against stability (through clipping of the ratio between π<sub>θ</sub> and π<sub>θ_old</sub>). The parameters β and ε control the strength of the KL penalty and the width of the clipping range, respectively. This formulation is designed to address the instability of naive policy-gradient updates; its practical behavior depends on the specific reinforcement learning problem to which it is applied.
</details>
Prompt: Convert the formula in the figure to latex format. Response: ```latex \begin{equation*} \begin{split} \mathcal{J}_{\text{GRPO}}(\theta) &= \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)] \\ & \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right] - \beta \mathcal{D}_{KL} \left[ \pi_{\theta} || \pi_{ref} \right] \right\}, \end{split} \end{equation*} ```
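For intuition, the clipped-surrogate term inside the objective above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration; the names `clipped_surrogate`, `ratio`, and `advantage` are ours, not from any released codebase.

```python
# Illustrative sketch (hypothetical names) of the per-token clipped term
# min(r * A, clip(r, 1 - eps, 1 + eps) * A) appearing in the GRPO objective.
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Clipped surrogate for one token; ratio = pi_theta / pi_theta_old."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

print(clipped_surrogate(1.5, 1.0))   # 1.2  (capped at (1 + eps) * A)
print(clipped_surrogate(0.9, 1.0))   # 0.9  (ratio inside the trust region)
print(clipped_surrogate(0.5, -1.0))  # -0.8 (capped at (1 - eps) * A)
```

With a positive advantage the update is capped once the ratio exceeds 1 + ε; with a negative advantage it is capped once the ratio falls below 1 - ε, which is what stabilizes the policy-gradient step.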
<details>
<summary>figures/vis/vis-understanding-math-7.png Details</summary>

### Visual Description
## Diagram: Parallelogram with Dimensions
### Overview
The image depicts a parallelogram with one side labeled with a length and another side labeled with a variable representing height. The diagram appears to be related to a geometry problem, likely involving area or perimeter calculations.
### Components/Axes
The diagram consists of:
* A parallelogram shape filled with a light blue color.
* A horizontal base labeled "8 m".
* A vertical line representing the height, labeled "x m".
* A right angle symbol indicating the height is perpendicular to the base.
### Detailed Analysis or Content Details
The diagram shows a parallelogram with a base of 8 meters. The height of the parallelogram is represented by the variable 'x', and is measured in meters ("x m"). The height is shown as a dashed pink line extending from the top-right corner of the parallelogram, perpendicular to the base. The right angle symbol is located at the bottom-right corner of the height line, where it intersects the base.
### Key Observations
The diagram provides the length of one side (the base) and the variable representing the height. No other numerical values are provided. The diagram is a visual representation of a geometric shape with specific dimensions.
### Interpretation
The diagram likely represents a problem where the area or perimeter of the parallelogram needs to be calculated. The variable 'x' represents the unknown height, and its value would be needed to determine the area (Area = base * height) or to calculate the perimeter if the length of the other side is known. The diagram is a simplified representation of a real-world shape, used to illustrate a geometric concept. The use of 'x' suggests this is a problem requiring algebraic manipulation to solve for the height.
</details>
Prompt: Find $x$ . $A=148$ $m^{2}$ . Please think step by step. Response: To find the value of $x$ in the parallelogram, we can use the formula for the area of a parallelogram: $A=\text{base}\times\text{height}$ Given: - Area $A=148\,\text{m}^{2}$ - Base $=8\,\text{m}$ - Height $=x\,\text{m}$ Substitute the given values into the formula: $148=8\times x$ Solve for $x$ : $x=\frac{148}{8}$ $x=18.5\,\text{m}$ So, $x=18.5\,\text{m}$ .
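The worked solution above reduces to a single division; a quick illustrative check (not part of any evaluation code) confirms the arithmetic:

```python
# Area of a parallelogram: A = base * height, so the height x = A / base.
A = 148.0     # m^2, given in the prompt
base = 8.0    # m, from the figure
x = A / base
print(x)      # 18.5
```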
<details>
<summary>figures/vis/vis-understanding-math-10.png Details</summary>

### Visual Description
## Data Table: Weight of Winning Dog at a Dog Show
### Overview
The image presents a data table showing the weight (in kilograms) of the winning dog at a dog show for the years 2013 through 2017. The table has two columns: "Year" and "Weight (kg)".
### Components/Axes
* **Title:** "Weight of winning dog at a dog show" - positioned at the top-center.
* **Column 1 Label:** "Year" - positioned at the top-left of the data area.
* **Column 2 Label:** "Weight (kg)" - positioned at the top-right of the data area.
* **Data Rows:** Representing the years 2013, 2014, 2015, 2016, and 2017.
### Detailed Analysis or Content Details
The table contains the following data points:
* **2013:** Weight = 27 kg
* **2014:** Weight = 28 kg
* **2015:** Weight = 34 kg
* **2016:** Weight = 29 kg
* **2017:** Weight = 40 kg
The weight values generally increase over the years, though not monotonically: the weight dips from 34 kg in 2015 to 29 kg in 2016 before the notable jump to 40 kg in 2017.
### Key Observations
* The lowest weight recorded is 27 kg in 2013.
* The highest weight recorded is 40 kg in 2017.
* The weight trends upward over the five-year period, although not strictly: it decreased from 34 kg in 2015 to 29 kg in 2016.
* The largest year-to-year increase in weight is between 2016 (29 kg) and 2017 (40 kg), a difference of 11 kg.
### Interpretation
The data suggests a trend of increasing weight in the winning dogs at the dog show between 2013 and 2017. This could be due to several factors, such as:
* **Breeding practices:** Breeders may be selectively breeding for larger dogs.
* **Diet and nutrition:** Improved dog food and nutritional practices could lead to larger dogs.
* **Changing breed standards:** The standards for judging dog shows may have shifted to favor larger dogs.
* **Random variation:** It's possible the observed trend is simply due to chance, especially with a small sample size (only 5 years of data).
The significant increase in weight from 2016 to 2017 is a notable outlier and warrants further investigation. It would be helpful to have data from more years to determine if this is a temporary fluctuation or a continuation of a more rapid increase in weight. Without additional context about the breeds of dogs competing, it's difficult to draw definitive conclusions.
</details>
Prompt: A dog show enthusiast recorded the weight of the winning dog at recent dog shows. According to the table, what was the rate of change between 2016 and 2017? (Unit: kilograms per year) Response: To determine the rate of change between 2016 and 2017, we use the formula for rate of change: $\text{Rate of Change}=\frac{\text{Weight in 2017}-\text{Weight in 2016}}{\text{Year 2017}-\text{Year 2016}}$ From the table: - Weight in 2016 = 29 kg - Weight in 2017 = 40 kg Substitute these values into the formula: $\text{Rate of Change}=\frac{40\,\text{kg}-29\,\text{kg}}{2017-2016}\\ =\frac{11\,\text{kg}}{1\,\text{year}}$ $\text{Rate of Change}=11\,\text{kg/year}$ Thus, the rate of change between 2016 and 2017 is $\boxed{11\,\text{kg/year}}$ .
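The year-over-year analysis above can be reproduced with a short illustrative Python snippet (the `weights` dictionary simply re-enters the table's values):

```python
# Winning dog's weight by year, transcribed from the table.
weights = {2013: 27, 2014: 28, 2015: 34, 2016: 29, 2017: 40}  # kg

# Year-over-year change in weight for each consecutive pair of years.
years = sorted(weights)
changes = {y: weights[y] - weights[y - 1] for y in years[1:]}
print(changes)  # {2014: 1, 2015: 6, 2016: -5, 2017: 11}

# Rate of change between 2016 and 2017, in kg per year.
rate_2016_2017 = (weights[2017] - weights[2016]) / (2017 - 2016)
print(rate_2016_2017)  # 11.0
```

The per-year differences also make the 2015-to-2016 dip (-5 kg) explicit, alongside the 11 kg/year jump computed in the response.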