## Scaling Laws for Native Multimodal Models
Mustafa Shukor 2
Enrico Fini 1
Victor Guilherme Turrisi da Costa 1
Matthieu Cord 2
Joshua Susskind 1
Alaaeldin El-Nouby 1
1 Apple
2 Sorbonne University
## Abstract
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders or tokenizers. On the contrary, early fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
## 1. Introduction
Multimodality provides a rich signal for perceiving and understanding the world. Advances in vision [23, 52, 55, 80] and language models [3, 19, 67] have enabled the development of powerful multimodal models that understand language, images, and audio. A common approach involves grafting separately pre-trained unimodal models, such as connecting a vision encoder to the input layer of an LLM [6, 9, 35, 43, 62, 64, 73, 78].
Although this seems like a convenient approach, it remains an open question whether such late-fusion strategies are inherently optimal for understanding multimodal signals. Moreover, with abundant multimodal data available, initializing from unimodal pre-training is potentially detrimental, as it may introduce biases that prevent the model
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Chart: Validation Loss vs. FLOPs & N/D vs. FLOPs
### Overview
The image presents two charts, stacked vertically. The top chart displays Validation Loss against FLOPs (floating-point operations) for three model variants: Early fusion, Late fusion, and MoE (Mixture of Experts). The bottom chart shows the ratio of N (number of parameters) to D (number of training tokens) against FLOPs for the same three variants. Both charts use a logarithmic scale for the x-axis (FLOPs).
### Components/Axes
**Top Chart:**
* **Y-axis:** Validation Loss (scale from approximately 1.5 to 4)
* **X-axis:** FLOPs (logarithmic scale, from approximately 10<sup>18</sup> to 10<sup>24</sup>)
* **Legend:**
* Early: (dashed orange line) L ∝ C<sup>-0.0492</sup>
* Late: (dashed blue line) L ∝ C<sup>-0.0494</sup>
* MoE: (dashed green line) L ∝ C<sup>-0.0474</sup>
* **Inset Box:** A zoomed-in view of the initial portion of the top chart, highlighting the early stages of loss reduction.
**Bottom Chart:**
* **Y-axis:** N/D (ratio of parameters to dataset size, scale from approximately -0.02 to 4)
* **X-axis:** FLOPs (logarithmic scale, from approximately 10<sup>18</sup> to 10<sup>24</sup>)
* **Legend:**
* Early: (dashed orange line) N/D ∝ C<sup>0.053</sup>
* Late: (dashed blue line) N/D ∝ C<sup>0.076</sup>
* MoE: (dashed green line) N/D ∝ C<sup>-0.312</sup>
### Detailed Analysis or Content Details
**Top Chart (Validation Loss vs. FLOPs):**
* **Early (Orange):** The line starts at approximately 3.8 at 10<sup>18</sup> FLOPs and decreases rapidly to around 2.2 at 10<sup>22</sup> FLOPs, then continues to decrease more slowly to approximately 1.8 at 10<sup>24</sup> FLOPs.
* **Late (Blue):** The line begins at approximately 3.5 at 10<sup>18</sup> FLOPs, decreases to around 2.1 at 10<sup>22</sup> FLOPs, and continues to decrease to approximately 1.7 at 10<sup>24</sup> FLOPs.
* **MoE (Green):** The line starts at approximately 3.6 at 10<sup>18</sup> FLOPs, decreases to around 2.3 at 10<sup>22</sup> FLOPs, and continues to decrease to approximately 1.6 at 10<sup>24</sup> FLOPs.
* All three lines exhibit a downward trend, indicating that validation loss decreases as FLOPs increase. The rate of decrease appears to slow down as FLOPs increase.
**Bottom Chart (N/D vs. FLOPs):**
* **Early (Orange):** The line starts at approximately 1.5 at 10<sup>18</sup> FLOPs, decreases to around 0.8 at 10<sup>22</sup> FLOPs, and then increases to approximately 2.5 at 10<sup>24</sup> FLOPs.
* **Late (Blue):** The line begins at approximately 0.5 at 10<sup>18</sup> FLOPs, decreases to around 0.2 at 10<sup>22</sup> FLOPs, and then increases to approximately 1.5 at 10<sup>24</sup> FLOPs.
* **MoE (Green):** The line starts at approximately 0.1 at 10<sup>18</sup> FLOPs, decreases to around -0.1 at 10<sup>22</sup> FLOPs, and then increases to approximately 0.2 at 10<sup>24</sup> FLOPs.
* The Early and Late lines show a U-shaped curve, decreasing initially and then increasing. The MoE line shows a more pronounced decrease followed by a slight increase.
### Key Observations
* The Validation Loss consistently decreases with increasing FLOPs for all three model variants.
* The MoE model attains the lowest Validation Loss at the largest FLOPs values.
* The N/D ratio shows a more complex relationship with FLOPs: the Early and Late variants exhibit a U-shaped curve, while MoE trends downward (N/D ∝ C<sup>-0.312</sup>).
* The inset box in the top chart highlights the initial rapid decrease in validation loss, suggesting a quick learning phase.
### Interpretation
The charts demonstrate the scaling behavior of validation loss and the parameters-to-tokens ratio (N/D) with increasing compute (FLOPs) for the three model variants (Early fusion, Late fusion, and MoE). The decreasing validation loss with increasing FLOPs indicates that the models improve as more compute is applied. The MoE model attains the lowest validation loss at large compute budgets, suggesting that the Mixture of Experts architecture uses computational resources more effectively.
The N/D ratio provides insights into the model's capacity relative to the dataset size. The U-shaped curve observed in the Early and Late stages suggests that initially, increasing model capacity (N) leads to better performance, but beyond a certain point, it can lead to overfitting or diminishing returns. The MoE model's different behavior (initial decrease followed by a slight increase) suggests that it may have a different capacity scaling behavior, potentially due to its ability to selectively activate different experts.
The power-law relationships (L ∝ C<sup>-x</sup> and N/D ∝ C<sup>x</sup>) indicate that the validation loss and N/D ratio scale predictably with compute (C). The different exponents (x values) for each model variant suggest that the scaling behavior depends on the architecture. The negative exponent for the MoE model's N/D ratio indicates that increasing compute *decreases* the optimal N/D ratio, i.e., sparse models favor scaling training tokens over active parameters.
</details>
Figure 1. Scaling properties of Native Multimodal Models. Based on the scaling laws study in § 3.1, we observe: (1) early- and late-fusion models provide similar validation loss L when trained with the same compute budget C (FLOPs); (2) this performance is achieved via a different trade-off between the number of parameters N and the number of training tokens D, where early-fusion models require fewer parameters; (3) sparse early-fusion models achieve lower loss and require more training tokens for a given FLOP budget.
from fully leveraging cross-modality co-dependencies. An additional challenge is scaling such systems: each component (e.g., vision encoder, LLM) has its own set of hyperparameters, pre-training data mixtures, and scaling properties with respect to the amount of data and compute applied. A more flexible architecture might allow the model to dynamically allocate its capacity across modalities, simplifying scaling efforts.
In this work, we focus on the scaling properties of native multimodal models trained from the ground up on multimodal data. We first investigate whether the commonly adopted late-fusion architectures hold an intrinsic advantage by comparing them to early-fusion models, which process raw multimodal inputs without relying on dedicated vision encoders. We conduct scaling experiments on early and late fusion architectures, deriving scaling laws to pre-
dict their performance and compute-optimal configurations. Our findings indicate that late fusion offers no inherent advantage when trained from scratch. Instead, early-fusion models are more efficient and are easier to scale. Furthermore, we observe that native multimodal models follow scaling laws similar to those of LLMs [26], albeit with slight variations in scaling coefficients across modalities and datasets. Our results suggest that model parameters and training tokens should be scaled roughly equally for optimal performance. Moreover, we find that different multimodal training mixtures exhibit similar overall trends, indicating that our findings are likely to generalize to a broader range of settings.
While our findings favor early fusion, multimodal data is inherently heterogeneous, suggesting that some degree of parameter specialization may still offer benefits. To investigate this, we explore leveraging Mixture of Experts (MoEs) [59], a technique that enables the model to dynamically allocate specialized parameters across modalities in a symmetric and parallel manner, in contrast to late-fusion models, which are asymmetric and process data sequentially. Training native multimodal models with MoEs yields significantly improved performance and faster convergence. Our scaling laws for MoEs suggest that scaling the number of training tokens is more important than scaling the number of active parameters. This unbalanced scaling differs from what is observed for dense models, due to the higher number of total parameters in sparse models. In addition, our analysis reveals that experts tend to specialize in different modalities, with this specialization being particularly prominent in the early and last layers.
## 1.1. Summary of our findings
Our findings can be summarized as follows:
Native Early and Late fusion perform on par: Early-fusion models trained from scratch perform on par with their late-fusion counterparts, with a slight advantage for early-fusion models at low compute budgets (Figure 3). Furthermore, our scaling laws study indicates that the compute-optimal models for early and late fusion perform similarly as the compute budget increases (Figure 1, top).
NMMs scale similarly to LLMs: The scaling laws of native multimodal models follow similar laws as text-only LLMs with slightly varying scaling exponents depending on the target data type and training mixture (Table 2).
Late fusion requires more parameters: Compute-optimal late-fusion models require a higher parameters-to-data ratio compared to early fusion (Figure 1, bottom).
Sparsity significantly benefits early-fusion NMMs: Sparse NMMs exhibit significant improvements over their dense counterparts at the same inference cost (Figure 10). Furthermore, they implicitly learn modality-specific weights when trained with sparsity (Figure 12). In
Table 1. Definitions of the expressions used throughout the paper.
| Expression | Definition |
|--------------|-------------------------------------------------------------------------------------------------------------------|
| N | Number of parameters in the multimodal decoder. For MoEs this refers to the active parameters only. |
| D | Total number of multimodal tokens. |
| N v | Number of parameters in the vision-specific encoder. Only exists in late-fusion architectures. |
| D v | Number of vision-only tokens. |
| C | Total number of FLOPs, estimated as C = 6ND for early fusion and C = 6(NvDv + ND) for late fusion. |
| L | Validation loss, measured as the average over interleaved image-text, image-caption, and text-only data mixtures. |
addition, compute-optimal models rely more on scaling the number of training tokens than on the number of active parameters as the compute budget grows (Figure 1, bottom).
Modality-agnostic routing beats Modality-aware routing for Sparse NMMs: Training sparse mixture of experts with modality-agnostic routing consistently outperforms models with modality-aware routing (Figure 11).
## 2. Preliminaries
## 2.1. Definitions
Native Multimodal Models (NMMs): Models that are trained from scratch on all modalities simultaneously without relying on pre-trained LLMs or vision encoders. Our focus is on the representative image and text modalities, where the model processes both text and images as input and generates text as output.
Early fusion: Enabling multimodal interaction from the beginning, using almost no modality-specific parameters (e.g., only a linear layer to patchify images). Using a single transformer model, this approach processes raw multimodal inputs (tokenized text and continuous image patches) with no image discretization. In this paper, we refer to the main transformer as the decoder.
Late fusion: Delaying multimodal interaction to deeper layers, typically after separate unimodal components have processed each modality independently (e.g., a vision encoder connected to a decoder).
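To make the contrast concrete, here is a minimal NumPy sketch of the early-fusion input path: images are split into patches and linearly projected into the decoder's token space. The 224×224 resolution and 14×14 patch size match the experimental setup described later; the 1024-dimensional projection is an illustrative choice, not the paper's exact configuration.

```python
import numpy as np

def patchify(image, patch=14):
    """Split an HxWxC image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))          # one RGB image
patches = patchify(img)                           # (256, 588) patch vectors
W_proj = rng.standard_normal((588, 1024)) * 0.02  # hypothetical model dim 1024
image_tokens = patches @ W_proj                   # (256, 1024) continuous tokens
# These tokens are concatenated with text-token embeddings and fed to a
# single shared decoder; no separate vision encoder is involved.
```

In the late-fusion counterpart, `image_tokens` would instead be produced by a full vision encoder before reaching the decoder.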
Modality-agnostic routing: In sparse mixture-of-experts, modality-agnostic routing refers to relying on a learned router module that is trained jointly with the model.
Modality-aware routing: Routing based on pre-defined rules, such as routing by modality type (e.g., vision tokens vs. text tokens).
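A toy sketch of the two routing schemes; the dimensions, expert count, and the fixed text/image expert assignment are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4
tokens = rng.standard_normal((10, d_model))
modality = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])  # 0 = text, 1 = image

# Modality-agnostic: a learned linear router scores all experts per token,
# and the top-scoring expert is selected regardless of modality.
W_router = rng.standard_normal((d_model, n_experts)) * 0.1
agnostic_choice = (tokens @ W_router).argmax(axis=-1)

# Modality-aware: a fixed rule routes by token type, e.g. text tokens to
# expert 0 and image tokens to expert 1 (a hypothetical assignment).
aware_choice = np.where(modality == 0, 0, 1)
```

In the agnostic case the router weights are trained jointly with the model, so any modality specialization that emerges is learned rather than imposed.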
## 2.2. Scaling Laws
We aim to understand the scaling properties of NMMs and how different architectural choices influence trade-offs. To this end, we analyze our models within the scaling laws framework proposed by Hoffmann et al. [26] and Kaplan et al. [31]. We compute FLOPs based on the total number of parameters, using the approximation C = 6ND, as adopted in prior work [2, 26]. However, we modify this estimation to suit our setup: for late-fusion models, FLOPs are computed
Figure 2. Scaling laws for early-fusion and late-fusion native multimodal models. Each point represents a model (300M to 3B parameters) trained on varying number of tokens (250M to 400B). We report the average cross-entropy loss on the validation sets of interleaved (Obelics), Image-caption (HQITP), and text-only data (DCLM).
<details>
<summary>Image 2 Details</summary>

### Visual Description
## 3D Scatter Plot: Validation Loss vs. Tokens Seen & Parameters
### Overview
The image presents two 3D scatter plots visualizing the relationship between Validation Loss, Tokens Seen, and Parameters. Each plot is colored according to FLOPs (floating-point operations). The plots are side-by-side for comparison. Each plot also includes a formula relating Validation Loss (L) to Parameters (N) and Tokens Seen (D).
### Components/Axes
* **X-axis:** Tokens Seen (logarithmic scale).
* **Y-axis:** Parameters (logarithmic scale).
* **Z-axis:** Validation Loss (ranging from approximately 2.4 to 3.1).
* **Color Scale (Legend):** FLOPs, ranging from 1e+20 (light color) to 1e+22 (dark color).
* Left Plot: Red to Orange gradient.
* Right Plot: Blue to Cyan gradient.
* **Formula (Top of each plot):**
* Left Plot: L(N, D) ∝ N<sup>-0.30051</sup> + D<sup>-0.33491</sup>
* Right Plot: L(N, D) ∝ N<sup>-0.2777</sup> + D<sup>-0.34299</sup>
* **Data Points:** Scatter points representing individual data instances.
### Detailed Analysis or Content Details
**Left Plot (Red/Orange):**
* The data points generally cluster in the region where Tokens Seen are between 200B and 800B, and Parameters are between 200B and 800B.
* The Validation Loss values are concentrated between approximately 2.6 and 3.0.
* The color gradient indicates that lower Validation Loss values (around 2.4-2.6) correspond to darker orange/red colors, suggesting higher FLOPs.
* The data points show a general trend of decreasing Validation Loss as both Tokens Seen and Parameters increase, but with significant scatter.
* Approximate data points (Validation Loss, Tokens Seen, Parameters):
* (2.45, 200B, 200B) - Dark Orange
* (2.55, 500B, 500B) - Orange
* (2.7, 800B, 800B) - Light Orange
* (2.9, 1B, 1B) - Light Orange
* (3.0, 0.3B, 0.3B) - Light Orange
**Right Plot (Blue/Cyan):**
* The data points are similarly clustered in the region where Tokens Seen are between 200B and 800B, and Parameters are between 200B and 800B.
* The Validation Loss values are concentrated between approximately 2.6 and 3.0.
* The color gradient indicates that lower Validation Loss values (around 2.4-2.6) correspond to darker blue/cyan colors, suggesting higher FLOPs.
* The data points show a general trend of decreasing Validation Loss as both Tokens Seen and Parameters increase, but with significant scatter.
* Approximate data points (Validation Loss, Tokens Seen, Parameters):
* (2.4, 200B, 200B) - Dark Cyan
* (2.5, 500B, 500B) - Cyan
* (2.75, 800B, 800B) - Light Cyan
* (2.95, 1B, 1B) - Light Cyan
* (3.05, 0.3B, 0.3B) - Light Cyan
### Key Observations
* Both plots exhibit a similar overall distribution of data points.
* The right plot (blue/cyan) generally shows slightly lower Validation Loss values compared to the left plot (red/orange) for similar values of Tokens Seen and Parameters.
* The FLOPs color scale suggests a correlation between lower Validation Loss and higher FLOPs in both plots.
* The formulas at the top of each plot indicate a power-law relationship between Validation Loss and both Tokens Seen and Parameters, with different exponents for each plot.
### Interpretation
The plots demonstrate the impact of model size (Parameters) and training data (Tokens Seen) on Validation Loss. The decreasing trend of Validation Loss with increasing Parameters and Tokens Seen suggests that larger models trained on more data generally perform better. The FLOPs color scale indicates that achieving lower Validation Loss often requires more computational resources.
The different exponents in the formulas for the two plots suggest that the relationship between Validation Loss and model size/training data may vary depending on the specific model or dataset. The right plot's formula suggests a slightly weaker dependence on Tokens Seen but a similar dependence on Parameters compared to the left plot.
The scatter in the data points indicates that other factors besides model size and training data also influence Validation Loss. These factors could include model architecture, optimization algorithms, and data quality. The plots provide a valuable visual representation of the trade-offs between model performance, computational cost, and data requirements in machine learning.
</details>
as C = 6(NvDv + ND). We consider a setup where, given a compute budget C, our goal is to predict the model's final performance, as well as to determine the optimal number of parameters or training tokens. Consistent with prior studies on LLM scaling [26], we assume a power-law relationship between the final model loss and both model size (N) and training tokens (D):
L(N, D) = E + A / N^α + B / D^β    (1)
Here, E represents the lowest achievable loss on the dataset, while the term A / N^α captures the effect of increasing the number of parameters: a larger model leads to lower loss, with the rate of improvement governed by α. Similarly, B / D^β accounts for the benefit of a larger number of tokens, with β determining the rate of improvement. Additionally, we assume a linear relationship between the compute budget (FLOPs) and both N and D (C ∝ ND). This leads to the further power-law relationships detailed in Appendix C.7.
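The loss model and the two FLOPs approximations can be written directly. The early-fusion AVG values E = 1.904, α = 0.301, β = 0.335 below come from Table 2; the prefactors A and B are illustrative stand-ins, since the paper does not report them.

```python
def flops_early(N, D):
    """C = 6ND for early-fusion models (all compute in the decoder)."""
    return 6 * N * D

def flops_late(N, D, N_v, D_v):
    """C = 6(NvDv + ND) for late-fusion models with a vision encoder."""
    return 6 * (N_v * D_v + N * D)

def scaling_loss(N, D, E=1.904, A=20.0, alpha=0.301, B=25.0, beta=0.335):
    """Eq 1: L(N, D) = E + A / N^alpha + B / D^beta.
    Decreases monotonically in both N and D, approaching E from above."""
    return E + A / N**alpha + B / D**beta
```

For a late-fusion model, the vision-encoder term NvDv adds to the compute bill at a fixed decoder budget, which is one reason the two architectures trade off parameters and tokens differently.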
## 2.3. Experimental setup
Our models are based on the autoregressive transformer architecture [71] with SwiGLU FFNs [58] and QK-Norm [17], following Li et al. [39]. In early-fusion models, image patches are linearly projected to match the text token dimension, while late fusion follows the CLIP architecture [55]. We adopt causal attention for text tokens and bidirectional attention for image tokens, which we found to work better. Training is conducted on a mixture of public and private multimodal datasets, including DCLM [39], Obelics [34], DFN [21], COYO [11], and a private collection of High-Quality Image-Text Pairs (HQITP). Images are resized to 224×224 resolution with a 14×14 patch size. We use a context length of 1k for multimodal sequences. For training efficiency, we train our models with bfloat16, Fully Sharded Data Parallel (FSDP) [82], activation checkpointing, and gradient accumulation. We also use se-
Table 2. Scaling laws for native multimodal models. We report the scaling laws results for early- and late-fusion models, fitting L = E + A/N^α + B/D^β for different target data types as well as their average loss (AVG). The derived exponents satisfy N ∝ C^a, D ∝ C^b, L ∝ C^c, and D ∝ N^d.
| Model | Data | E | α | β | a | b | c | d |
|---|---|---|---|---|---|---|---|---|
| GPT3 [10] | Text | - | - | - | - | - | -0.048 | - |
| Chinchilla [26] | Text | 1.693 | 0.339 | 0.285 | 0.46 | 0.54 | - | - |
| NMM (early-fusion) | Text | 2.222 | 0.3084 | 0.3375 | 0.5246 | 0.4774 | -0.0420 | 0.9085 |
| NMM (early-fusion) | Image-Caption | 1.569 | 0.3111 | 0.3386 | 0.5203 | 0.4785 | -0.0610 | 0.9187 |
| NMM (early-fusion) | Interleaved | 1.966 | 0.2971 | 0.338 | 0.5315 | 0.4680 | -0.0459 | 0.8791 |
| NMM (early-fusion) | AVG | 1.904 | 0.301 | 0.335 | 0.5262 | 0.473 | -0.0492 | 0.8987 |
| NMM (late-fusion) | AVG | 1.891 | 0.2903 | 0.3383 | 0.6358 | 0.4619 | -0.0494 | 0.6732 |
| Sparse NMM (early-fusion) | AVG | 2.158 | 0.710 | 0.372 | 0.361 | 0.656 | -0.047 | 1.797 |
quence packing for the image-captioning dataset to reduce the number of padded tokens. Similar to previous works [2, 5, 26], we evaluate performance on held-out subsets of interleaved (Obelics), image-caption (HQITP), and text-only (DCLM) data. Further implementation details are provided in Appendix A.
## 3. Scaling native multimodal models
In this section, we present a scaling laws study of native multimodal models: examining various architectural choices (§ 3.1), exploring different data mixtures (§ 3.2), and analyzing the practical trade-offs between late- and early-fusion NMMs as well as comparing native pre-training with continual pre-training of NMMs (§ 3.3).
Setup. We train models ranging from 0.3B to 4B active parameters, scaling the width while keeping the depth constant. For smaller training-token budgets, we reduce the warm-up phase to 1K steps, while maintaining 5K steps for larger budgets. Following Hägele et al. [25], models are trained with a constant learning rate, followed by a cool-down phase using an inverse-square-root scheduler. The cool-down phase spans 20% of the total steps spent at the constant learning rate. To estimate the scaling coefficients in Eq 1, we apply the L-BFGS algorithm [51] and Huber loss [28] (with δ = 10⁻³), performing a grid search over initialization ranges.
## 3.1. Scaling laws of NMMs
Scaling laws for early-fusion and late-fusion models. Figure 2 (left) presents the final loss averaged across interleaved, image-caption, and text datasets for early-fusion NMMs. The lowest-loss frontier follows a power law as a function of FLOPs. Fitting the power law yields the expression L ∝ C^(-0.049), indicating the rate of improvement with increasing compute. When analyzing the scaling laws per data type (e.g., image-caption, interleaved, text), we observe that the exponent varies (Table 2). For instance, the model achieves a higher rate of improvement for image-
Figure 3. Early vs late fusion: scaling training FLOPs. We compare early and late fusion models when scaling both the number of model parameters and the number of training tokens. Overall, early fusion shows a slight advantage, especially at smaller model sizes, and the gap decreases when scaling the number of parameters N .
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Charts: Validation Loss vs. FLOPS for Different Model Configurations
### Overview
The image presents three separate charts, each displaying validation loss as a function of FLOPS (Floating Point Operations). The charts compare the performance of different model configurations (Late-289M, Late-494M, Late-1B, Late-2.4B, Early-275M, Early-464M, Early-932M, Early-2.28B) across three different training scenarios: Image-Caption, Interleaved, and Text-only. All charts share the same x-axis (FLOPS) and y-axis (Validation Loss), but vary in the specific data displayed.
### Components/Axes
* **X-axis:** FLOPS, ranging from 10<sup>20</sup> to 10<sup>22</sup> (logarithmic scale).
* **Y-axis:** Validation Loss, ranging from approximately 2.2 to 3.4.
* **Legend:** Located at the bottom of the image, containing the following model configurations and their corresponding colors:
* Late-289M (Light Blue)
* Late-494M (Turquoise)
* Late-1B (Blue)
* Late-2.4B (Dark Blue)
* Early-275M (Orange)
* Early-464M (Brown)
* Early-932M (Dark Brown)
* Early-2.28B (Dark Orange)
* **Chart Titles:**
* Image-Caption (Top-Left)
* Interleaved (Top-Center)
* Text-only (Top-Right)
### Detailed Analysis or Content Details
**Image-Caption Chart (Left)**
* **Late-289M (Light Blue):** Starts at approximately 3.05, decreases to approximately 2.35.
* **Late-494M (Turquoise):** Starts at approximately 3.0, decreases to approximately 2.3.
* **Late-1B (Blue):** Starts at approximately 3.0, decreases to approximately 2.25.
* **Late-2.4B (Dark Blue):** Starts at approximately 3.0, decreases to approximately 2.2.
* **Early-275M (Orange):** Starts at approximately 3.1, decreases to approximately 2.4.
* **Early-464M (Brown):** Starts at approximately 3.05, decreases to approximately 2.3.
* **Early-932M (Dark Brown):** Starts at approximately 3.0, decreases to approximately 2.25.
* **Early-2.28B (Dark Orange):** Starts at approximately 3.0, decreases to approximately 2.2.
**Interleaved Chart (Center)**
* **Late-289M (Light Blue):** Starts at approximately 3.05, decreases to approximately 2.35.
* **Late-494M (Turquoise):** Starts at approximately 3.0, decreases to approximately 2.3.
* **Late-1B (Blue):** Starts at approximately 3.0, decreases to approximately 2.25.
* **Late-2.4B (Dark Blue):** Starts at approximately 3.0, decreases to approximately 2.2.
* **Early-275M (Orange):** Starts at approximately 3.1, decreases to approximately 2.4.
* **Early-464M (Brown):** Starts at approximately 3.05, decreases to approximately 2.3.
* **Early-932M (Dark Brown):** Starts at approximately 3.0, decreases to approximately 2.25.
* **Early-2.28B (Dark Orange):** Starts at approximately 3.0, decreases to approximately 2.2.
**Text-only Chart (Right)**
* **Late-289M (Light Blue):** Starts at approximately 3.3, decreases to approximately 2.6.
* **Late-494M (Turquoise):** Starts at approximately 3.3, decreases to approximately 2.6.
* **Late-1B (Blue):** Starts at approximately 3.25, decreases to approximately 2.55.
* **Late-2.4B (Dark Blue):** Starts at approximately 3.25, decreases to approximately 2.5.
* **Early-275M (Orange):** Starts at approximately 3.35, decreases to approximately 2.7.
* **Early-464M (Brown):** Starts at approximately 3.3, decreases to approximately 2.6.
* **Early-932M (Dark Brown):** Starts at approximately 3.25, decreases to approximately 2.55.
* **Early-2.28B (Dark Orange):** Starts at approximately 3.25, decreases to approximately 2.5.
In all three charts, the lines generally slope downwards, indicating that validation loss decreases as FLOPS increase. Larger models (higher parameter counts) generally achieve lower validation loss for a given FLOPS value.
### Key Observations
* The "Late" and "Early" models reach broadly comparable validation loss across the three training scenarios, with small gaps at matched FLOPS.
* The performance gap between the models tends to narrow as FLOPS increase.
* The "Text-only" scenario generally exhibits higher validation loss compared to the "Image-Caption" and "Interleaved" scenarios.
* The largest models (Late-2.4B and Early-2.28B) achieve the lowest validation loss, but the improvement diminishes with increasing FLOPS.
### Interpretation
The charts demonstrate the relationship between model size (parameter count, represented by the model names), computational cost (FLOPS), and model performance (validation loss) for different training paradigms. The "Late" and "Early" families reach comparable loss at matched FLOPS, with differences narrowing as compute grows. The decreasing validation loss with increasing FLOPS indicates that more computation generally leads to better model performance, but with diminishing returns. The higher validation loss in the "Text-only" scenario suggests that the models may require more computational resources to achieve comparable performance on text-based tasks than on tasks involving image-caption or interleaved data. The charts provide insights for balancing model size and training strategy against computational cost, and the diminishing returns observed at higher FLOPS values suggest a point beyond which additional compute yields little further improvement.
</details>
caption data (L ∝ C^(-0.061)) than for interleaved documents (L ∝ C^(-0.046)).
To model the loss as a function of the number of training tokens D and model parameters N, we fit the parametric function in Eq 1, obtaining scaling exponents α = 0.301 and β = 0.335. These describe the rates of improvement when scaling the number of model parameters and training tokens, respectively. Assuming a linear relationship between compute, N, and D (i.e., C ∝ ND), we derive the law relating model parameters to the compute budget (see Appendix C for details). Specifically, for a given compute budget C, we compute the corresponding model size N at logarithmically spaced D values and determine N_opt, the parameter count that minimizes the loss. Repeating this across different FLOPs values produces a dataset of (C, N_opt) pairs, to which we fit a power law predicting the compute-optimal model size as a function of compute: N* ∝ C^0.526.
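This derivation can be reproduced numerically; with the fitted exponents, the slope should approach the analytic value β/(α + β) ≈ 0.527. The E, α, β values below are the early-fusion AVG entries from Table 2, while the prefactors A and B are illustrative stand-ins (they shift the intercept of the fit, not its slope).

```python
import numpy as np

# Table 2 early-fusion (AVG) exponents; A and B are illustrative stand-ins
# since the paper does not report the fitted prefactors.
E, A, alpha, B, beta = 1.904, 20.0, 0.301, 25.0, 0.335

def loss(N, D):
    return E + A / N**alpha + B / D**beta

Cs = np.logspace(19, 23, 20)              # compute budgets (FLOPs)
N_opt = []
for C in Cs:
    D = np.logspace(8, 13, 4000)          # candidate token counts
    N = C / (6 * D)                       # C = 6ND fixes N given D
    N_opt.append(N[np.argmin(loss(N, D))])

# Slope of log N_opt vs log C gives the exponent in N* ∝ C^slope;
# analytically it equals beta / (alpha + beta) ≈ 0.527.
slope, _ = np.polyfit(np.log(Cs), np.log(N_opt), 1)
```

The same loop with the roles of N and D swapped yields the compute-optimal token count, whose exponent is α/(α + β).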
Similarly, we fit power laws to estimate the computeoptimal training dataset size as a function of compute and model size:
D* ∝ C^0.473,    D* ∝ N^0.899
These relationships allow practitioners to determine the optimal model and dataset size for a fixed compute budget. When analyzing by data type, we find that interleaved data benefits more from larger models (a = 0.532) than image-caption data (a = 0.520), whereas the opposite trend holds for training tokens.
We conduct a similar study on late-fusion models in Figure 2 (right) and observe comparable scaling behaviors. In particular, the loss scaling exponent (c = -0.0494) is nearly identical to that of early fusion (c = -0.0492). This trend is evident in Figure 3, where early fusion outperforms late fusion at smaller model scales, while both architectures converge to similar performance at larger model sizes. We also observe similar trends when varying late-fusion con-
Figure 4. Early vs late: pretraining efficiency. Early-fusion is faster to train and consumes less memory. Models are trained on 16 H100 GPUs for 160k steps (300B tokens).
<details>
<summary>Image 4 Details</summary>

### Visual Description
Two bar charts comparing early- and late-fusion models at compute budgets of 2, 4, and 6 ×10^21 FLOPs, with values plotted relative to a common baseline (dashed line at 0). Left: relative memory usage (GB per GPU); early fusion shows larger reductions (roughly 2.5 to 6.5 GB) than late fusion (roughly 0.5 to 1 GB) at every budget. Right: relative train time (hours); early fusion's savings grow from roughly 25 to 100 hours as FLOPs increase, versus roughly 15 to 80 hours for late fusion. Consistent with the caption, early fusion trains faster and uses less memory, with the advantage widening at larger compute budgets.
</details>
Scaling laws of NMMs vs LLMs. Upon comparing the scaling law coefficients of our NMMs to those reported for text-only LLMs (e.g., GPT-3, Chinchilla), we find them to be within similar ranges. In particular, for predicting the loss as a function of compute, GPT-3 [10] follows L ∝ C^-0.048, while our models follow L ∝ C^-0.049, suggesting that the performance of NMMs adheres to similar scaling laws as LLMs. Similarly, our estimates of the α and β parameters in Eq 1 (α = 0.301, β = 0.335) closely match those reported by Hoffmann et al. [26] (α = 0.339, β = 0.285). Likewise, our computed values of a = 0.526 and b = 0.473 align closely with a = 0.46 and b = 0.54 from [26], reinforcing the idea that, for native multimodal models, the number of training tokens and model parameters should be scaled proportionally. However, since the gap between a and b is smaller than in LLMs, this principle holds even more strongly for NMMs. Additionally, as a = 0.526 is greater than b = 0.473 in our case, the compute-optimal model size for NMMs is larger than that of LLMs, while the optimal number of training tokens is lower, given a fixed compute budget.
Figure 5. Scaling laws with different training mixtures. Early-fusion models follow similar scaling trends when changing the pretraining mixture. However, increasing the image-caption proportion leads to a scaling exponent of larger magnitude (see Table 3).
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Charts: Validation Loss vs. FLOPs per Training Mixture
Four log-log plots of validation loss (≈2.4-4.0) against training FLOPs (10^19 to 10^22), one per Caption-Interleaved-Text mixture: 45-45-10, 40-20-40, 30-30-40, and 20-40-40. Each plot contains one curve per model size (0.275B, 0.464B, 0.932B, 1.627B, 2.280B, and 3.354B parameters); loss decreases smoothly with compute for every size, and larger models reach lower loss at matched FLOPs. The fitted compute frontiers are L = 29.574 C^-0.0492 (45-45-10), L = 28.590 C^-0.0486 (40-20-40), L = 25.623 C^-0.0483 (30-30-40), and L = 29.002 C^-0.0488 (20-40-40).
</details>
Table 3. Scaling laws for different training mixtures. Early-fusion models. C-I-T refers to the image-caption, interleaved, and text-only proportions.
| | C-I-T (%) | I/T ratio | E | α | β | a | b | d | c |
|----|-------------|-------------|-------|-------|-------|-------|-------|-------|---------|
| 1 | 45-45-10 | 1.19 | 1.906 | 0.301 | 0.335 | 0.527 | 0.474 | 0.901 | -0.0492 |
| 2 | 40-20-40 | 0.65 | 1.965 | 0.328 | 0.348 | 0.518 | 0.486 | 0.937 | -0.0486 |
| 3 | 30-30-40 | 0.59 | 1.847 | 0.253 | 0.338 | 0.572 | 0.428 | 0.748 | -0.0463 |
| 4 | 20-40-40 | 0.49 | 1.836 | 0.259 | 0.354 | 0.582 | 0.423 | 0.726 | -0.0488 |
Compute-optimal trade-offs for early vs. late fusion NMMs. While late- and early-fusion models reduce loss at similar rates with increasing FLOPs, we observe distinct trade-offs in their compute-optimal models. Specifically, N_opt is larger for late-fusion models, whereas D_opt is larger for early-fusion models. This indicates that, given a fixed compute budget, late-fusion models require a higher number of parameters, while early-fusion models benefit more from a higher number of training tokens. This trend is also reflected in the lower ratio N_opt/D_opt ∝ C^0.053 for early fusion compared to N_opt/D_opt ∝ C^0.076 for late fusion. As shown in Figure 1 (bottom), when scaling FLOPs, the number of parameters of early-fusion models becomes significantly lower, which is crucial for reducing inference costs and, consequently, lowering serving costs after deployment.
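The ratio trade-off follows directly from the fitted laws: with N_opt ∝ C^a and D_opt ∝ C^b, the ratio scales as C^(a-b). A small sanity check using the early-fusion values from Table 3 (mixture 1):

```python
def ratio_exponent(a: float, b: float) -> float:
    """Exponent of C in N_opt / D_opt ∝ C^(a - b)."""
    return a - b

# Early-fusion fit from Table 3 (mixture 1): a = 0.527, b = 0.474.
early = ratio_exponent(0.527, 0.474)
print(f"N_opt/D_opt ∝ C^{early:.3f}")  # matches the reported C^0.053
```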
Early-fusion is more efficient to train. We compare the training efficiency of late- and early-fusion architectures. As shown in Figure 4, early-fusion models consume less memory and train faster under the same compute budget. This advantage becomes even more pronounced as compute increases, highlighting the superior training efficiency of early fusion while maintaining comparable performance to late fusion at scale. Notably, for the same FLOPs, late-fusion models have a higher parameter count and greater effective depth (i.e., additional vision encoder layers alongside decoder layers) compared to early-fusion models.
Figure 7. Early vs late fusion: changing the training mixture. We vary the training mixtures and plot the final training loss. Early-fusion models attain favorable performance when increasing the proportion of interleaved documents and text-only data.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Charts: Cross-Entropy vs. Interleaved/Text Proportion
Two line plots of final cross-entropy (≈2.6-2.9) for early vs. late fusion. Left ("Interleaved"): cross-entropy against the percentage of interleaved documents (30-90%); both curves decrease, and the early-fusion curve drops below the late-fusion one as the interleaved proportion grows (≈2.58 vs. 2.60 at 90%). Right ("Text-only"): cross-entropy against the percentage of text-only data (10-30%); both curves decrease, with the gap between early and late fusion narrowing at higher text ratios. Error bars are shown but small.
</details>
## 3.2. Scaling laws for different data mixtures
We investigate how variations in the training mixture affect the scaling laws of native multimodal models. To this end, we study four different mixtures that reflect common community practices [34, 41, 46, 81], with Image Caption-Interleaved-Text ratios of 45-45-10 (our default setup), 30-30-40, 40-20-40, and 20-40-40. For each mixture, we conduct a separate scaling study by training 76 different models, following our setup in § 3.1. Overall, Figure 5 shows that different mixtures follow similar scaling trends; however, the scaling coefficients vary depending on the mixture (Table 3). Interestingly, increasing the proportion of image-caption data (mixtures 1 and 2) leads to lower a and higher b, whereas increasing the ratio of interleaved and text data (mixtures 3 and 4) has the opposite effect. Notably, image-caption data contains more image tokens than text tokens; therefore, increasing its proportion results in more image tokens, while increasing interleaved and text data increases text token counts. This suggests that, when image tokens are prevalent, training for longer decreases the loss faster than increasing the model size. We also find that, for a fixed model size, increasing the text-only and interleaved data ratio favors early fusion (Figure 7).
Figure 8. Native training vs. initializing from LLMs: initializing from pre-trained models and scaling training tokens. We compare training with and without initializing from DCLM-1B.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Charts: Validation Loss vs. Tokens Seen With and Without LLM Initialization
Three line plots of validation loss against tokens seen (200B to 1.4T), one per data type (Image-Caption, Interleaved, Text-only), each comparing training from scratch ("Early") with initialization from a pre-trained LLM ("Early + Init."). The initialized runs start at a lower loss; both curves decrease with diminishing returns, and the gap narrows as training progresses (e.g., on image captions, ≈2.52 vs. 2.45 at 200B tokens and ≈2.28 vs. 2.23 at 1.4T).
</details>
## 3.3. Native multimodal pre-training vs . continual training of LLMs
In this section, we compare training natively from scratch to continual training after initializing from a pre-trained LLM. We initialize the model from DCLM-1B [21], which was trained on more than 2T tokens. Figure 8 shows that native multimodal models can close the gap with initialized models when trained for longer. Specifically, on image-captioning data, the model requires fewer than 100B multimodal tokens to reach comparable performance. However, on interleaved and text data, the model may need longer training, up to 1T tokens. Considering the cost of LLM pre-training, these results suggest that training natively can be a more efficient approach for achieving the same performance on multimodal benchmarks.
## 4. Towards multimodal specialization
Previously, we demonstrated that early-fusion models achieve performance on par with late-fusion models under a fixed compute budget. However, multimodal data is inherently heterogeneous, and training a unified model to fit such diverse distributions may be suboptimal. Here, we argue for multimodal specialization within a unified architecture. Ideally, the model should implicitly adapt to each modality, for instance, by learning modality-specific weights or specialized experts. Mixture of Experts is a strong candidate for this approach, having demonstrated effectiveness in LLMs. In this section, we highlight the advantages of sparse early-fusion models over their dense counterparts.
Setup. Our sparse models are based on the dropless-MoE implementation of Gale et al. [24], which eliminates token dropping during training caused by expert capacity constraints. We employ a top-k expert-choice routing mechanism, where each token selects its top-k experts among the E available experts. Specifically, we set k = 1 and E = 8, as we find this configuration to work effectively. Additionally, we incorporate an auxiliary load-balancing loss [59] with a weight of 0.01 to ensure balanced expert utilization.
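As a rough illustration of this setup, the following numpy sketch routes each token to its top-1 expert and computes a load-balancing auxiliary loss. It is a minimal stand-in, not the paper's (GPU-kernel-based) dropless implementation; the function names and the toy linear experts are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_moe(x, w_router, experts, aux_weight=0.01):
    """Top-1 routing over E experts with a load-balancing auxiliary loss.

    x:        (T, d) token activations
    w_router: (d, E) router weights
    experts:  list of E callables, each mapping (n, d) -> (n, d)
    """
    logits = x @ w_router                          # (T, E) router scores
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax over experts
    choice = probs.argmax(-1)                      # top-1 expert per token

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e                         # "dropless": no token is discarded
        if mask.any():
            out[mask] = probs[mask, e, None] * expert(x[mask])

    # Load-balancing loss: E * sum_e f_e * p_e, minimized by uniform routing,
    # where f_e is the fraction of tokens routed to expert e and p_e the
    # mean router probability for expert e.
    E = len(experts)
    f = np.bincount(choice, minlength=E) / len(choice)
    p = probs.mean(0)
    aux_loss = aux_weight * E * (f * p).sum()
    return out, aux_loss

# Toy usage: 16 tokens of dimension 8 routed among E = 8 linear experts.
T, d, E = 16, 8, 8
x = rng.normal(size=(T, d))
w_router = rng.normal(size=(d, E))
experts = [lambda h, W=rng.normal(size=(d, d)) / np.sqrt(d): h @ W
           for _ in range(E)]
y, aux = top1_moe(x, w_router, experts)
```

In training, `aux` would be added to the language-modeling loss with the 0.01 weight above so that gradients push the router toward balanced expert utilization.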
Figure 9. Scaling laws for sparse early-fusion NMMs. We report the final validation loss averaged across interleaved, image-caption, and text data.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Chart: Validation Loss vs. FLOPs for Sparse Models
A log-log plot of validation loss (≈2.2-4.5) against training FLOPs (10^19 to 10^22) for sparse early-fusion models with 0.275B to 3.534B active parameters. Every curve decreases with compute with diminishing returns, larger models reach lower loss at matched FLOPs, and the fitted compute frontier follows L = 26.287 C^-0.047.
</details>
Following Abnar et al. [2], we compute training FLOPs as 6ND, where N represents the number of active parameters.
## 4.1. Sparse vs dense NMMs when scaling FLOPs
We compare sparse MoE models to their dense counterparts by training models with different numbers of active parameters and varying amounts of training tokens. Figure 10 shows that, under the same inference cost (or number of active parameters), MoEs significantly outperform dense models. Interestingly, this performance gap is more pronounced for smaller model sizes. This suggests that MoEs enable models to handle heterogeneous data more effectively and specialize in different modalities. However, as dense models become sufficiently large, the gap between the two architectures gradually closes.
## 4.2. Scaling laws for sparse early-fusion models
We train different models (ranging from 300M to 3.4B active parameters) on varying amounts of tokens (ranging from 250M to 600B) and report the final loss in Figure 9. We fit a power law to the convex hull of the lowest loss as a function of compute (FLOPs). Interestingly, the exponent (-0.048) is close to that of dense NMMs (-0.049), indicating that both architectures scale similarly. However, the multiplicative constant is smaller for MoEs (27.086) than for dense models (29.574), revealing a lower loss. Additionally, MoEs require longer training to reach saturation than dense models (see Appendix C for details). We also predict the coefficients of Eq 1 by considering N as the number of active parameters. Table 2 shows a significantly higher α compared to dense models. Interestingly, b is significantly higher than a, revealing that training tokens should be scaled at a higher rate than the number of parameters when training sparse NMMs. We also experiment with a scaling law that takes sparsity into account [2] and reach similar conclusions (Appendix C.7).
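A frontier fit of this kind can be sketched as follows. The running-lower-envelope filter is a simple stand-in for the convex-hull construction described above, and the synthetic runs merely mimic the reported dense frontier to exercise the code; none of this is the paper's implementation.

```python
import numpy as np

def fit_compute_frontier(flops, losses):
    """Fit L = k * C^c to the running lower envelope of (FLOPs, loss)
    points in log-log space; returns (k, c)."""
    logC = np.log(np.asarray(flops, dtype=float))
    logL = np.log(np.asarray(losses, dtype=float))
    order = np.argsort(logC)
    logC, logL = logC[order], logL[order]

    keep, best = [], np.inf
    for i, l in enumerate(logL):
        if l < best:            # keep only points that improve the envelope
            keep.append(i)
            best = l
    c, log_k = np.polyfit(logC[keep], logL[keep], 1)
    return np.exp(log_k), c

# Synthetic runs: several model sizes, each tracing a curve that touches the
# dense frontier L = 29.574 * C^-0.0492 only near its compute-optimal point.
C = np.logspace(19, 22, 60)
frontier = 29.574 * C**-0.0492
runs = [frontier * (1 + 0.02 * np.abs(np.log10(C) - mu))
        for mu in (19.5, 20.5, 21.5)]
k, c = fit_compute_frontier(np.tile(C, 3), np.concatenate(runs))
```

On these synthetic points the fit recovers an exponent close to -0.0492; on real runs, the envelope points are the per-budget best losses across model sizes.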
## 4.3. Modality-aware vs . Modality-agnostic routing
Another alternative to MoEs is modality-aware routing, where multimodal tokens are assigned to experts based on
Figure 10. MoE vs Dense: scaling training FLOPs. We compare MoE and dense early-fusion models when scaling both the amount of training tokens and model sizes. MoEs beat dense models when matching the number of active parameters.
their modalities, similar to previous works [7, 75]. We train models with distinct image and text experts implemented as FFNs, where image tokens are processed only by the image FFN and text tokens only by the text FFN. As shown in Figure 11, MoEs with learned, modality-agnostic routing exhibit significantly better performance than modality-aware routing on both image-caption and interleaved data.
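The two routing schemes can be contrasted in a short sketch. Below, a hypothetical top-1 dispatch step assigns each token to an expert either via a learned router (modality-agnostic, as in our MoEs) or via a fixed modality flag (modality-aware); all names, shapes, and the single-matrix "experts" are toy illustrations, not our implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 16, 4

tokens = rng.normal(size=(n_tokens, d_model))
# 0 = text token, 1 = image token (hypothetical interleaved sequence)
modality = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Each expert is a toy "FFN": a single weight matrix for brevity.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def agnostic_route(tokens):
    """Modality-agnostic: a learned router picks the top-1 expert per token."""
    router_w = rng.normal(size=(d_model, n_experts))  # stand-in for trained weights
    return (tokens @ router_w).argmax(axis=-1)

def aware_route(modality):
    """Modality-aware: expert 0 for text tokens, expert 1 for image tokens."""
    return modality  # fixed assignment; nothing is learned

def dispatch(tokens, expert_idx):
    out = np.empty_like(tokens)
    for i, e in enumerate(expert_idx):
        out[i] = tokens[i] @ experts[e]
    return out

out_agnostic = dispatch(tokens, agnostic_route(tokens))
out_aware = dispatch(tokens, aware_route(modality))
```

The key design difference is visible in the two routing functions: the agnostic router can, in principle, discover modality specialization on its own (as Section 4.4 shows it does), while the aware router hard-codes it.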
## 4.4. Emergence of expert specialization and sharing
We investigate multimodal specialization in MoE architectures. In Figure 13, we visualize the normalized number of text and image tokens assigned to each expert across layers. To quantify this specialization, we compute a specialization score, defined as the average, across all experts within a layer, of 1 − H(p), where H is the binary entropy of each expert's text/image token distribution. We plot this specialization score in Figure 12. Higher specialization scores indicate a tendency for experts to focus on either text or image tokens, while lower scores indicate shared behavior. These visualizations provide clear evidence of modality-specific experts, particularly in the early layers. Furthermore, the specialization score decreases with depth before rising again in the last layers. This suggests that early and final layers exhibit higher modality specialization than mid-layers. This behavior is intuitive, as middle layers are expected to hold higher-level features that may generalize across modalities, and it is consistent with findings in [61] showing increasing alignment between modalities across layers. The emergence of both expert specialization and cross-modality sharing in our modality-agnostic MoE suggests it may be preferable to modality-aware sparsity. All data displayed here is from an early-fusion MoE model with 1B active parameters trained for 300B tokens.
Table 4. Supervised finetuning on the LLaVA mixture. All models are native, at the 1.5B scale, and pre-trained on 300B tokens. Accuracy is reported for the VQA benchmarks and CIDEr for the captioning benchmarks (COCO, TextCaps).

| Model        | AVG  | VQAv2 | TextVQA | OKVQA | GQA  | VizWiz | COCO | TextCaps |
|--------------|------|-------|---------|-------|------|--------|------|----------|
| Late-fusion  | 46.8 | 69.4  | 25.8    | 50.1  | 65.8 | 22.8   | 70.7 | 50.9     |
| Early-fusion | 47.6 | 69.3  | 28.1    | 52.1  | 65.4 | 23.2   | 72.0 | 53.8     |
| Early-MoEs   | 48.2 | 69.8  | 30.0    | 52.1  | 65.4 | 23.6   | 69.6 | 55.7     |
Figure 11. Modality-aware vs. modality-agnostic routing for sparse NMMs. We compare modality-agnostic routing with modality-aware routing when scaling both the amount of training tokens and model sizes.
## 5. Evaluation on downstream tasks with SFT
Following previous work on scaling laws, we primarily rely on validation losses. However, we generally find that this evaluation correlates well with performance on downstream tasks. To validate this, we conduct a multimodal instruction tuning stage (SFT) on the LLaVA mixture [43] and report accuracy and CIDEr scores across several VQA and captioning tasks. Table 4 confirms the ranking of different model configurations. Specifically, early fusion outperforms late fusion, and MoEs outperform dense models. However, since the models are relatively small (1.5B scale), trained from scratch, and fine-tuned on a small dataset, the overall scores are lower than the current state of the art. Further implementation details can be found in Appendix A.
## 6. Related work
Large multimodal models. A long-standing research goal has been to develop models capable of perceiving the world through multiple modalities, akin to human sensory experience. Recent progress in vision and language processing has shifted the research focus from smaller, task-specific models toward large, generalist models that can handle diverse inputs [29, 67]. Crucially, pre-trained vision and language backbones often require surprisingly little adaptation to enable effective cross-modal communication [32, 47, 62, 68, 69]. Simply integrating a vision encoder with either an encoder-decoder architecture [45, 48, 63, 72]
Figure 12. MoE specialization score. Entropy-based image/text specialization score (as described in § 4.4) across layers for two data sources: HQITP and Obelics. HQITP has a more imbalanced image-to-text token distribution, resulting in generally higher specialization. Despite this difference, both data sources exhibit a similar trend: the specialization score decreases in the early layers before increasing again in the final layers.
or a decoder-only LLM has yielded highly capable multimodal systems [1, 6, 9, 13, 16, 35, 43, 49, 64, 73, 78, 83]. This late-fusion approach, where modalities are processed separately before being combined, is now well understood, with established best practices for training effective models [34, 41, 46, 81]. In contrast, early-fusion models [8, 18, 66], which combine modalities at an earlier stage, remain relatively unexplored, with only a limited number of publicly released models [8, 18]. Unlike [18, 66], our models utilize only a single linear layer and rely exclusively on a next-token prediction loss. Furthermore, we train our models from scratch on all modalities without image tokenization.
Native Multimodal Models. We define native multimodal models as those trained from scratch on all modalities simultaneously [67] rather than adapting LLMs to accommodate additional modalities. Due to the high cost of training such models, they remain relatively underexplored, with most relying on late-fusion architectures [27, 79]. Some multimodal models trained from scratch [4, 66, 76] relax this constraint by utilizing pre-trained image tokenizers such as [20, 70] to convert images into discrete tokens, integrating them into the text vocabulary. This approach enables models to understand and generate text and images, facilitating a more seamless multimodal learning process.
Scaling laws. Scaling law studies aim to predict how model performance scales with training compute. Early works [26, 31] found that LLM performance follows a power-law relationship with compute, enabling compute-optimal estimation of the number of model parameters and training tokens for a given budget. Similar research has extended these findings to sparse Mixture of Experts (MoE) models, considering factors such as sparsity, number of experts, and routing granularity [15, 33, 74]. Scaling laws have also been observed across various domains, including image models [23], video models [56], protein LLMs [14], and imitation learning [54]. However, few studies have investigated scaling laws for multimodal models. Notably, Aghajanyan et al. [5] examined multimodal models that tokenize modalities into discrete tokens and include multimodal generation. In contrast, we focus on early-fusion models that take raw multimodal inputs and are trained on interleaved multimodal data.
Figure 13. MoE specialization frequency. Percentage of text and image tokens routed to each expert on interleaved data from Obelics. Experts are ordered for better visualization. The first layer shows the highest number of unimodal experts.
Mixture of experts (MoEs). MoEs [59] scale model capacity efficiently by sparsely activating parameters, enabling large models with reduced per-sample compute. While widely studied in LLMs [22, 30, 36, 37, 42, 65, 77, 84], MoEs remain underexplored in multimodal settings. Prior work has examined contrastive models [50], late-fusion LLMs [38, 40], and modality-specific experts [7, 12, 60]. We focus on analyzing MoEs in early-fusion multimodal models.
## 7. Limitations
Our study finds that scaling law coefficients are broadly consistent across training mixtures, though a broader exploration is needed to validate this observation. While validation loss scales predictably with compute, the extent to which this correlates with downstream performance remains unclear and warrants further investigation. The accuracy of scaling law predictions improves with higher FLOPs, but their extrapolation to extreme model sizes is still an open question (see Appendix D for more details).
## 8. Conclusion
We explore various strategies for compute-optimal pre-training of native multimodal models. We find that NMMs follow scaling laws similar to those of LLMs. Contrary to common belief, we find no inherent advantage in adopting late-fusion architectures over early-fusion ones. While both architectures exhibit similar scaling properties, early-fusion models are more efficient to train and outperform late-fusion models at lower compute budgets. Furthermore, we show that sparse architectures encourage modality-specific specialization, leading to performance improvements while maintaining the same inference cost.
## Acknowledgment
We thank Philipp Dufter, Samira Abnar, Xiujun Li, Zhe Gan, Alexander Toshev, Yinfei Yang, Dan Busbridge, and Jason Ramapuram for many fruitful discussions. We thank Denise Hui and Samy Bengio for infra and compute support. Finally, we thank Louis Béthune, Pierre Ablin, Marco Cuturi, and the MLR team at Apple for their support throughout the project.
## References
- [1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 , 2024. 8
- [2] Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models. arXiv preprint arXiv:2501.12370 , 2025. 2, 3, 6, 18, 20
- [3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023. 1
- [4] Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520 , 2022. 8
- [5] Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning , pages 265-279. PMLR, 2023. 3, 8
- [6] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems , 35:23716-23736, 2022. 1, 8
- [7] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, and Furu Wei. Vlmo: Unified vision-language pretraining with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358 , 2021. 7, 8
- [8] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023. 8
- [9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726 , 2024. 1, 8
- [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877-1901, 2020. 3, 4
- [11] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset , 2022. 3, 13
- [12] Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, and Dongyu Zhang. Eve: Efficient vision-language pre-training with masked prediction and modality-aware moe. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 1110-1119, 2024. 8
- [13] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 24185-24198, 2024. 8
- [14] Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, and Le Song. Training compute-optimal protein language models. bioRxiv , 2024. 8
- [15] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning , pages 4057-4086. PMLR, 2022. 8
- [16] Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402 , 2024. 8
- [17] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning , pages 7480-7512. PMLR, 2023. 3
- [18] Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832 , 2024. 8
- [19] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024. 1
- [20] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 12873-12883, 2021. 8
- [21] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425 , 2023. 3, 6, 13
- [22] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research , 23(120):1-39, 2022. 8
- [23] Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders, 2024. 1, 8
- [24] Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5:288-304, 2023. 6
- [25] Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint arXiv:2405.18392, 2024. 3
- [26] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 30016-30030, 2022. 2, 3, 4, 8, 17
- [27] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems , 36:72096-72109, 2023. 8
- [28] Peter J. Huber. Robust Estimation of a Location Parameter , pages 492-518. Springer New York, New York, NY, 1992. 3
- [29] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 , 2024. 7
- [30] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088 , 2024. 8
- [31] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 , 2020. 2, 8, 15
- [32] Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. In International Conference on Machine Learning , pages 17283-17300. PMLR, 2023. 7
- [33] Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871, 2024. 8, 18
- [34] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024. 3, 5, 8, 13
- [35] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024. 1, 8
- [36] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 , 2020. 8
- [37] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning , pages 6265-6274. PMLR, 2021. 8
- [38] Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-ofexperts model. arXiv preprint arXiv:2410.05993 , 2024. 8
- [39] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794 , 2024. 3, 13, 15
- [40] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947 , 2024. 8
- [41] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26689-26699, 2024. 5, 8
- [42] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 , 2024. 8
- [43] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26296-26306, 2024. 1, 7, 8
- [44] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 , 2017. 13
- [45] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations , 2022. 7
- [46] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Anton Belyi, et al. Mm1: methods, analysis and insights from multimodal llm pre-training. In European Conference on Computer Vision , pages 304-323. Springer, 2025. 5, 8, 13
- [47] Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. In The Eleventh International Conference on Learning Representations , 2023. 7
- [48] David Mizrahi, Roman Bachmann, Oguzhan Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, and Amir Zamir. 4m: Massively multimodal masked modeling. Advances in Neural Information Processing Systems , 36:58363-58408, 2023. 7
- [49] Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, et al. Anymal: An efficient and scalable any-modality augmented language model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , pages 1314-1332, 2024. 8
- [50] Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564-9576, 2022. 8
- [51] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):951-958, 1980. 3
- [52] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1
- [53] Tim Pearce and Jinyeop Song. Reconciling kaplan and chinchilla scaling laws. arXiv preprint arXiv:2406.12907 , 2024. 15
- [54] Tim Pearce, Tabish Rashid, Dave Bignell, Raluca Georgescu, Sam Devlin, and Katja Hofmann. Scaling laws for pre-training agents and world models. arXiv preprint arXiv:2411.04434 , 2024. 8
- [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748-8763. PMLR, 2021. 1, 3, 15
- [56] Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, and Jitendra Malik. An empirical study of autoregressive pretraining from videos. arXiv preprint arXiv:2501.05453 , 2025. 8
- [57] Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, and Jonathon Shlens. Perceptual grouping in contrastive vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 5571-5584, 2023. 13
- [58] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 , 2020. 3
- [59] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017. 2, 6, 8
- [60] Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. Scaling vision-language models with sparse mixture of experts. In The 2023 Conference on Empirical Methods in Natural Language Processing , 2023. 8
- [61] Mustafa Shukor and Matthieu Cord. Implicit multimodal alignment: On the generalization of frozen llms to multimodal inputs. Advances in Neural Information Processing Systems , 37:130848-130886, 2024. 7
- [62] Mustafa Shukor, Corentin Dancette, and Matthieu Cord. epalm: Efficient perceptual augmentation of language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 22056-22069, 2023. 1, 7
- [63] Mustafa Shukor, Corentin Dancette, Alexandre Rame, and Matthieu Cord. Unival: Unified model for image, video, audio and language tasks. Transactions on Machine Learning Research Journal , 2023. 7
- [64] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 , 2025. 1, 8
- [65] Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent. arXiv preprint arXiv:2411.02265 , 2024. 8
- [66] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 , 2024. 8
- [67] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 1, 7, 8
- [68] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems , 34:200-212, 2021. 7
- [69] Théophane Vallaeys, Mustafa Shukor, Matthieu Cord, and Jakob Verbeek. Improved baselines for data-efficient perceptual augmentation of llms. arXiv preprint arXiv:2403.13499, 2024. 7
- [70] Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems . Curran Associates, Inc., 2017. 8
- [71] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems , 2017. 3
- [72] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learning , pages 23318-23340. PMLR, 2022. 7
- [73] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 , 2024. 1, 8
- [74] Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, and Jingang Wang. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 5583-5595, Miami, Florida, USA, 2024. Association for Computational Linguistics. 8, 18
- [75] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and visionlanguage tasks. arXiv preprint arXiv:2208.10442 , 2022. 7
- [76] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 , 2024. 8
- [77] Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, et al. Skywork-moe: A deep dive into training techniques for mixture-of-experts language models. arXiv preprint arXiv:2406.06563, 2024. 8
- [78] Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models. arXiv preprint arXiv:2408.08872 , 2024. 1, 8
- [79] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research , 2022. 8
- [80] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 11975-11986, 2023. 1
- [81] Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, et al. Mm1. 5: Methods, analysis & insights from multimodal llm fine-tuning. arXiv preprint arXiv:2409.20566 , 2024. 5, 8
- [82] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, ChienChin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 , 2023. 3
- [83] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations , 2024. 8
- [84] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Stmoe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906 , 2022. 8
## Scaling Laws for Native Multimodal Models
## Supplementary Material
This supplementary material is organized as follows:
- Appendix A: contains the implementation details and the hyperparameters used to train our models.
- Appendix B: contains detailed comparison between early and late fusion models.
- Appendix C: contains more details about scaling laws derivation, evaluation and additional results.
- Appendix D: contains discussion about the paper limitations.
- Appendix E: contains more results about MoEs and modality specialization.
| Data type     | Dataset      | #samples  | Sampling prob. |
|---------------|--------------|-----------|----------------|
| Image-Caption | DFN [21]     | 2B        | 27%            |
| Image-Caption | COYO [11]    | 600M      | 11.25%         |
| Image-Caption | HQITP [57]   | 400M      | 6.75%          |
| Interleaved   | Obelics [34] | 141M docs | 45%            |
| Text          | DCLM [39]    | 6.6T toks | 10%            |
## A. Experimental setup
In Table 6, we show the pre-training hyperparameters for the different model configurations used to derive the scaling laws. The number of parameters ranges from 275M to 3.7B, with model width increasing accordingly while the depth remains fixed at 24 layers. Learning rates vary by model size, decreasing as the model scales up; based on empirical experiments and estimates similar to [46], we found these values to be effective in our setup. Training is optimized using a fully decoupled AdamW optimizer with momentum values β1 = 0.9, β2 = 0.95, and a weight decay of 1e-4. The batch size is set to 2k samples, which amounts to 2M tokens given a 1k context length. Gradient clipping is set to 1.0, with a maximum warmup duration of 5k iterations, adjusted for shorter training runs: 1k and 2.5k warmup steps for models trained for 1k-4k and 5k-15k steps, respectively. For MoEs, we found that a longer warmup is significantly better, so we adopt a 2.5k warmup for all runs under 20k steps. We use a constant learning rate schedule with a cooldown during the final 20% of training, gradually reducing the learning rate to zero following an inverse square root schedule. For vision processing, image inputs are divided into (14, 14) patches, with augmentations including Random Resized Crop (resizing images to 224px with a scale range of [0.4, 1.0]) and Random Horizontal Flip with a probability of 0.5. We train our models on a mixture of interleaved, image-caption, and text-only data (Table 5). For late-fusion models, we found that using a smaller learning rate for the vision encoder significantly boosts performance (Table 8), and when both the encoder and decoder are initialized from pre-trained models (Appendix B.7), freezing the vision encoder works best (Table 7).
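The learning rate schedule described above can be sketched as follows. The exact cooldown shape is our assumption: we use the 1 − √(·) decay commonly paired with constant-LR schedules, since the text only specifies an inverse-square-root-style cooldown to zero over the final 20% of training.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1.5e-3,
               warmup_steps=5000, cooldown_frac=0.2):
    """Constant LR with linear warmup and a cooldown to zero over the
    final `cooldown_frac` of training. The cooldown shape (1 - sqrt)
    is an assumption; the paper only states an inverse-square-root
    style decay to zero."""
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear warmup
    if step <= cooldown_start:
        return peak_lr                            # constant phase
    # cooldown: decays from peak_lr at cooldown_start to 0 at total_steps
    frac = (step - cooldown_start) / (total_steps - cooldown_start)
    return peak_lr * (1.0 - math.sqrt(frac))
```

For a 100k-step run this gives the peak rate from step 5k through step 80k, then a smooth decay to zero.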
Table 5. Pre-training data mixture. Unless otherwise specified, the training mixture contains 45%, 45% and 10% of image captions, interleaved documents and text-only data.
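As a sketch, per-sample source selection under the Table 5 weights might look like the following; the loader structure and names are illustrative, not the actual data pipeline.

```python
import random

# Sampling probabilities from Table 5 (pre-training data mixture).
MIXTURE = {
    "DFN":     0.27,
    "COYO":    0.1125,
    "HQITP":   0.0675,
    "Obelics": 0.45,
    "DCLM":    0.10,
}

def sample_dataset(rng=random):
    """Draw one dataset name according to the mixture weights.
    A real loader would then fetch the next sample from that
    dataset's shards; here we only pick the source."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]
```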
**Early-fusion**

| Params        | 275M   | 468M   | 932M | 1.63B  | 2.28B | 3.35B  | 3.35B  |
|---------------|--------|--------|------|--------|-------|--------|--------|
| width         | 800    | 1088   | 1632 | 2208   | 2624  | 3232   | 3232   |
| depth         | 24     | 24     | 24   | 24     | 24    | 24     | 24     |
| Learning rate | 1.5e-3 | 1.5e-3 | 5e-4 | 4.2e-4 | 4e-4  | 3.5e-4 | 3.5e-4 |

**Late-fusion**

| Params               | 289M   | 494M   | 1B   | 1.75B  | 2.43B  | 3.7B   | 3.7B   |
|----------------------|--------|--------|------|--------|--------|--------|--------|
| vision encoder width | 384    | 512    | 768  | 1024   | 1184   | -      | -      |
| width                | 768    | 1024   | 1536 | 2048   | 2464   | 3072   | 3072   |
| depth                | 24     | 24     | 24   | 24     | 24     | 24     | 24     |
| Learning rate        | 1.5e-3 | 1.5e-3 | 5e-4 | 4.2e-4 | 3.8e-4 | 3.3e-4 | 3.3e-4 |

**Early-fusion MoEs**

| Active Params | 275M   | 468M   | 932M | 1.63B  | 2.28B | 3.35B  | 3.35B  |
|---------------|--------|--------|------|--------|-------|--------|--------|
| width         | 800    | 1088   | 1632 | 2208   | 2624  | 3232   | 3232   |
| depth         | 24     | 24     | 24   | 24     | 24    | 24     | 24     |
| Learning rate | 1.5e-3 | 1.5e-3 | 5e-4 | 4.2e-4 | 4e-4  | 3.5e-4 | 3.5e-4 |

**Shared hyperparameters**

| Training tokens           | 2.5B-600B                                                              |
|---------------------------|------------------------------------------------------------------------|
| Optimizer                 | Fully decoupled AdamW [44]                                             |
| Momentum                  | β1 = 0.9, β2 = 0.95                                                    |
| Minimum learning rate     | 0                                                                      |
| Weight decay              | 1e-4                                                                   |
| Batch size                | 2k                                                                     |
| Patch size                | (14, 14)                                                               |
| Gradient clipping         | 1.0                                                                    |
| Maximum warmup iterations | 5k                                                                     |
| Augmentations             | RandomResizedCrop (size 224px, scale [0.4, 1.0]); RandomHorizontalFlip (p = 0.5) |
Table 6. Pre-training hyperparameters. We detail the hyperparameters used for pre-training the different model configurations to derive the scaling laws.
Table 7. Vision encoder scaler. Freezing the vision encoder works best when initializing late-fusion models with pre-trained models.
| Vision encoder lr scaler | Interleaved (CE) | Image-Caption (CE) | Text (CE) | AVG (CE) | AVG (SFT) (Acc) |
|----------------------------|--------------------|----------------------|-------------|------------|-------------------|
| 1 | 2.521 | 2.15 | 2.867 | 2.513 | 43.49 |
| 0.1 | 2.502 | 2.066 | 2.862 | 2.477 | 52.27 |
| 0.01 | 2.502 | 2.066 | 2.859 | 2.476 | 53.76 |
| 0.001 | 2.513 | 2.066 | 2.857 | 2.479 | - |
| 0 (frozen) | 2.504 | 2.061 | 2.856 | 2.474 | 54.14 |
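A minimal sketch of how such a vision-encoder learning-rate scaler could be wired up via optimizer parameter groups (PyTorch-style group dicts; the function and argument names are ours, not the paper's):

```python
def make_param_groups(decoder_params, vision_encoder_params,
                      base_lr, vision_lr_scaler):
    """Build optimizer parameter groups for a late-fusion model where
    the vision encoder trains at base_lr * vision_lr_scaler. A scaler
    of 0 excludes the encoder entirely (frozen), matching the best row
    in Table 7 when both components start from pre-trained weights."""
    groups = [{"params": list(decoder_params), "lr": base_lr}]
    if vision_lr_scaler > 0:
        groups.append({
            "params": list(vision_encoder_params),
            "lr": base_lr * vision_lr_scaler,
        })
    return groups
```

The returned list can be passed directly to an AdamW-style constructor that accepts per-group learning rates.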
Figure 14. Early vs late fusion: scaling training FLOPs. We compare early- and late-fusion models when scaling both the model size and the number of training tokens. The gap decreases mainly due to scaling model size.
Figure 15. Early vs late fusion: changing the training mixture. We vary the training mixtures and plot the final training loss. Early-fusion models become better when increasing the proportion of interleaved documents. Early and late fusion have 1.63B and 1.75B parameters, respectively.
Table 8. Vision encoder scaler. Reducing the learning rate for the vision encoder is better when training late-fusion models from scratch.
| Vision encoder lr scaler | Interleaved (CE) | Image-Caption (CE) | Text (CE) | AVG (CE) | AVG (SFT) (Acc) |
|----------------------------|--------------------|----------------------|-------------|------------|-------------------|
| 0.1 | 2.674 | 2.219 | 3.072 | 2.655 | 34.84 |
| 0.01 | 2.672 | 2.197 | 3.071 | 2.647 | 38.77 |
| 0.001 | 2.674 | 2.218 | 3.073 | 2.655 | 38.46 |
## B. Late vs early fusion
This section provides additional comparisons between early- and late-fusion models.
## B.1. Scaling FLOPs
Figure 14 compares early-fusion and late-fusion models when scaling FLOPs. Specifically, for each model size, we train multiple models with different amounts of training tokens. The performance gap between the two approaches shrinks mainly due to increasing model size rather than increasing the number of training tokens. Despite the decreasing gap, early fusion consistently outperforms late fusion across all the models we train.
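For intuition on the compute axis, training FLOPs for dense transformers are commonly approximated with the 6ND rule (N parameters, D tokens, forward plus backward); this is a rough estimate, not necessarily the paper's exact accounting:

```python
def train_flops(n_params, n_tokens):
    """Standard 6*N*D approximation of dense-transformer training
    compute; treat as an order-of-magnitude estimate."""
    return 6 * n_params * n_tokens

# e.g. a 1.63B-parameter model trained on 200B tokens:
# train_flops(1.63e9, 200e9) is roughly 2e21 FLOPs
```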
## B.2. Changing the training data mixture
We analyze how the performance gap between early- and late-fusion models changes with variations in the training data mixture. As shown in Figure 16 and Figure 15, when fixing the model size, increasing the ratio of text and interleaved data favors early fusion. Interestingly, the gap remains largely unchanged for other data types. We also observe interference effects between different data types. Specifically, increasing the amount of interleaved data negatively impacts performance on image captions and vice versa. Additionally, increasing the proportion of text-only data slightly improves interleaved performance but increases loss on image captions. Overall, we find that text-only and interleaved data are correlated across different setups.
## B.3. Scaling image resolution is in favor of early fusion
We examine how both architectures perform with varying image resolution. We fix the number of model parameters to 1.63B and 1.75B for early and late fusion, respectively. All models are trained for 100K steps or 200B tokens. Since
Figure 16. Early vs late fusion: changing the amount of text-only data in the training mixture (isoFLOPs). We vary the ratio of text-only data and plot the final training loss. The gap increases with the text data ratio in favor of the early-fusion model. Early fusion has 1.63B parameters and late fusion 1.75B parameters.
Figure 17. Early vs late fusion: training with different image resolutions (isoFLOPs). For the same training FLOPs, we vary the image resolution (and thus the number of image tokens) during training and report the final training loss. Increasing the resolution hurts performance on text and interleaved documents while helping image captioning. The gap stays almost the same on text and interleaved data, while it increases slightly on image captioning in favor of early fusion.
<details>
<summary>Image 16 Details</summary>

### Visual Description
## Line Chart: Validation Loss vs. Image Resolution
### Overview
The image presents two line charts comparing the validation loss of two models, "Late-1B" and "Early-932M", across varying image resolutions. The charts are positioned side-by-side, labeled "Image-Caption" and "Interleaved". Both charts share the same y-axis label ("Validation Loss") and x-axis label ("Image resolution").
### Components/Axes
* **X-axis:** Image resolution, with markers at 112px, 168px, 224px, 280px, and 336px.
* **Y-axis:** Validation Loss, ranging from approximately 2.5 to 2.72.
* **Legend (Top-Right of each chart):**
* "Late-1B" - Represented by a light blue line with circular markers.
* "Early-932M" - Represented by an orange line with diamond markers.
* **Chart Titles (Top-Center):**
* Left Chart: "Image-Caption"
* Right Chart: "Interleaved"
### Detailed Analysis or Content Details
**Image-Caption Chart (Left):**
The light blue line ("Late-1B") starts at approximately 2.56 at 112px and slopes downward, reaching a minimum of approximately 2.51 at 336px. The orange line ("Early-932M") begins at approximately 2.56 at 112px, also slopes downward, and reaches a minimum of approximately 2.50 at 336px.
* 112px: Late-1B ≈ 2.56, Early-932M ≈ 2.56
* 168px: Late-1B ≈ 2.54, Early-932M ≈ 2.54
* 224px: Late-1B ≈ 2.53, Early-932M ≈ 2.52
* 280px: Late-1B ≈ 2.52, Early-932M ≈ 2.51
* 336px: Late-1B ≈ 2.51, Early-932M ≈ 2.50
**Interleaved Chart (Right):**
The light blue line ("Late-1B") starts at approximately 2.70 at 112px, rises to approximately 2.72 at 224px, and then stays flat at approximately 2.72 through 336px. The orange line ("Early-932M") begins at approximately 2.69 at 112px, increases to approximately 2.70 at 224px, and then rises more sharply to approximately 2.72 at 336px.
* 112px: Late-1B ≈ 2.70, Early-932M ≈ 2.69
* 168px: Late-1B ≈ 2.70, Early-932M ≈ 2.70
* 224px: Late-1B ≈ 2.72, Early-932M ≈ 2.70
* 280px: Late-1B ≈ 2.72, Early-932M ≈ 2.71
* 336px: Late-1B ≈ 2.72, Early-932M ≈ 2.72
### Key Observations
* In the "Image-Caption" chart, both models exhibit a decreasing validation loss as image resolution increases, suggesting improved performance with higher resolution images. The "Early-932M" model consistently has a slightly lower validation loss than the "Late-1B" model.
* In the "Interleaved" chart, the "Late-1B" model shows a relatively flat validation loss, while the "Early-932M" model's validation loss increases with image resolution.
* The scales of the Y-axis are different between the two charts.
### Interpretation
The charts compare the two models ("Late-1B" and "Early-932M") on two evaluation sets ("Image-Caption" and "Interleaved") as the training image resolution increases.
The "Image-Caption" chart suggests that both models benefit from higher-resolution images, as indicated by the decreasing validation loss, with "Early-932M" consistently slightly below "Late-1B".
The "Interleaved" chart presents a different picture: "Late-1B" is roughly flat beyond 224px, while "Early-932M" rises with resolution, so the two curves converge at 336px.
The differing trends show that the evaluation data type governs whether higher resolution helps: both models benefit on image captioning, whereas interleaved documents do not. The different y-axis scales make direct comparison of absolute loss values difficult, but the relative trends within each chart are clear.
</details>
the patch size remains constant, increasing the resolution results in a higher number of visual tokens. For all resolutions, we maintain the same number of text tokens. As shown in Figure 17, the early-fusion model consistently outperforms the late-fusion model across resolutions, particularly for multimodal data, with the performance gap widening at higher resolutions. Additionally, we observe that the loss on text and interleaved data increases as resolution increases.
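As a concrete illustration of why resolution controls sequence length: with ViT-style patchification at a fixed patch size, the number of image tokens grows quadratically with resolution. The patch size of 14 below is an assumption for illustration, not a value stated in this passage.

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of image tokens for a square image under ViT-style patchification.

    Patch size 14 is an assumed value for illustration.
    """
    assert resolution % patch_size == 0, "resolution must be a multiple of the patch size"
    return (resolution // patch_size) ** 2

# Token counts for the resolutions swept in Figure 17 (assuming patch size 14):
counts = {res: num_visual_tokens(res) for res in (112, 168, 224, 280, 336)}
# 112 -> 64, 168 -> 144, 224 -> 256, 280 -> 400, 336 -> 576
```

Tripling the resolution from 112px to 336px thus multiplies the visual token count by nine, while the text token count is held fixed.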
## B.4. Early-fusion is consistently better when matching the late-fusion model size
In this section, we compare the late-fusion model with different configurations of the early-fusion model. Specifically, we train early-fusion models that match the late-fusion model in total parameters (Params), text model size (Text), or FLOPs (FLOPs), assuming a 45-45-10 training mixture. As shown in Figure 18, early fusion consistently outperforms late fusion when normalized by total parameters, followed by normalization by FLOPs. When matching the text model size, early fusion performs better at higher ratios of interleaved data.
## B.5. Different late-fusion configuration
We examine how this scaling changes with different late-fusion configurations. Instead of scaling both the vision and text models equally, as done in the main paper, we fix the vision encoder size at 300M and scale only the text model. Figure 19 shows that late-fusion models lag behind at smaller model sizes, with the gap closing significantly as the text model scales. This suggests that allocating more parameters to shared components is more beneficial, further supporting the choice of early-fusion models.
## B.6. Different context lengths
In the paper, we use a 1k context length following [31]. Also following that work, we ignore the effect of context length, as the model dimension dominates the training compute estimate. Moreover, [53] empirically found that scaling coefficients are robust to context length. Nevertheless, our initial experiments (Figure 20) indicate that scaling the context length does not significantly affect the comparison between late and early fusion.
## B.7. Initializing from LLM and CLIP
We study the case where both late- and early-fusion models are initialized from pre-trained models, specifically DCLM-1B [39], with CLIP-ViT-L [55] as the vision encoder for late fusion. Interestingly, Figure 21 shows that for text and interleaved multimodal documents, early fusion can match the performance of late fusion when trained for longer. However, closing the gap on image-caption data remains more challenging. Notably, when considering the overall training cost, including that of the pre-trained models, early fusion requires significantly longer training to compensate for the vision encoder's pre-training cost.
<details>
<summary>Image 17 Details</summary>

### Visual Description
## Line Charts: Cross-Entropy Loss vs. Interleaved Data Percentage
### Overview
The image presents three line charts, each depicting the relationship between a cross-entropy (CE) loss metric (y-axis) and the percentage of interleaved data (x-axis). The charts correspond to different evaluation sets: Paired CE, Interleaved CE, and Text CE. Each chart contains four data series: 'L', 'E (Text)', 'E (FLOPs)', and 'E (Params)'.
### Components/Axes
* **X-axis:** "% of Interleaved" - Represents the percentage of interleaving, with scales varying for each chart:
* Paired CE: 0, 27, 45, 63, 72
* Interleaved CE: 18, 27, 45, 63, 72
* Text CE: 0, 18, 45, 63, 72, 90
* **Y-axis:** Cross-Entropy Loss - The vertical axis represents the CE loss value. The scales vary for each chart:
* Paired CE: Approximately 2.3 to 2.7
* Interleaved CE: Approximately 2.55 to 2.8
* Text CE: Approximately 2.85 to 3.1
* **Legend:** Located at the bottom-center of the image.
* Blue Line ('L'): The late-fusion model.
* Green Line ('E (Text)'): Early fusion matched to the late-fusion text model size.
* Orange Line ('E (FLOPs)'): Early fusion matched in FLOPs.
* Red Line ('E (Params)'): Early fusion matched in total parameters.
* **Chart Titles:**
* Top-Left: "Paired CE"
* Top-Center: "Interleaved CE"
* Top-Right: "Text CE"
### Detailed Analysis or Content Details
**Paired CE Chart:**
* 'L' (Blue): Starts at approximately 2.3, increases to approximately 2.65 at 63% interleaving, and decreases slightly to approximately 2.6 at 72% interleaving.
* 'E (Text)' (Green): Starts at approximately 2.35, increases to approximately 2.7 at 63% interleaving, and decreases slightly to approximately 2.65 at 72% interleaving.
* 'E (FLOPs)' (Orange): Starts at approximately 2.3, increases to approximately 2.6 at 63% interleaving, and decreases slightly to approximately 2.55 at 72% interleaving.
* 'E (Params)' (Red): Starts at approximately 2.35, increases to approximately 2.7 at 63% interleaving, and decreases slightly to approximately 2.65 at 72% interleaving.
**Interleaved CE Chart:**
* 'L' (Blue): Starts at approximately 2.75, decreases to approximately 2.6 at 72% interleaving.
* 'E (Text)' (Green): Starts at approximately 2.78, decreases to approximately 2.62 at 72% interleaving.
* 'E (FLOPs)' (Orange): Starts at approximately 2.75, decreases to approximately 2.6 at 72% interleaving.
* 'E (Params)' (Red): Starts at approximately 2.77, decreases to approximately 2.6 at 72% interleaving.
**Text CE Chart:**
* 'L' (Blue): Starts at approximately 3.0, decreases to approximately 2.85 at 90% interleaving.
* 'E (Text)' (Green): Starts at approximately 3.05, decreases to approximately 2.88 at 90% interleaving.
* 'E (FLOPs)' (Orange): Starts at approximately 3.0, decreases to approximately 2.85 at 90% interleaving.
* 'E (Params)' (Red): Starts at approximately 3.05, decreases to approximately 2.9 at 90% interleaving.
### Key Observations
* In the "Paired CE" chart, all lines exhibit a similar upward trend up to 63% interleaving, followed by a slight decrease.
* In the "Interleaved CE" and "Text CE" charts, all lines consistently decrease as the percentage of interleaved data increases, indicating a reduction in cross-entropy loss.
* The "Text CE" chart generally shows higher loss values compared to the other two charts.
* The lines representing 'L', 'E (Text)', 'E (FLOPs)', and 'E (Params)' are relatively close to each other within each chart, suggesting a consistent relationship between these evaluation components.
### Interpretation
The charts show how the share of interleaved data in the training mixture affects the CE on each evaluation set. Increasing the interleaved fraction raises the Paired CE (up to about 63%, with a slight dip afterwards) while steadily lowering the Interleaved CE and Text CE: the models improve on the data types they see more of, at the cost of the paired (image-caption) data. The four series remain close to one another within each chart, so the comparison between late fusion ('L') and the early-fusion configurations is stable across mixtures.
</details>
Figure 18. Early vs late fusion: changing the training mixture and early-fusion configuration. We vary the training mixtures and plot the final training loss for different configurations of early-fusion models. For the same number of total parameters, early fusion consistently outperforms late fusion.
<details>
<summary>Image 18 Details</summary>

### Visual Description
## Line Chart: Cross-Entropy Loss vs. Tokens Seen for Different Models
### Overview
This image presents three line charts, each depicting the cross-entropy (CE) loss as a function of "Tokens seen". Each chart represents a different data type: "Image-Caption CE", "Interleaved CE", and "Text CE". Each chart contains data for eight different models, four late-fusion and four early-fusion, distinguished by name and size (in billions of parameters - B). The charts aim to compare the learning curves of these models across different data modalities.
### Components/Axes
* **X-axis:** "Tokens seen", ranging from 0.05 to 0.4, with markers at 0.05, 0.1, 0.2, 0.3, and 0.4.
* **Y-axis:** "Cross-Entropy Loss" (CE). The scales vary for each chart:
* Image-Caption CE: approximately 2.2 to 2.9
* Interleaved CE: approximately 2.5 to 3.1
* Text CE: approximately 2.7 to 3.4
* **Legend:** Located at the bottom of each chart, identifying the eight models:
* Late-0.555B (light blue)
* Late-1.14B (blue)
* Late-2.320B (dark blue)
* Late-3.354B (teal)
* Early-0.464B (light orange)
* Early-0.932B (orange)
* Early-1.672B (dark orange)
* Early-3.54B (red)
### Detailed Analysis or Content Details
**Image-Caption CE Chart:**
* **Late-0.555B (light blue):** Starts at approximately 2.85 and decreases to approximately 2.45.
* **Late-1.14B (blue):** Starts at approximately 2.75 and decreases to approximately 2.35.
* **Late-2.320B (dark blue):** Starts at approximately 2.65 and decreases to approximately 2.25.
* **Late-3.354B (teal):** Starts at approximately 2.6 and decreases to approximately 2.2.
* **Early-0.464B (light orange):** Starts at approximately 2.8 and decreases to approximately 2.5.
* **Early-0.932B (orange):** Starts at approximately 2.7 and decreases to approximately 2.4.
* **Early-1.672B (dark orange):** Starts at approximately 2.6 and decreases to approximately 2.3.
* **Early-3.54B (red):** Starts at approximately 2.55 and decreases to approximately 2.2.
**Interleaved CE Chart:**
* **Late-0.555B (light blue):** Starts at approximately 2.95 and decreases to approximately 2.65.
* **Late-1.14B (blue):** Starts at approximately 2.9 and decreases to approximately 2.6.
* **Late-2.320B (dark blue):** Starts at approximately 2.85 and decreases to approximately 2.55.
* **Late-3.354B (teal):** Starts at approximately 2.8 and decreases to approximately 2.5.
* **Early-0.464B (light orange):** Starts at approximately 2.9 and decreases to approximately 2.6.
* **Early-0.932B (orange):** Starts at approximately 2.85 and decreases to approximately 2.55.
* **Early-1.672B (dark orange):** Starts at approximately 2.8 and decreases to approximately 2.5.
* **Early-3.54B (red):** Starts at approximately 2.75 and decreases to approximately 2.45.
**Text CE Chart:**
* **Late-0.555B (light blue):** Starts at approximately 3.15 and decreases to approximately 2.9.
* **Late-1.14B (blue):** Starts at approximately 3.1 and decreases to approximately 2.85.
* **Late-2.320B (dark blue):** Starts at approximately 3.05 and decreases to approximately 2.8.
* **Late-3.354B (teal):** Starts at approximately 3.0 and decreases to approximately 2.75.
* **Early-0.464B (light orange):** Starts at approximately 3.2 and decreases to approximately 2.95.
* **Early-0.932B (orange):** Starts at approximately 3.15 and decreases to approximately 2.9.
* **Early-1.672B (dark orange):** Starts at approximately 3.1 and decreases to approximately 2.85.
* **Early-3.54B (red):** Starts at approximately 3.05 and decreases to approximately 2.8.
### Key Observations
* In all three charts, the loss generally decreases as the number of tokens seen increases, indicating learning.
* Larger models (higher B values) consistently exhibit lower loss values across all data types.
* The "Late" models generally perform better than the "Early" models for a given size.
* The rate of loss decrease appears to slow down as the number of tokens seen increases, suggesting diminishing returns.
* The "Text CE" chart shows the highest overall loss values compared to the other two charts.
### Interpretation
The charts show the impact of model size and data type on cross-entropy loss. Loss decreases with tokens seen for every model, confirming learning, and larger models reach lower loss across all data types, with "Text CE" the highest overall. The "Late" and "Early" curves lie close together at comparable sizes, and the differences shrink as the models scale, consistent with the fixed 300M vision encoder becoming a smaller fraction of the late-fusion budget. The flattening of the curves at higher token counts suggests diminishing returns from further training at these scales.
</details>
Figure 19. Early vs late fusion: scaling training FLOPs while fixing the vision encoder size. We compare early- and late-fusion models when scaling both the number of training tokens and the model sizes. For late-fusion models, we fix the vision encoder size (300M) and scale the text model (250M, 834M, 2B, 3B). The gap between early and late fusion gets tighter when scaling the text model.
Figure 20. Early vs late fusion with different context lengths.
<details>
<summary>Image 19 Details</summary>

### Visual Description
## Line Chart: Interleaved CE vs. Tokens Seen
### Overview
This image presents a line chart illustrating the relationship between the interleaved cross-entropy (CE) loss and the number of tokens seen (in trillions, T), for four configurations: Late-1k, Late-2k, Early-1k, and Early-2k, where the suffix denotes the context length. The chart displays a decreasing trend for all configurations as training progresses.
### Components/Axes
* **X-axis:** Tokens seen, ranging from 0.0T to 0.3T, with markers at 0.0T, 0.2T, and 0.3T.
* **Y-axis:** Interleaved CE, ranging from approximately 2.6 to 2.8, with markers at 2.6, 2.65, 2.7, 2.75, and 2.8.
* **Legend:** Located in the top-right corner, identifying the four data series:
* Late-1k (Blue, square markers)
* Late-2k (Blue, circle markers)
* Early-1k (Orange, square markers)
* Early-2k (Orange, circle markers)
### Detailed Analysis
* **Late-1k (Blue squares):** The line slopes downward consistently.
* At 0.0T: Approximately 2.73
* At 0.2T: Approximately 2.67
* At 0.3T: Approximately 2.62
* **Late-2k (Blue circles):** The line slopes downward consistently.
* At 0.0T: Approximately 2.78
* At 0.2T: Approximately 2.72
* At 0.3T: Approximately 2.67
* **Early-1k (Orange squares):** The line slopes downward consistently, and is the lowest of the four lines.
* At 0.0T: Approximately 2.71
* At 0.2T: Approximately 2.65
* At 0.3T: Approximately 2.60
* **Early-2k (Orange circles):** The line slopes downward consistently.
* At 0.0T: Approximately 2.76
* At 0.2T: Approximately 2.70
* At 0.3T: Approximately 2.66
### Key Observations
* All four configurations exhibit a decrease in interleaved CE as the number of tokens seen increases.
* The "Late" configurations generally have higher Interleaved CE values than the "Early" configurations at all time points.
* The "2k" configurations generally have higher Interleaved CE values than the "1k" configurations at all time points.
* The Early-1k configuration consistently shows the lowest Interleaved CE values.
### Interpretation
The loss decreases with training for all configurations, and the early-fusion runs reach a lower interleaved CE than the late-fusion runs at both context lengths. The 2k-context runs sit at a higher interleaved CE than the 1k runs for both architectures, but the early-vs-late gap is essentially the same at 1k and 2k. This supports the observation that context length does not significantly affect the comparison between late and early fusion.
</details>
Figure 21. Early vs late fusion when initializing the encoder and decoder. Early fusion can match the performance of late-fusion models when trained for longer. However, the gap is bigger on image-caption data.
<details>
<summary>Image 20 Details</summary>

### Visual Description
## Line Charts: Training Loss vs. Tokens Seen for Different Initialization Strategies
### Overview
The image presents three separate line charts, each depicting the relationship between a loss metric (CE - Cross-Entropy) and the number of tokens seen during training. Each chart corresponds to a different training condition: "Image-Caption CE", "Interleaved CE", and "Text CE". Within each chart, two lines represent different initialization strategies: "Late-init" (blue) and "Early-Init" (brown). The x-axis represents the number of tokens seen, with markers at 100B, 400B, and 1T (1 Trillion). The y-axis represents the CE loss value.
### Components/Axes
* **X-axis Title (all charts):** "Tokens seen"
* **X-axis Markers (all charts):** 100B, 400B, 1T
* **Y-axis Titles:**
* "Image-Caption CE" chart: CE loss (approximately 2.2 to 2.5)
* "Interleaved CE" chart: CE loss (approximately 2.5 to 2.7)
* "Text CE" chart: CE loss (approximately 2.7 to 2.85)
* **Legend (bottom-center):**
* "Late-init" - Blue line with circular markers
* "Early-Init" - Brown line with circular markers
### Detailed Analysis or Content Details
**Image-Caption CE Chart:**
* The "Early-Init" line (brown) starts at approximately 2.48 at 100B tokens, decreases to approximately 2.32 at 400B tokens, and then to approximately 2.28 at 1T tokens. This line shows a decreasing trend.
* The "Late-init" line (blue) starts at approximately 2.33 at 100B tokens, decreases to approximately 2.25 at 400B tokens, and then to approximately 2.23 at 1T tokens. This line also shows a decreasing trend, and remains below the "Early-Init" line throughout.
**Interleaved CE Chart:**
* The "Early-Init" line (brown) starts at approximately 2.63 at 100B tokens, decreases to approximately 2.55 at 400B tokens, and then to approximately 2.51 at 1T tokens. This line shows a decreasing trend.
* The "Late-init" line (blue) starts at approximately 2.58 at 100B tokens, decreases to approximately 2.52 at 400B tokens, and then to approximately 2.50 at 1T tokens. This line also shows a decreasing trend, and remains below the "Early-Init" line throughout.
**Text CE Chart:**
* The "Early-Init" line (brown) starts at approximately 2.82 at 100B tokens, decreases to approximately 2.76 at 400B tokens, and then to approximately 2.73 at 1T tokens. This line shows a decreasing trend.
* The "Late-init" line (blue) starts at approximately 2.80 at 100B tokens, decreases to approximately 2.75 at 400B tokens, and then to approximately 2.73 at 1T tokens. This line also shows a decreasing trend and remains at or below the "Early-Init" line, with the two converging by 1T tokens.
### Key Observations
* In all three charts, both "Early-Init" and "Late-init" lines exhibit a decreasing trend as the number of tokens seen increases, indicating that the loss is decreasing with more training data.
* The "Late-init" strategy consistently achieves lower loss values than the "Early-Init" strategy across all three training conditions.
* The rate of loss decrease appears to slow down as the number of tokens seen increases, suggesting diminishing returns from further training.
* The "Image-Caption CE" chart has the lowest overall loss values, while the "Text CE" chart has the highest.
### Interpretation
The "Late-init" strategy achieves lower loss than "Early-Init" across all three conditions at these budgets, but the gap shrinks with training: on text data the two converge by 1T tokens, while on image-caption data the gap remains largest. This matches the observation that early fusion can catch up on text and interleaved data when trained for longer, whereas the image-caption gap is harder to close. The "Text CE" condition exhibits the highest loss values overall, and the slowing rate of decrease at higher token counts suggests diminishing returns from further training.
</details>
## C. Scaling laws
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \tag{1}$$
Following [26], we determine the parameters that minimize the following objective across all our runs i :
$$\min_{a,\,b,\,e,\,\alpha,\,\beta} \; \sum_{\text{runs } i} \operatorname{Huber}_{\delta}\!\Big(\operatorname{LSE}\big(a - \alpha \log N_i,\; b - \beta \log D_i,\; e\big) - \log L_i\Big)$$

where $a = \log A$, $b = \log B$, and $e = \log E$.
We perform this optimization across various initialization ranges and select the parameters that achieve the lowest loss across all initializations. Specifically, our grid search spans {0, 0.5, 2.5} for α and β, {0, 5, 10, ..., 30} for a and b, and {-1, -0.5, 0.5, 1} for e. We use the L-BFGS algorithm with δ = 1e-3.
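The fitting procedure above can be sketched as follows, using SciPy's L-BFGS over a grid of initializations. The Huber/log-sum-exp objective in log space (with a = log A, b = log B, e = log E) follows [26]; the coarsened grid and the synthetic data are assumptions for illustration, not the paper's exact setup or coefficients.

```python
import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber, logsumexp

def objective(theta, logN, logD, logL, delta=1e-3):
    """Huber loss (parameter delta) between the log of the predicted law
    L(N, D) = E + A/N**alpha + B/D**beta and the observed log losses,
    with a = log A, b = log B, e = log E (log-sum-exp keeps this stable)."""
    a, b, e, alpha, beta = theta
    pred = logsumexp(np.stack([a - alpha * logN,
                               b - beta * logD,
                               np.full_like(logN, e)]), axis=0)
    return huber(delta, pred - logL).sum()

def fit_scaling_law(N, D, L, delta=1e-3):
    logN, logD, logL = np.log(N), np.log(D), np.log(L)
    best = None
    # Grid of initializations (coarsened here to keep the demo fast); keep the
    # solution with the lowest objective across all starting points.
    for x0 in itertools.product([0, 15, 30], [0, 15, 30],
                                [-1, -0.5, 0.5, 1], [0, 0.5, 2.5], [0, 0.5, 2.5]):
        res = minimize(objective, x0=np.array(x0, float),
                       args=(logN, logD, logL, delta), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    a, b, e, alpha, beta = best.x
    return {"A": np.exp(a), "B": np.exp(b), "E": np.exp(e),
            "alpha": alpha, "beta": beta}

# Demo on synthetic runs generated from known (made-up) coefficients.
Ng, Dg = np.meshgrid([1e8, 3e8, 1e9, 3e9, 1e10], [1e9, 1e10, 1e11, 1e12])
N, D = Ng.ravel(), Dg.ravel()
L = 1.8 + 50.0 / N**0.3 + 80.0 / D**0.3
fit = fit_scaling_law(N, D, L)
```

On noiseless synthetic data like this, the recovered law reproduces the observed losses almost exactly; on real runs, the multi-start grid guards against poor local minima of the Huber objective.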
$$N^{*} = K_a \, C^{a}, \qquad D^{*} = K_b \, C^{b}$$
While these equations have a closed-form solution [26] for early-fusion models that can be derived from Eq 1, this is not the case for late-fusion models without specifying either the vision encoder or text model size. To ensure a fair comparison, we derive these equations for both models by performing linear regression in log space. We found that the regression results are very close to the coefficients found with the closed-form derivation (Table 9). For instance, to derive N = K_a C^a, given a FLOP budget C and a set of linearly spaced token counts D_i ranging from 10B to 600B, we compute the model size for each D_i as N_i = C / (6 D_i) for early fusion and N_i = C / (6 D_i) + 0.483 N_v for late fusion (for the 45-45-10 mixture, D_v = 0.544 D, thus C = 6 D (0.544 N_v + N_t)). We then apply Eq 1 to obtain the loss for each model size and select the N that achieves the minimum loss. We repeat this for all FLOP values corresponding to our runs, resulting in a set of points (C, N_opt) that we use to regress a and K_a. We follow a similar procedure to find b and d. For late-fusion models, we regress a linear model to determine N_v given N. Notably, even though we maintain a fixed width ratio for late-fusion models, this approach is more accurate, as embedding layers prevent a strictly fixed ratio between text and vision model sizes. We present the regression results in Figure 22.
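The sweep-and-regress loop above can be sketched as follows for the early-fusion case. The scaling-law coefficients below are illustrative placeholders (not the paper's fitted values), chosen so that the optimum stays inside the 10B-600B token sweep.

```python
import numpy as np

# Illustrative fitted law L(N, D) = E + A/N**alpha + B/D**beta
# (placeholder coefficients, not the paper's values).
E, A, B, alpha, beta = 1.8, 50.0, 150.0, 0.3, 0.3

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_N(C, num=2048):
    """For a FLOP budget C, sweep linearly spaced token counts D (10B..600B)
    and return the early-fusion model size N = C / (6 D) with minimum loss."""
    D = np.linspace(10e9, 600e9, num)
    N = C / (6 * D)
    return N[np.argmin(loss(N, D))]

# Regress N_opt = K_a * C^a in log space over a range of FLOP budgets.
Cs = np.logspace(20, 22, 16)
N_opt = np.array([optimal_N(C) for C in Cs])
a, logK = np.polyfit(np.log(Cs), np.log(N_opt), 1)
K_a = float(np.exp(logK))
# For this symmetric toy law (alpha == beta), the closed form gives a = beta/(alpha+beta) = 0.5.
```

For late fusion, the only change is the model-size bookkeeping inside the sweep (adding the vision-encoder correction term to N_i); the regression step is identical.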
Table 9. Scaling-law parameters for early fusion. Regressing the scaling-law coefficients yields results very close to those from the closed-form solution.
| Model | a | b | d | n | dn |
|-------------|---------|---------|---------|---------|----------|
| Closed form | 0.52649 | 0.47351 | 0.89938 | 1.11188 | -0.05298 |
| Regression | 0.52391 | 0.47534 | 0.90052 | 1.10224 | -0.04933 |
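As a quick sanity check on Table 9: with C = 6ND, the closed-form compute-optimal exponents of [26] satisfy a = β/(α+β) and b = α/(α+β), so a + b = 1 and the data-to-parameter ratio scales as D/N ∝ C^(b−a). Reading the `dn` column as that ratio's exponent (an assumption about the header), the closed-form row is internally consistent:

```python
# Closed-form row of Table 9 (early fusion).
a, b, dn = 0.52649, 0.47351, -0.05298

sum_ab = a + b      # expected: 1, since a = beta/(alpha+beta), b = alpha/(alpha+beta)
ddn = (b - a) - dn  # expected: 0, if dn is the exponent of D/N ∝ C^(b-a)
```

Both identities hold to the printed precision, which supports this reading of the table's columns.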
## C.3. Fitting L ∝ C^c
To determine the relationship between the final model loss and the compute budget C, we begin by interpolating the points corresponding to the same model size and computing the convex hull that traces the minimum loss achieved across all runs at each FLOP value. This results in a continuous mapping from FLOPs to the lowest loss. We consider a range of FLOPs, excluding very small values (≤ 3e19), and construct a dataset of (C, L) pairs for linearly spaced compute C. Using this data, we fit a linear relationship between L and C in log space and deduce the exponent c. We visualize the results in Figure 26.
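The envelope-and-regression step can be sketched as follows. A step-wise running minimum stands in for the interpolated convex hull, and the synthetic power-law frontier is illustrative only.

```python
import numpy as np

def lowest_loss_envelope(C, L, num=64):
    """Best (minimum) loss achieved by any run with compute <= each grid value."""
    order = np.argsort(C)
    C, L = np.asarray(C, float)[order], np.asarray(L, float)[order]
    running_min = np.minimum.accumulate(L)
    grid = np.logspace(np.log10(C[0]), np.log10(C[-1]), num)
    # Map each grid budget to the last run at or below it (clipped for safety).
    idx = np.clip(np.searchsorted(C, grid, side="right") - 1, 0, len(C) - 1)
    return grid, running_min[idx]

def fit_loss_exponent(C, L, c_min=3e19):
    """Fit L ∝ C^c on the lowest-loss envelope, excluding very small budgets."""
    grid, env = lowest_loss_envelope(C, L)
    keep = grid > c_min
    c, _ = np.polyfit(np.log(grid[keep]), np.log(env[keep]), 1)
    return c

# Demo on a synthetic frontier L = 5 * C^(-0.05):
C_runs = np.logspace(19, 22, 400)
L_runs = 5.0 * C_runs ** -0.05
c = fit_loss_exponent(C_runs, L_runs)
```

On this synthetic frontier the fitted exponent recovers −0.05 up to the envelope's discretization; on real runs the envelope filters out off-frontier models before the log-log fit.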
Figure 22. Regression results for the scaling-law coefficients. Our estimates of the scaling coefficients are close to the closed-form solution.
<details>
<summary>Image 21 Details</summary>

### Visual Description
## Charts: Scaling Relationships of N, D, and D/N with C
### Overview
The image presents six scatter plots arranged in a 3x2 grid, displaying scaling relationships between different variables (N, D, and D/N) and a variable 'C'. Each plot includes data points and a fitted line, along with a power-law relationship equation.
### Components/Axes
Each plot shares the following characteristics:
* **X-axis:** Labeled 'C', with a logarithmic scale ranging from approximately 10<sup>20</sup> to 10<sup>22</sup>.
* **Y-axis:** Each plot has a different Y-axis label and scale:
* Top-left: 'N', logarithmic scale (labeled as 1B and 10B).
* Top-right: 'N', logarithmic scale from approximately 10<sup>8</sup> to 10<sup>10</sup> (labeled as 100M and 1B).
* Middle-left: 'D', logarithmic scale from approximately 10<sup>8</sup> to 10<sup>10</sup> (labeled as 100B).
* Middle-right: 'D', logarithmic scale from approximately 10<sup>8</sup> to 10<sup>10</sup> (labeled as 100B).
* Bottom-left: 'D/N', logarithmic scale from approximately 10<sup>-9</sup> to 10<sup>-6</sup> (labeled as 10<sup>-8</sup> and 10<sup>-6</sup>).
* Bottom-right: 'D/N', logarithmic scale from approximately 10<sup>-9</sup> to 10<sup>-6</sup> (labeled as 10<sup>-8</sup> and 10<sup>-6</sup>).
* **Data Points:** Each plot contains numerous data points, colored either orange or blue.
* **Fitted Lines:** A solid line is fitted through each set of data points, also colored orange or blue.
* **Equations:** Each plot displays an equation of the form "Variable ∝ C<sup>exponent</sup>" above the plot.
### Detailed Analysis or Content Details
* **Top-Left Plot (N vs. C):** Orange; upward trend, N ∝ C<sup>0.526</sup>.
* **Top-Right Plot (N vs. C):** Blue; upward trend, N ∝ C<sup>0.694</sup>.
* **Middle-Left Plot (D vs. C):** Orange; upward trend, D ∝ C<sup>0.429</sup>.
* **Middle-Right Plot (D vs. C):** Blue; upward trend, D ∝ C<sup>0.462</sup>.
* **Bottom-Left Plot (D/N vs. C):** Orange; slight downward trend, D/N ∝ C<sup>-0.053</sup>.
* **Bottom-Right Plot (D/N vs. C):** Blue; slight downward trend, D/N ∝ C<sup>-0.076</sup>.
### Key Observations
* Both datasets (orange and blue) show positive correlations between N and C, and between D and C.
* Both datasets show negative correlations between D/N and C.
* The exponents in the power-law relationships differ slightly between the two datasets, suggesting different scaling behaviors.
* The scaling exponent for N is larger than that for D in both datasets.
### Interpretation
The plots demonstrate power-law scaling relationships between the variables N, D, and D/N with respect to C. The positive scaling of N and D with C suggests that as C increases, both N and D tend to increase as well. The negative scaling of D/N with C indicates that the ratio of D to N decreases as C increases. The differing exponents between the orange and blue datasets suggest that the underlying systems or processes generating these relationships are not identical. The fact that N scales more strongly with C than D suggests that N is more sensitive to changes in C than D. These relationships could be indicative of complex system behavior, such as self-organized criticality or fractal growth. The specific meaning of N, D, and C would require additional context to fully interpret.
</details>
## C.4. Scaling laws for different target data types
In Figure 27, we derive the scaling laws for different target data types. In general, we observe that the model learns image captioning faster than interleaved data, as indicated by the higher absolute value of the scaling exponent (e.g., 0.062 vs 0.046), despite using the same data ratio for captioning and interleaved data (45% each). Additionally, we find that the model learns more slowly on text-only data, likely due to the smaller amount of text-only data (10%). Across model configurations, we find that early fusion scales similarly to late fusion on image captioning but has a lower multiplicative constant (47.97 vs 49.99). For MoEs, the model learns faster but exhibits a higher multiplicative constant. On text and interleaved data, early- and late-fusion models scale similarly and achieve comparable
Figure 23. Observed vs predicted loss. We visualize the loss predicted by our scaling laws (Eq 1) and the actual loss achieved by each run.
<details>
<summary>Image 22 Details</summary>

Two side-by-side scatter plots of predicted loss vs. observed loss (both axes roughly 2.5 to 4.0), one per architecture (orange and blue color families), with one series per model size from 0.275B to 3.354B parameters. A dashed black line marks perfect prediction (predicted = observed). Across sizes, points lie close to the dashed line, with the largest deviations at higher observed loss values; both plots show the same qualitative trend.
</details>
performance. However, MoEs demonstrate better overall performance while learning slightly more slowly.
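The per-data-type compute fits discussed above (e.g., L = 49.99·C<sup>-0.062</sup> for captioning) can be obtained by linear regression in log-log space. A minimal sketch; the `(FLOPs, loss)` points below are fabricated for illustration and generated from an exact power law, so the recovered coefficients match it:

```python
import numpy as np

def fit_power_law(flops, loss):
    """Fit L = a * C^(-b) by linear regression in log-log space."""
    slope, log_a = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return np.exp(log_a), -slope

# Hypothetical compute-optimal (FLOPs, validation CE) points for one data type.
flops = np.array([1e19, 1e20, 1e21, 1e22])
loss = 50.0 * flops ** -0.062          # noiseless, so the fit recovers a=50, b=0.062
a, b = fit_power_law(flops, loss)
```

With real runs, `loss` would be the lowest validation cross-entropy observed at each compute budget, and the fitted exponent `b` is what the figures report per data type.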
## C.5. Scaling laws for different training mixtures
We investigate how the scaling laws change when modifying the training mixtures. Specifically, we vary the ratio of image-caption, interleaved, and text-only data and report the results in Figure 28. Overall, we observe similar scaling trends, with only minor changes in the scaling coefficients. Upon closer analysis, we find that increasing the ratio of a particular data type in the training mixture leads to a corresponding increase in its scaling exponent. For instance, increasing the ratio of image captions from 30% to 40% raises the absolute value of the exponent from 0.056 to 0.061. However, for text-only data, we do not observe significant changes in the scaling coefficients when varying its proportion in the training mixture.
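A training mixture such as 45-45-10 can be read as sampling probabilities over data types. The sketch below illustrates this interpretation; the sampling function and names are ours, not the paper's pipeline:

```python
import numpy as np

# Training mixture as sampling probabilities over data types (45-45-10).
MIXTURE = {"caption": 0.45, "interleaved": 0.45, "text": 0.10}

def sample_batch_types(mixture, n_batches, seed=0):
    """Draw the data type of each training batch according to the mixture ratios."""
    rng = np.random.default_rng(seed)
    types = list(mixture)
    probs = [mixture[t] for t in types]
    return rng.choice(types, size=n_batches, p=probs)

types = sample_batch_types(MIXTURE, 10_000)
# Empirical fractions converge to the mixture ratios as n_batches grows.
frac = {t: float(np.mean(types == t)) for t in MIXTURE}
```

Changing the mixture only changes these probabilities; the observation above is that the per-type scaling exponent tracks the probability assigned to that type.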
Table 10. Scaling laws prediction errors. We report the mean squared error (MSE), R², and mean absolute error (MAE) of the loss prediction for held-in and held-out (8B model) data.
| Data | MSE | R² | MAE (%) |
|-------------|--------|--------|-----------|
| Held-in | 0.0029 | 0.9807 | 0.8608 |
| Held-out | 0.0004 | 0.9682 | 0.553 |
Table 11. Scaling laws sensitivity. We report the mean and standard deviation after bootstrapping with 100 iterations.
| Model | E | α | β | a | b | d |
|---------|---------|---------|---------|---------|---------|---------|
| Avg | 1.80922 | 0.29842 | 0.33209 | 0.54302 | 0.48301 | 0.92375 |
| Std | 0.33811 | 0.10101 | 0.02892 | 0.08813 | 0.05787 | 0.23296 |
## C.6. Scaling laws evaluation
For each model size and number of training tokens, we compute the loss using the estimated functional form in Eq 1 and compare it to the actual loss observed in our runs. Figure 23, Figure 24, and Table 10 visualize these comparisons, showing that our estimation is highly accurate, particularly for lower loss values and larger FLOPs. We also assess our scaling laws in an extrapolation setting, predicting performance beyond the model sizes used for fitting. Notably, our approach estimates the performance of an 8B model with reasonable accuracy.
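The comparison above can be sketched as follows. The coefficients are the dense L(N, D) fit reported in Table 12; the (model size, tokens, observed loss) triples are fabricated placeholders, and the error metrics mirror those of Table 10:

```python
import numpy as np

def predicted_loss(N, D, E=2.158, A=381773, B=4659, alpha=0.710, beta=0.372):
    """Eq 1: L(N, D) = E + A / N^alpha + B / D^beta (dense fit from Table 12)."""
    return E + A / N**alpha + B / D**beta

# Hypothetical (params, training tokens, observed loss) triples from runs.
runs = [
    (0.275e9, 100e9, 3.05),
    (0.932e9, 200e9, 2.85),
    (3.354e9, 400e9, 2.65),
]
obs = np.array([l for _, _, l in runs])
pred = np.array([predicted_loss(n, d) for n, d, _ in runs])

mse = np.mean((pred - obs) ** 2)                    # mean squared error
mae_pct = np.mean(np.abs(pred - obs) / obs) * 100   # relative MAE, in percent
r2 = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - np.mean(obs)) ** 2)
```

Evaluating the fitted form on held-out runs (e.g., the 8B model) is the extrapolation test described in the text.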
Additionally, we conduct a sensitivity analysis using bootstrapping. Specifically, we sample P points with replacement (P being the total number of trained models) and re-estimate the scaling law coefficients. This process is repeated 100 times, and we report the mean and standard deviation of each coefficient. Table 11 shows that our estimation is more precise for β than for α, primarily due to the smaller number of model sizes relative to the number of different token counts used to derive the scaling laws.
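The bootstrap procedure can be sketched as follows. Synthetic runs stand in for the real training logs, `eq1` mirrors the functional form of Eq 1, and the initial guess `p0` is ours; a non-convergent resample is simply skipped:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic runs: (N params, D tokens, loss) generated from Eq 1 plus noise.
rng = np.random.default_rng(0)
N = rng.uniform(3e8, 3e9, size=60)
D = rng.uniform(5e10, 5e11, size=60)
loss = 2.1 + 4e5 / N**0.7 + 5e3 / D**0.37 + rng.normal(0, 0.01, size=60)

def eq1(ND, E, A, B, alpha, beta):
    n, d = ND
    return E + A / n**alpha + B / d**beta

def fit(idx):
    p0 = (2.0, 1e5, 1e3, 0.6, 0.4)  # rough initial guess
    popt, _ = curve_fit(eq1, (N[idx], D[idx]), loss[idx], p0=p0, maxfev=20000)
    return popt

# Bootstrap: resample the P runs with replacement and re-fit, 100 times.
P = len(loss)
fits = []
for _ in range(100):
    idx = rng.integers(0, P, size=P)
    try:
        fits.append(fit(idx))
    except RuntimeError:  # occasional non-convergent resample
        continue
samples = np.array(fits)
mean, std = samples.mean(axis=0), samples.std(axis=0)
```

The per-coefficient `std` is the sensitivity measure reported in Table 11; a larger spread for α than β reflects the limited number of distinct model sizes.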
## C.7. Scaling laws for sparse NMMs.
Similar to dense models, we fit a parametric loss function (Eq 1) to predict the loss of sparse NMMs based on the number of parameters and training tokens, replacing the total parameter count with the number of active parameters. While incorporating sparsity is standard when deriving scaling laws for MoEs [2, 33, 74], we focus on deriving scaling laws specific to the sparsity level used in our MoE setup. This yields coefficients that are implicitly conditioned on the sparsity configuration.
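Replacing the total parameter count with the active count can be made concrete as below; a sketch assuming a standard transformer MoE where only the FFN is replicated across experts (the sizes are illustrative, not the paper's):

```python
def moe_active_params(n_shared, n_expert, num_experts, top_k):
    """Parameter counts for a sparse model.

    n_shared: parameters outside the expert FFNs (attention, embeddings, ...)
    n_expert: parameters of a single expert FFN (summed over layers)
    """
    total = n_shared + num_experts * n_expert
    active = n_shared + top_k * n_expert  # only top-k experts run per token
    return total, active

# Top-1 routing with 8 experts, as in the paper's MoE setup; sizes are made up.
total, active = moe_active_params(n_shared=300e6, n_expert=100e6,
                                  num_experts=8, top_k=1)
# The sparse scaling laws use `active` in place of N when fitting Eq 1.
```

With top-1 routing, `active` stays fixed as experts are added, which is why conditioning the fit on the sparsity configuration matters.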
We also experiment with a sparsity-aware formulation of the scaling law as proposed in [2], and observe consistent
<details>
<summary>Image 23 Details</summary>

Scatter plot of predicted loss vs. observed loss (both axes roughly 2.4 to 4.0) for model sizes 275M through 3.35B (circles) and a held-out 8.13B model (black asterisks). A dashed black line marks perfect prediction (predicted = observed), and two highlighted boxes separate the smaller- and larger-model regions. The 8.13B points lie along the dashed line, showing that the scaling-law fit extrapolates to a model larger than those used for fitting.
</details>
trends (Table 12). In particular, the exponents associated with model size (α) are substantially larger than those for training tokens (β), reinforcing the importance of scaling model size in sparse architectures. Additionally, we observe that the terms governing the scaling of active parameters decompose into two components.
Figure 24. Observed vs predicted loss. We visualize the loss predicted by our scaling laws (Eq 1) and the actual loss achieved by each run. We can reliably predict the performance of models larger (8B params) than those used to fit the scaling laws.
## D. Discussion and Limitations
Scaling laws for multimodal data mixtures. Our scaling laws study spans different model configurations and training mixtures. While the results suggest that the scaling law coefficients remain largely consistent across mixtures, a broader exploration of mixture variations is needed to validate this observation and to establish a unified scaling law that accounts for this factor.
Scaling laws and performance on downstream tasks. Similar to previous scaling law studies, our analysis focuses on pretraining performance as measured by the validation loss. However, the extent to which these findings translate to downstream performance remains an open question and requires further investigation.
Extrapolation to larger scales. The accuracy of scaling law predictions improves with increasing FLOPs (Appendix C). Furthermore, we validate our laws when extrapolating to larger model sizes (Appendix C.6). However, whether these laws can be reliably extrapolated to extremely large model sizes remains an open question.
High resolution and early-fusion models. Training early-fusion models with high-resolution inputs leads to a significant increase in vision tokens. While pooling techniques have been widely adopted for late-fusion models, alternative approaches may be necessary for early fusion. Given the similarity of early-fusion models to LLMs, it appears that techniques for extending context length could be beneficial.
Scaling laws for multimodal MoE models. For MoEs, we consider only a single configuration (top-1 routing with 8 experts). We found this configuration to work reasonably well in our setup, and we follow a standard MoE implementation. However, the findings may vary when further optimizing the MoE architecture or exploring different load-balancing schemes, routing strategies, or expert implementations.
## E. Mixture of experts and modality-specific specialization
## E.1. MoEs configuration
We experiment with different MoE configurations by varying the number of experts and the top-k value. We report a sample of these experiments in Table 13.
## E.2. MoEs specialization
Figure 25. Modality-specific specialization. We visualize expert specialization for the text and image modalities. Models are evaluated on Obelics.
<details>
<summary>Image 24 Details</summary>

Line chart of modality specialization (0 to 15) vs. layer index (0 to 20) with two series: Text (orange) and Image (teal). Both series start high at the first layers (Image ≈ 15, Text ≈ 8), drop sharply over the early layers, stay low through the middle layers, and rise again toward the last layers. The Image series remains above the Text series throughout.
</details>
We investigate multimodal specialization in MoE architectures. We compute a specialization score as the average difference between the number of text/image tokens assigned to each expert and a uniform assignment (1/E). Additionally, we visualize the normalized number of text and image tokens assigned to each expert across layers. Figure 25 shows clear modality-specific experts, particularly in the early layers. Furthermore, the specialization score decreases with depth but rises again in the very last layers. This suggests that the early and final layers require more modality specialization than the middle layers. Additionally, we observe several experts shared between the text and image modalities, a phenomenon not present in hard-routed or predefined modality-specific experts.
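The specialization score described above can be sketched as follows; a minimal illustration that works on per-expert routing fractions, with fabricated token counts (the paper's exact normalization may differ):

```python
import numpy as np

def specialization_score(text_counts, image_counts):
    """Average deviation of each modality's expert assignment from uniform (1/E).

    text_counts, image_counts: tokens routed to each of the E experts.
    """
    E = len(text_counts)
    text_frac = np.asarray(text_counts) / np.sum(text_counts)
    image_frac = np.asarray(image_counts) / np.sum(image_counts)
    uniform = 1.0 / E
    # Mean absolute deviation from uniform routing, averaged over modalities.
    return 0.5 * (np.mean(np.abs(text_frac - uniform))
                  + np.mean(np.abs(image_frac - uniform)))

# 8 experts; text and image tokens concentrate on (mostly) disjoint experts.
specialized = specialization_score([900, 50, 50, 0, 0, 0, 0, 0],
                                   [0, 0, 0, 0, 0, 50, 50, 900])
balanced = specialization_score([125] * 8, [125] * 8)
```

A score of zero means routing ignores modality entirely; computing it per layer yields the depth profile shown in Figure 25.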
L(N,D) = E + A/N<sup>α</sup> + B/D<sup>β</sup> (Eq 1) vs. L(N,D,S) = A/N<sup>α</sup> + B/D<sup>β</sup> + C/(1−S)<sup>λ</sup> + d/((1−S)<sup>δ</sup>·N<sup>γ</sup>) + E

| Model | E | A | B | α | β | λ | δ | γ | C | d |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| L(N,D) (Eq 1) | 2.158 | 381773 | 4659 | 0.710 | 0.372 | - | - | - | - | - |
| L(N,D,S) [2] | 1.0788 | 1 | 4660 | 0.5890 | 0.3720 | 0.2 | 0.2 | 0.70956 | 1.0788 | 381475 |
Table 12. Scaling laws for sparse native multimodal models .
Table 13. SFT results with different MoEs configurations.
| Configuration | AVG (Acc.) | VQAv2 (Acc.) | TextVQA (Acc.) | OKVQA (Acc.) | GQA (Acc.) | VizWiz (Acc.) | COCO (CIDEr) | TextCaps (CIDEr) |
|-----------------------|------------|------------|------------|------------|------------|------------|---------|----------|
| 4-E-top-1 | 40.0552 | 64.068 | 14.284 | 41.948 | 61.46 | 18.516 | 62.201 | 34.08 |
| 8-E-top-1 | 41.6934 | 65.684 | 17.55 | 42.908 | 63.26 | 19.065 | 67.877 | 39.63 |
| 8-E-top-2 | 42.8546 | 66.466 | 19.162 | 45.344 | 63.94 | 19.361 | 65.988 | 41.649 |
| 8-E-top-2 finegrained | 39.904 | 62.76 | 15.58 | 41.88 | 61.6 | 17.7 | 57.52 | 35.42 |
Figure 26. Scaling laws for native multimodal models. From left to right: late fusion (dense), early fusion (dense), and early-fusion MoEs. The scaling exponents are very close for all models. However, MoEs lead to an overall lower loss (smaller multiplicative constant) and take longer to saturate.
<details>
<summary>Image 25 Details</summary>

Three side-by-side charts of validation loss (≈2.5 to 4.5) vs. FLOPs (log scale, 10<sup>19</sup> to 10<sup>22</sup>) for the 45-45-10 training mixture, with one line per model size (roughly 0.3B to 3.7B parameters). In every chart, loss decreases as compute increases, with diminishing returns at high FLOPs. The fitted compute-optimal power laws are L = 29.923C<sup>-0.0494</sup> (late fusion), L = 29.574C<sup>-0.0492</sup> (early fusion), and L = 27.086C<sup>-0.048</sup> (early-fusion MoE): nearly identical exponents, with a smaller multiplicative constant for MoEs.
</details>
Figure 27. Scaling laws for native multimodal models. From top to bottom: late-fusion (dense), early-fusion (dense) and early-fusion MoEs. From left to right: cross-entropy on the validation set of image-caption, interleaved and text-only data.
<details>
<summary>Image 26 Details</summary>

### Visual Description
## Charts: Performance Metrics vs. FLOPs
### Overview
The image presents a 3x3 grid of charts plotting cross-entropy (Image-Caption CE, Interleaved CE, and Text CE) against FLOPs for the "45-45-10" training mixture. Each chart contains one line per model size, identified by the legend at the bottom of the image, and evaluates the trade-off between compute (FLOPs) and loss (CE).
### Components/Axes
Each chart shares the following components:
* **X-axis:** FLOPs, labeled "FLOPs", with a logarithmic scale ranging from 10<sup>19</sup> to 10<sup>22</sup>.
* **Y-axis:** Cross Entropy (CE), labeled "Image-Caption CE", "Interleaved CE", or "Text CE" depending on the chart. The scale ranges from approximately 2 to 4.5.
* **Title:** Each chart has a title indicating the configuration: "45-45-10".
* **Legend:** Located at the bottom of the image, the legend identifies different model sizes with corresponding colors: 0.289B (green), 0.494B (blue), 1B (orange), 1.748B (cyan), 2.430B (magenta), 3.714B (red), 0.725B (dark green), 0.646B (dark blue), 0.932B (dark orange), 1.627B (dark cyan), 2.280B (dark magenta), 3.354B (dark red).
* **L Value:** Each chart displays a fitted expression of the form L = aC<sup>b</sup> near the top-right corner: a power-law scaling fit relating the validation loss L to the training compute C.
### Detailed Analysis or Content Details
**Chart 1 (Top-Left): Image-Caption CE vs. FLOPs**
* **Trend:** All lines slope downward, indicating that as FLOPs increase, Image-Caption CE decreases (performance improves).
* **Data Points (approximate):**
* 0.289B (Green): Starts at CE ≈ 3.9, ends at CE ≈ 2.6 at 10<sup>22</sup> FLOPs.
* 0.494B (Blue): Starts at CE ≈ 3.8, ends at CE ≈ 2.5 at 10<sup>22</sup> FLOPs.
* 1B (Orange): Starts at CE ≈ 3.7, ends at CE ≈ 2.4 at 10<sup>22</sup> FLOPs.
* 1.748B (Cyan): Starts at CE ≈ 3.6, ends at CE ≈ 2.3 at 10<sup>22</sup> FLOPs.
* 2.430B (Magenta): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 3.714B (Red): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 0.725B (Dark Green): Starts at CE ≈ 3.7, ends at CE ≈ 2.4 at 10<sup>22</sup> FLOPs.
* 0.646B (Dark Blue): Starts at CE ≈ 3.6, ends at CE ≈ 2.3 at 10<sup>22</sup> FLOPs.
* 0.932B (Dark Orange): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 1.627B (Dark Cyan): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 2.280B (Dark Magenta): Starts at CE ≈ 3.3, ends at CE ≈ 2.0 at 10<sup>22</sup> FLOPs.
* 3.354B (Dark Red): Starts at CE ≈ 3.2, ends at CE ≈ 1.9 at 10<sup>22</sup> FLOPs.
* **L Value:** L = 49.99C<sup>-0.062</sup>
**Chart 2 (Top-Center): Interleaved CE vs. FLOPs**
* **Trend:** Similar to Chart 1, all lines slope downward.
* **Data Points (approximate):**
* 0.289B (Green): Starts at CE ≈ 3.8, ends at CE ≈ 2.5 at 10<sup>22</sup> FLOPs.
* 0.494B (Blue): Starts at CE ≈ 3.7, ends at CE ≈ 2.4 at 10<sup>22</sup> FLOPs.
* 1B (Orange): Starts at CE ≈ 3.6, ends at CE ≈ 2.3 at 10<sup>22</sup> FLOPs.
* 1.748B (Cyan): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 2.430B (Magenta): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 3.714B (Red): Starts at CE ≈ 3.3, ends at CE ≈ 2.0 at 10<sup>22</sup> FLOPs.
* 0.725B (Dark Green): Starts at CE ≈ 3.6, ends at CE ≈ 2.3 at 10<sup>22</sup> FLOPs.
* 0.646B (Dark Blue): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 0.932B (Dark Orange): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 1.627B (Dark Cyan): Starts at CE ≈ 3.3, ends at CE ≈ 2.0 at 10<sup>22</sup> FLOPs.
* 2.280B (Dark Magenta): Starts at CE ≈ 3.2, ends at CE ≈ 1.9 at 10<sup>22</sup> FLOPs.
* 3.354B (Dark Red): Starts at CE ≈ 3.1, ends at CE ≈ 1.8 at 10<sup>22</sup> FLOPs.
* **L Value:** L = 25.303C<sup>-0.0460</sup>
**Chart 3 (Top-Right): Text CE vs. FLOPs**
* **Trend:** All lines slope downward.
* **Data Points (approximate):**
* 0.289B (Green): Starts at CE ≈ 3.7, ends at CE ≈ 2.4 at 10<sup>22</sup> FLOPs.
* 0.494B (Blue): Starts at CE ≈ 3.6, ends at CE ≈ 2.3 at 10<sup>22</sup> FLOPs.
* 1B (Orange): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 1.748B (Cyan): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 2.430B (Magenta): Starts at CE ≈ 3.3, ends at CE ≈ 2.0 at 10<sup>22</sup> FLOPs.
* 3.714B (Red): Starts at CE ≈ 3.2, ends at CE ≈ 1.9 at 10<sup>22</sup> FLOPs.
* 0.725B (Dark Green): Starts at CE ≈ 3.5, ends at CE ≈ 2.2 at 10<sup>22</sup> FLOPs.
* 0.646B (Dark Blue): Starts at CE ≈ 3.4, ends at CE ≈ 2.1 at 10<sup>22</sup> FLOPs.
* 0.932B (Dark Orange): Starts at CE ≈ 3.3, ends at CE ≈ 2.0 at 10<sup>22</sup> FLOPs.
* 1.627B (Dark Cyan): Starts at CE ≈ 3.2, ends at CE ≈ 1.9 at 10<sup>22</sup> FLOPs.
* 2.280B (Dark Magenta): Starts at CE ≈ 3.1, ends at CE ≈ 1.8 at 10<sup>22</sup> FLOPs.
* 3.354B (Dark Red): Starts at CE ≈ 3.0, ends at CE ≈ 1.7 at 10<sup>22</sup> FLOPs.
* **L Value:** L = 22.642C<sup>-0.042</sup>
**Charts 4-6 (Middle Row) and 7-9 (Bottom Row):** Follow the same trends and data point structure as the top row, with slightly different starting CE values and L values.
### Key Observations
* **Consistent Trend:** Across all charts and model sizes, increasing FLOPs consistently reduces CE, indicating improved performance.
* **Model Size Impact:** Larger models (represented by darker colors) generally achieve lower CE values for a given FLOPs level, suggesting better performance.
* **Configuration Similarity:** The three charts share the same "45-45-10" training mixture, and their performance curves are similar, with minor variations in the fitted L expressions.
* **L Value Variation:** The L values differ across charts, potentially indicating variations in the training data or optimization process for each metric (Image-Caption, Interleaved, Text).
### Interpretation
The charts relate training compute (FLOPs) to model performance (CE) across model sizes. The consistent downward trend shows that additional compute improves performance on all evaluated metrics, and larger models consistently outperform smaller ones at a given budget, highlighting the benefit of increased capacity. The variation in the fitted L expressions suggests that the compute-loss trade-off differs by task (image captioning, interleaved data, text processing). Returns diminish: while more FLOPs always improve performance, each additional unit of compute yields a smaller gain. This information is valuable for resource allocation and model selection, allowing developers to choose the most efficient model size for a given performance target and computational budget. The "45-45-10" mixture appears to be held constant across these experiments.
</details>
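The power-law fits quoted in the description above (e.g., L = 49.99C<sup>-0.062</sup>) are typically obtained by ordinary least squares in log-log space, since log L = log a + b log C is linear in log C. A minimal illustrative sketch on synthetic data (not the authors' fitting code; the coefficients below are only used to generate the example):

```python
import numpy as np

def fit_power_law(flops, loss):
    """Fit L = a * C**b by least squares on (log C, log L).

    np.polyfit with degree 1 returns [slope, intercept];
    slope = b, intercept = log a.
    """
    b, log_a = np.polyfit(np.log(flops), np.log(loss), 1)
    return np.exp(log_a), b  # (coefficient a, exponent b)

# Synthetic points roughly matching the plotted compute range (1e19..1e22 FLOPs),
# generated from an assumed law with small multiplicative noise.
rng = np.random.default_rng(0)
C = np.logspace(19, 22, 20)
L = 49.99 * C**-0.062 * np.exp(rng.normal(0.0, 0.002, C.size))

a, b = fit_power_law(C, L)
print(f"L = {a:.2f} * C^({b:.3f})")  # should recover roughly a=49.99, b=-0.062
```

The high R<sup>2</sup> values reported for such fits correspond to the residuals of this log-log regression being small.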
Figure 28. Scaling laws for early-fusion native multimodal models. Our runs across different training mixtures (Image-caption-Interleaved-Text) and FLOPs. We visualize the final validation loss on three data types: HQITP (left), Obelics (middle) and DCLM (right).
<details>
<summary>Image 27 Details</summary>

### Visual Description
## Charts: Cross-Entropy vs. FLOPS for Different Model Sizes
### Overview
The image presents a grid of scatter plots, each showing the relationship between cross-entropy (CE) and training compute (FLOPs, total floating-point operations) for a different training mixture. Each plot shows three series: "Image-Caption CE" (orange), "Interleaved CE" (purple), and "Text CE" (teal). Each plot also includes a power-law fit (a straight line in log-log coordinates) with its equation and R-squared value. The training mixtures are indicated in the plot titles.
### Components/Axes
* **X-axis:** FLOPs, ranging from 10<sup>19</sup> to 10<sup>22</sup>, on a logarithmic scale.
* **Y-axis:** Cross-Entropy (CE), ranging from 2.0 to 4.0.
* **Data Series:**
* Image-Caption CE (orange circles)
* Interleaved CE (purple diamonds)
* Text CE (teal squares)
* **Linear Regression:** Each plot includes a best-fit power law of the form L = aC<sup>b</sup> (a straight line in log-log coordinates), together with its R<sup>2</sup> value.
* **Plot Titles:** Indicate the training mixture being evaluated (e.g., "45-45-10", "40-20-40", "30-30-40").
* **Legend:** Located in the top-right corner of each plot, identifying the data series by color and shape.
### Detailed Analysis or Content Details
Here's a breakdown of the data from each plot, including the fitted power law and its R<sup>2</sup> value. The trend is described first, then the extracted values.
**Row 1:**
* **45-45-10:**
* *Image-Caption CE:* Line slopes downward. L = 47.97C<sup>-0.061</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 25.11C<sup>-0.0458</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 22.709C<sup>-0.042</sup>, R<sup>2</sup> ≈ 0.99.
* **45-45-77:**
* *Image-Caption CE:* Line slopes downward. L = 49.477C<sup>-0.061</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 22.112C<sup>-0.043</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 21.352C<sup>-0.042</sup>, R<sup>2</sup> ≈ 0.99.
**Row 2:**
* **39-51-8C:**
* *Image-Caption CE:* Line slopes downward. L = 39.518C<sup>-0.056</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 22.111C<sup>-0.043</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 20.257C<sup>-0.041</sup>, R<sup>2</sup> ≈ 0.99.
* **40-20-40:**
* *Image-Caption CE:* Line slopes downward. L = 23.888C<sup>-0.045</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 16.278C<sup>-0.0281</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 15.644C<sup>-0.0266</sup>, R<sup>2</sup> ≈ 0.99.
**Row 3:**
* **30-30-40:**
* *Image-Caption CE:* Line slopes downward. L = 42.216C<sup>-0.0589</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 23.888C<sup>-0.045</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 21.53C<sup>-0.0425</sup>, R<sup>2</sup> ≈ 0.99.
* **40-40-40:**
* *Image-Caption CE:* Line slopes downward. L = 40.232C<sup>-0.055</sup>, R<sup>2</sup> ≈ 0.99.
* *Interleaved CE:* Line slopes downward. L = 16.278C<sup>-0.0281</sup>, R<sup>2</sup> ≈ 0.99.
* *Text CE:* Line slopes downward. L = 15.644C<sup>-0.0266</sup>, R<sup>2</sup> ≈ 0.99.
All plots exhibit a strong negative power-law relationship between FLOPs and CE, as indicated by the consistently high R<sup>2</sup> values (approximately 0.99): as FLOPs increase, CE decreases for all data series.
### Key Observations
* The "Image-Caption CE" fits consistently have the largest coefficient and the steepest (most negative) exponent, indicating a higher initial CE and the strongest response to additional FLOPs.
* The "Text CE" fits consistently have the smallest coefficient and the shallowest exponent, indicating a lower initial CE and the weakest sensitivity to FLOPs.
* The "Interleaved CE" fits fall between the other two series in both coefficient and exponent.
* The R<sup>2</sup> values are remarkably consistent across all plots and data series, indicating a very strong power-law fit (linear in log-log space) in each case.
### Interpretation
The data suggests that increasing training compute (FLOPs) reduces cross-entropy, i.e., improves model performance. The three series (Image-Caption, Interleaved, and Text) represent different evaluation data, and their differing coefficients and exponents show that the compute-loss relationship depends on the data type. The consistently high R<sup>2</sup> values indicate that a power law (linear in log-log space) approximates this relationship well.
The steeper exponent for Image-Caption CE suggests that this data type benefits the most from additional compute, whereas Text CE appears least sensitive to changes in FLOPs; Interleaved CE represents a middle ground.
The plots demonstrate a clear trade-off between training compute (FLOPs) and model accuracy (CE). Model developers can use this information to decide how much compute to allocate to different tasks, and the fitted power laws provide a quantitative estimate of the expected reduction in CE for a given increase in FLOPs. The consistency of the R<sup>2</sup> values across training mixtures suggests that the observed relationship is robust and generalizable.
</details>
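A fitted law of this form can be used to extrapolate loss to larger budgets and to quantify the diminishing returns noted above. As a hypothetical usage sketch (the coefficients are taken from the image-caption fit quoted for the 45-45-10 mixture, a = 47.97 and b = -0.061; the functional form L = aC<sup>b</sup> is assumed from the plots):

```python
# Predict validation CE at a target compute budget from a fitted power law
# L(C) = a * C**b with b < 0, and inspect the marginal return dL/dC.
a, b = 47.97, -0.061  # image-caption fit for the 45-45-10 mixture (illustrative)

def loss(C):
    """Predicted cross-entropy after C training FLOPs."""
    return a * C**b

def marginal_gain(C):
    """|dL/dC| = |a * b| * C**(b - 1); shrinks as C grows (diminishing returns)."""
    return abs(a * b) * C**(b - 1)

for C in (1e19, 1e20, 1e21, 1e22):
    print(f"C = {C:.0e}: L = {loss(C):.3f}, |dL/dC| = {marginal_gain(C):.2e}")
```

Over the plotted range this predicts CE dropping from roughly 3.3 at 10<sup>19</sup> FLOPs to roughly 2.2 at 10<sup>22</sup> FLOPs, consistent with the curves described, while the marginal gain per FLOP falls by orders of magnitude.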