CV · Apr 10, 2025

Scaling Laws for Native Multimodal Models

Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby

arXiv:2504.07951v431.641 citations · h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses an architectural design question for researchers and practitioners building multimodal AI systems: whether to fuse modalities early or late. It shows that early-fusion can be more effective, an incremental but impactful result.

The study investigated whether late-fusion architectures are inherently superior for multimodal models, using a scaling-laws analysis of 457 trained models. It found that early-fusion architectures perform better at lower parameter counts, are more efficient to train, and are easier to deploy, and that adding Mixture of Experts improves performance further.

Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
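For readers unfamiliar with what a scaling-laws study fits, the sketch below shows one common ingredient: estimating a power law, loss = A * N^(-alpha), from (model size, loss) pairs by linear regression in log-log space. This is an illustration under stated assumptions, not the paper's procedure. The functional form, the function name `fit_power_law`, and the data points are all hypothetical, and the numbers are synthetic rather than results from the 457 models.

```python
import math


def fit_power_law(n_params, losses):
    """Fit loss = A * n_params**(-alpha) by least squares in log-log space.

    Assumed functional form for illustration; the abstract does not state
    which form the authors fit.
    """
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of log(loss) vs log(N).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (A, alpha)


# Synthetic, noise-free data generated from A = 50, alpha = 0.15.
# These values are made up and are NOT taken from the paper.
n_params = [1e8, 3e8, 1e9, 3e9]
losses = [50.0 * n ** -0.15 for n in n_params]

A, alpha = fit_power_law(n_params, losses)
print(f"A = {A:.3f}, alpha = {alpha:.3f}")
```

Fitting one such curve per architecture family and comparing the fitted exponents and coefficients is one way a claim like "stronger performance at lower parameter counts" could be examined, though the abstract does not specify the authors' exact methodology.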

