CV · Mar 17

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

arXiv:2603.16139 · 92.9 · h-index: 1 · Has Code

Predicted impact: top 12% in CV · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses the high computational cost and paired-data dependency that researchers and practitioners in multimodal AI face when pre-training visual generation. The contribution is incremental, as it builds on existing UMM paradigms.

The paper targets the inefficient, data-hungry pre-training of the visual generation components in Unified Multimodal Models (UMMs). It proposes Image-Only Training for UMMs (IOMM), a two-stage framework that first pre-trains on unlabeled images alone and then fine-tunes on a mixture of unlabeled images and a small curated set of text-image pairs. The IOMM-B (3.6B) model reports state-of-the-art results of 0.89 on GenEval and 0.55 on WISE using only ~1050 H800 GPU hours.

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.
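The two-stage schedule described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' code. The `Stage` class, the `sample_batch_sources` and `compute_share` helpers, and the stage-2 mixing ratio of 0.5 are all hypothetical. The stage-2 budget of about 50 GPU hours is inferred by subtracting the 1000 pre-training hours from the ~1050 total.

```python
import random
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    gpu_hours: float        # H800 GPU hours (stage 1 from the abstract; stage 2 by subtraction)
    paired_fraction: float  # share of text-image pairs per batch (hypothetical value)


# Assumption: the 0.5 mixing ratio in stage 2 is illustrative, not a reported number.
STAGES = [
    Stage("image_only_pretrain", gpu_hours=1000.0, paired_fraction=0.0),
    Stage("mixed_finetune", gpu_hours=50.0, paired_fraction=0.5),
]


def sample_batch_sources(stage, batch_size, rng):
    """Tag each sample in a batch as drawn from unlabeled images or text-image pairs."""
    return [
        "pair" if rng.random() < stage.paired_fraction else "image"
        for _ in range(batch_size)
    ]


def compute_share(stages):
    """Fraction of the total GPU-hour budget spent in each stage."""
    total = sum(s.gpu_hours for s in stages)
    return {s.name: s.gpu_hours / total for s in stages}


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in STAGES:
        print(stage.name, sample_batch_sources(stage, 8, rng))
    print(compute_share(STAGES))
```

Under these assumptions, stage 1 never draws a paired sample, and roughly 95% of the compute budget (1000 of ~1050 hours) goes to the image-only phase.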

