CV | Oct 27, 2025

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

arXiv:2510.22946v4 | 9 citations | h-index: 15
Originality: Incremental advance
AI Analysis

The paper addresses the computational cost of building unified multimodal models, a concern for researchers and practitioners in multimodal AI. The advance is incremental because it builds on existing models.

The paper tackles the inefficiency of training unified multimodal models from scratch by proposing LightFusion, a framework that fuses existing specialized models with multimodal self-attention blocks, achieving strong results such as 0.91 on GenEval and 82.16 on DPG-Bench with only ~35B tokens of training.

Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
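
The interleaving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`fusion_block`, `double_fusion_forward`, `scale_block`), the `every` interval, and the weight-free single-head attention (Q = K = V) are all assumptions made for illustration. The sketch only shows the control flow the abstract describes: the original blocks of each tower run unchanged, and a joint self-attention step over the concatenated token streams is inserted between them.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(tokens):
    """Single-head self-attention with Q = K = V = tokens (assumption: no learned weights)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def fusion_block(und_tokens, gen_tokens):
    """Attend over both streams jointly, add a residual, then split the streams back."""
    joint = und_tokens + gen_tokens
    attended = joint_self_attention(joint)
    fused = [[x + a for x, a in zip(tok, att)] for tok, att in zip(joint, attended)]
    n = len(und_tokens)
    return fused[:n], fused[n:]


def double_fusion_forward(und_blocks, gen_blocks, und_tokens, gen_tokens, every=1):
    """Run the original blocks as-is, inserting a fusion block after every `every` layers."""
    for i, (und_block, gen_block) in enumerate(zip(und_blocks, gen_blocks), start=1):
        und_tokens = und_block(und_tokens)  # original understanding block, untouched
        gen_tokens = gen_block(gen_tokens)  # original generation block, untouched
        if i % every == 0:
            und_tokens, gen_tokens = fusion_block(und_tokens, gen_tokens)
    return und_tokens, gen_tokens


def scale_block(factor):
    """Hypothetical stand-in for a pretrained transformer block."""
    return lambda tokens: [[factor * x for x in tok] for tok in tokens]


und_out, gen_out = double_fusion_forward(
    [scale_block(1.0)] * 2,
    [scale_block(0.5)] * 2,
    und_tokens=[[1.0, 0.0], [0.0, 1.0]],
    gen_tokens=[[1.0, 1.0]],
)
```

Because the fusion step is additive (a residual connection), skipping it leaves each tower's output exactly as the base blocks produced it, which is one plausible reading of how such a design could preserve the original strengths of the base models.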
