MM CL CVJun 4, 2025

How Far Are We from Generating Missing Modalities with Foundation Models?

Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

arXiv:2506.03530v2h-index: 8Has Code

Originality Incremental advance

AI Analysis

This work addresses the challenge of generating missing modalities in multimodal AI systems, which is an incremental improvement for applications like data augmentation and robust AI.

The paper tackles the problem of missing modality reconstruction using foundation models, identifying three paradigms and evaluating 42 model variants, and proposes an agentic framework with self-refinement that reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines.

Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14\% and MER for missing text reconstruction by at least 10\% compared to baselines. Code are released at: https://github.com/Guanzhou-Ke/AFM2.

View on arXiv PDF Code

Similar