LG | AI | CV | May 28, 2025

A Closer Look at Multimodal Representation Collapse

arXiv:2505.22483v2 | 21 citations | h-index: 6 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses a fundamental issue in multimodal AI that matters to both researchers and practitioners. It offers a novel approach to improving model robustness, though the advance is incremental and builds on existing knowledge.

The paper tackles modality collapse in multimodal fusion, where models ignore some modalities. It shows that collapse arises from noisy feature entanglement and proposes an algorithm that prevents it through explicit basis reallocation, validated on multiple benchmarks.

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
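
The two quantities the abstract reasons about, how much a fusion head relies on each modality and how much rank a representation actually uses, can be probed with simple diagnostics. The sketch below is illustrative only and is not the paper's algorithm: the function names `effective_rank` and `modality_reliance`, the zero-ablation of a modality, the entropy-based rank measure, and the toy linear fusion head are all assumptions made for this example.

```python
import numpy as np


def effective_rank(features: np.ndarray) -> float:
    """Entropy-based effective rank of an (n_samples, dim) feature matrix.

    Hypothetical diagnostic for spotting rank bottlenecks; not from the paper.
    """
    s = np.linalg.svd(features, compute_uv=False)
    s = s[s > 1e-12 * s.max()]          # drop numerically zero singular values
    p = s / s.sum()                     # normalize to a distribution
    return float(np.exp(-(p * np.log(p)).sum()))


def modality_reliance(fusion_fn, x_a: np.ndarray, x_b: np.ndarray):
    """Mean absolute change in the fused output when each modality is zeroed.

    Zero-ablation is an assumption here; other ablations (mean, shuffle) exist.
    """
    full = fusion_fn(x_a, x_b)
    drop_a = np.abs(full - fusion_fn(np.zeros_like(x_a), x_b)).mean()
    drop_b = np.abs(full - fusion_fn(x_a, np.zeros_like(x_b))).mean()
    return float(drop_a), float(drop_b)


# Toy data: two modalities with 8 features each.
rng = np.random.default_rng(0)
x_a = rng.normal(size=(256, 8))
x_b = rng.normal(size=(256, 8))

# Toy linear fusion head in which modality B is almost ignored ("collapsed").
w_a = rng.normal(size=(8, 4))
w_b = 1e-3 * rng.normal(size=(8, 4))


def toy_fusion(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a @ w_a + b @ w_b


reliance_a, reliance_b = modality_reliance(toy_fusion, x_a, x_b)

# Full-rank features versus features squeezed through a rank-1 bottleneck.
rank_full = effective_rank(x_a)
rank_low = effective_rank(x_a[:, :1] @ rng.normal(size=(1, 8)))
```

In this toy setup, a reliance score for modality B that is far smaller than that for modality A signals a collapsed modality, and an effective rank close to 1 signals a representation that has been squeezed through a bottleneck. Real models would need these probes applied to learned encoders and fusion heads rather than random matrices.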
