Closing the Modality Gap Aligns Group-Wise Semantics
This work addresses the modality gap in multimodal learning, offering researchers incremental insights by re-evaluating the gap's impact on group-wise tasks.
The paper tackles the modality gap in multimodal learning. It shows that the gap has limited impact on instance-wise tasks such as retrieval but strongly affects group-wise tasks such as clustering, and it proposes a method that reduces the gap and significantly improves performance on group-wise tasks.
In multimodal learning, CLIP has been recognized as the \textit{de facto} method for learning a shared latent space across multiple modalities, placing similar representations close to each other and pushing dissimilar ones apart. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we show that its influence is strongly pronounced in group-wise tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through extensive evaluation, we demonstrate our central insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
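As a rough illustration of why a modality gap can coexist with good instance-wise retrieval, the following minimal sketch builds toy paired embeddings in which one modality is shifted by a constant offset. It is not the method or the metric used in this work: the centroid distance is only one commonly used proxy for the gap, and the function names (`centroid_gap`, `paired_recall_at_1`) and the toy data are assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def centroid_gap(emb_a, emb_b):
    """Assumed gap proxy: distance between the centroids of the two
    L2-normalized embedding sets (hypothetical helper, not from the paper)."""
    a = l2_normalize(np.asarray(emb_a, dtype=float))
    b = l2_normalize(np.asarray(emb_b, dtype=float))
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def paired_recall_at_1(emb_a, emb_b):
    """Fraction of rows in emb_a whose most cosine-similar row in emb_b
    is the paired one (same index)."""
    a = l2_normalize(np.asarray(emb_a, dtype=float))
    b = l2_normalize(np.asarray(emb_b, dtype=float))
    sims = a @ b.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(a))))


# Toy data: four paired items in a 5-dimensional space.
# The first four coordinates carry the shared semantics; the last coordinate
# is a modality-specific offset applied only to the "text" embeddings.
image_emb = np.hstack([np.eye(4), np.zeros((4, 1))])
text_emb = np.hstack([np.eye(4), np.full((4, 1), 2.0)])

gap = centroid_gap(image_emb, text_emb)
recall = paired_recall_at_1(image_emb, text_emb)
print(f"centroid gap: {gap:.3f}, paired recall@1: {recall:.2f}")
```

In this toy setting every item still retrieves its own pair, yet the two modalities occupy clearly separated regions of the space, which is the kind of structure a clustering algorithm run on the joint embeddings would pick up on.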