LG · CV · Jan 26

Closing the Modality Gap Aligns Group-Wise Semantics

arXiv:2601.18525v1 · 1 citation · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses the modality gap problem for researchers in multimodal learning, offering incremental insights by re-evaluating its impact on group-wise tasks.

The paper examines the modality gap in multimodal learning. It argues that the gap has limited impact on instance-wise tasks such as retrieval but a strong effect on group-level tasks such as clustering, and it proposes a method that reduces the gap and significantly improves performance on group-wise tasks.
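The contrast between instance-wise and group-wise tasks can be illustrated with a toy construction. The sketch below is not the paper's experiment or method. It assumes synthetic 3-D embeddings in which two semantic groups live in the first two coordinates and the modality gap is a constant offset along the third. The function names (`make_pairs`, `recall_at_1`, `two_means`, `pair_agreement`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def make_pairs(gap, n_per_group=50, seed=0):
    """Two semantic groups, two modalities; the last axis carries only the gap."""
    rng = np.random.default_rng(seed)
    centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
    labels = np.repeat([0, 1], n_per_group)
    shared = centers[labels] + 0.2 * rng.normal(size=(2 * n_per_group, 2))
    # Images sit at height 0, texts at height `gap` (assumed orthogonal offset).
    img = np.column_stack([shared, np.zeros(len(shared))])
    txt = np.column_stack([shared + 0.005 * rng.normal(size=shared.shape),
                           np.full(len(shared), gap)])
    return img, txt

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def recall_at_1(img, txt):
    """Instance-wise task: fraction of images whose nearest text is their own pair."""
    nearest = sq_dists(img, txt).argmin(axis=1)
    return float((nearest == np.arange(len(img))).mean())

def two_means(points, n_iter=20):
    """Plain 2-means with a deterministic farthest-point initialisation."""
    first = points[0]
    second = points[sq_dists(points[:1], points)[0].argmax()]
    centroids = np.stack([first, second])
    for _ in range(n_iter):
        assign = sq_dists(points, centroids).argmin(axis=1)
        centroids = np.stack([points[assign == k].mean(axis=0) for k in range(2)])
    return assign

def pair_agreement(img, txt):
    """Group-wise task: fraction of image-text pairs placed in the same cluster."""
    assign = two_means(np.concatenate([img, txt]))
    n = len(img)
    return float((assign[:n] == assign[n:]).mean())

for gap in (0.0, 6.0):
    img, txt = make_pairs(gap)
    print(f"gap={gap}: recall@1={recall_at_1(img, txt):.2f}, "
          f"same-cluster pairs={pair_agreement(img, txt):.2f}")
```

In this construction the offset adds the same constant to every image-to-text distance, so nearest-neighbour retrieval is unchanged. Clustering the pooled embeddings, by contrast, splits by modality rather than by semantics once the offset exceeds the separation between groups. Real embedding spaces need not satisfy the orthogonality assumption made here.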

In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
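For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss and of one common way to quantify the modality gap, namely the distance between the centroids of the normalized embeddings of each modality. It is not the gap-reduction method proposed in the paper, which the abstract does not specify. The function names, the temperature value 0.07, and the choice of the centroid distance as the gap measure are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is usual before cosine comparison."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss; row i of img and row i of txt form a positive pair."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature            # cosine similarities, scaled
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[diag, diag].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def centroid_gap(img, txt):
    """Assumed gap measure: Euclidean distance between modality centroids."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Illustrative usage on random vectors (stand-ins for real encoder outputs).
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 16))
shift = np.zeros(16)
shift[0] = 3.0                                    # hypothetical constant offset
print("loss, matched pairs:   ", clip_loss(base, base))
print("loss, mismatched pairs:", clip_loss(base, base[::-1]))
print("gap, no offset:        ", centroid_gap(base, base))
print("gap, with offset:      ", centroid_gap(base, base + shift))
```

The loss only rewards each positive pair for scoring higher than the negatives in its batch. It places no direct constraint on where the two modality centroids sit, which is one way to see why a low contrastive loss and a nonzero centroid gap can coexist.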

