CV · LG · May 6, 2025

Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

arXiv:2505.03703v1 · 6 citations · h-index: 1 · Has Code
Originality: Highly original
AI Analysis

The modality gap is a key bottleneck for multimodal tasks such as retrieval and classification, and the paper offers practical methods for reducing it to improve model performance.

The paper tackles the modality gap problem in vision-language models, where image and text embeddings are misaligned, and proposes novel measures and techniques to quantify and reduce this gap, demonstrating effectiveness across multiple datasets and models.

Vision-language models (VLMs) embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap: a clear separation between the embeddings of one modality and those of the other in the embedding space. While this misalignment is detrimental to downstream tasks such as multimodal retrieval, multimodal clustering, and zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely or to reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
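
To make the notion of a modality gap concrete, here is a minimal sketch under stated assumptions. It does not implement the paper's spectral or optimal transport methods, nor its proposed measures. It uses a simpler, commonly used proxy (the distance between the centroids of the normalized image and text embeddings) and a naive per-modality mean-centering baseline. The embeddings are synthetic stand-ins, and the function names `centroid_gap` and `center_per_modality` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_gap(image_emb, text_emb):
    """Euclidean distance between the centroids of the two normalized sets.

    This is a simple proxy for the modality gap, not the paper's measure.
    """
    img_center = l2_normalize(image_emb).mean(axis=0)
    txt_center = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))


def center_per_modality(image_emb, text_emb):
    """Subtract each modality's own centroid, then re-normalize.

    A naive gap-reduction baseline, assumed here for illustration only.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    img_centered = l2_normalize(img - img.mean(axis=0))
    txt_centered = l2_normalize(txt - txt.mean(axis=0))
    return img_centered, txt_centered


# Synthetic stand-ins for VLM embeddings: two clouds pushed apart along one
# axis, mimicking a clear separation between modalities.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 3.0
image_emb = rng.normal(size=(200, dim)) + offset
text_emb = rng.normal(size=(200, dim)) - offset

gap_before = centroid_gap(image_emb, text_emb)
img_c, txt_c = center_per_modality(image_emb, text_emb)
gap_after = centroid_gap(img_c, txt_c)
print(f"centroid gap before centering: {gap_before:.3f}")
print(f"centroid gap after centering:  {gap_after:.3f}")
```

On this toy data the centroid distance shrinks after centering, but a small centroid distance alone does not show that matching image-text pairs are well aligned, which is one reason more careful measures are of interest.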

