LG | AI | CV | Dec 18, 2024

I0T: Embedding Standardization Method Towards Zero Modality Gap

arXiv:2412.14384v1 | 5 citations | h-index: 4 | ACL
Originality: Incremental advance
AI Analysis

The paper addresses the modality gap, a key bottleneck in multimodal AI for tasks such as retrieval and classification, and offers an incremental improvement to existing CLIP frameworks.

The paper tackles the modality gap in CLIP-based models, where image and text embeddings diverge, by proposing two methods: a post-hoc standardization that reduces the gap to near zero, and a trainable approach that adds normalization layers to each encoder. Both preserve the original embeddings of the trained models, whose parameters stay locked.

Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the modality gap, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image/text encoder independently possesses, and we propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, $\text{I0T}_{\text{post}}$, that reduces the modality gap to approximately zero and (2) a trainable method, $\text{I0T}_{\text{async}}$, that alleviates the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their locked parameters. In practice, $\text{I0T}_{\text{post}}$ can serve as an explainable automatic evaluation metric that offers an alternative to the widely used CLIPScore (CLIP-S).
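The abstract does not spell out the standardization step, so the following is only a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that (a) the modality gap is measured as the Euclidean distance between the centroids of the image and text embeddings, and (b) post-hoc standardization means subtracting each modality's own mean embedding and rescaling every vector back to unit length. All function names (`modality_gap`, `standardize`, `toy_embeddings`) are hypothetical, and the toy data stands in for real encoder outputs.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """ASSUMED gap measure: distance between the two modality centroids."""
    ci, ct = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))


def standardize(embs):
    """ASSUMED post-hoc step: remove the modality mean, then re-normalize.

    The encoder is never touched; only its output embeddings are shifted.
    """
    mean = centroid(embs)
    centered = [[x - m for x, m in zip(v, mean)] for v in embs]
    return [l2_normalize(v) for v in centered]


def toy_embeddings(seed=0, n=200, dim=16):
    """Hypothetical stand-in for encoder outputs.

    Each modality gets Gaussian noise plus an offset along its own axis,
    mimicking a modality-specific component shared by all its embeddings.
    """
    rng = random.Random(seed)

    def sample(offset_axis):
        vecs = []
        for _ in range(n):
            v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            v[offset_axis] += 3.0
            vecs.append(l2_normalize(v))
        return vecs

    return sample(0), sample(1)


if __name__ == "__main__":
    image_embs, text_embs = toy_embeddings()
    print("gap before:", modality_gap(image_embs, text_embs))
    print("gap after: ",
          modality_gap(standardize(image_embs), standardize(text_embs)))
```

Because the shift is computed separately for each modality from already-extracted embeddings, a step of this kind needs no gradient updates, which is consistent with the abstract's claim that the trained models keep their locked parameters.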

Code Implementations: 1 repo
