CV · Mar 17

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

arXiv:2603.16100 · 78.8 · h-index: 2
Predicted impact: top 40% in CV · last 90 days
Originality: Synthesis-oriented
AI Analysis

This work clarifies misconceptions about CLIP's limitations for image-only tasks, potentially guiding more effective model improvements.

The study challenges the intra-modal misalignment hypothesis in CLIP-like models. It shows that neither the theoretical argument nor the empirical measures support the hypothesis, and its experiments on retrieval and few-shot classification indicate that task ambiguity matters more for performance.

Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that the supposed degrees of freedom for image embedding distances do not exist. For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates that the observed phenomena do not stem from a misalignment specific to the former. Experiments on two commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, not the supposed misalignment, is key to the best results.
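
For readers unfamiliar with the two intra-modal tasks named in the abstract, the following is a minimal sketch of how image-to-image retrieval and few-shot classification are commonly run on top of frozen image embeddings. It is not the paper's method or evaluation protocol: the function names, the nearest-prototype classifier, and the synthetic embeddings (random cluster centers standing in for encoder outputs) are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def retrieve_top_k(query_emb, gallery_emb, k=5):
    """Image-to-image retrieval: indices of the k most similar gallery images per query."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T  # (n_query, n_gallery)
    return np.argsort(-sims, axis=1)[:, :k]

def prototype_classify(support_emb, support_labels, query_emb):
    """Few-shot classification: assign each query to the nearest class-mean embedding."""
    classes = np.unique(support_labels)
    s = l2_normalize(support_emb)
    protos = np.stack([s[support_labels == c].mean(axis=0) for c in classes])
    sims = l2_normalize(query_emb) @ l2_normalize(protos).T
    return classes[np.argmax(sims, axis=1)]

# Synthetic stand-ins for image embeddings (assumption: a real setup would use
# the output of an image encoder such as CLIP or DINO here).
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))                 # 3 hypothetical classes, 16-dim
support_labels = np.repeat(np.arange(3), 4)        # 4 labeled shots per class
support_emb = centers[support_labels] + 0.05 * rng.normal(size=(12, 16))
query_labels = np.array([0, 1, 2])
query_emb = centers[query_labels] + 0.05 * rng.normal(size=(3, 16))

top = retrieve_top_k(query_emb, support_emb, k=4)
pred = prototype_classify(support_emb, support_labels, query_emb)
print(support_labels[top])  # labels of the retrieved images, one row per query
print(pred)                 # predicted class per query
```

Both tasks depend only on distances between image embeddings, which is why the calibration of those distances is the quantity the hypothesis is concerned with.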

