CV | CL | May 22, 2024

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations

arXiv:2405.14030v3 | 2 citations | h-index: 3 | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Synthesis-oriented
AI Analysis

This work addresses bias in foundational AI models used for downstream applications, but it is incremental: it builds on existing CLIP analysis and relies on a simple linear probe.

The study tackled the problem of biases in vision-language contrastive models like CLIP, which arise from spurious correlations in pre-training data, and found that using visual representations instead of text embeddings is more effective for refining these skewed perceptions.

Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, like other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study investigates how a simple linear probe can effectively distill task-specific core features from CLIP's embeddings for downstream applications. Our analysis reveals that CLIP's text representations are often tainted by spurious correlations inherited from the biased pre-training dataset. Empirical evidence suggests that relying on CLIP's visual representations, rather than its text embeddings, is more effective for refining the skewed perceptions in VLCMs, which underscores the utility of visual representations in overcoming embedded biases. Our code is publicly available.
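To make the contrast in the abstract concrete, the following is a minimal sketch of the two ways of classifying with frozen embeddings: matching image embeddings against per-class text embeddings (zero-shot), and training a linear probe on the image embeddings alone. It is not the authors' implementation. The function names (`zero_shot_predict`, `train_linear_probe`, `probe_predict`), the hyperparameters, and the synthetic Gaussian data are all assumptions made for illustration; in practice the arrays would hold embeddings extracted from a CLIP image and text encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as is usual before cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_predict(image_emb, text_emb):
    """Pick the class whose text embedding has the highest cosine similarity."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)

def train_linear_probe(image_emb, labels, num_classes,
                       lr=0.5, epochs=200, weight_decay=1e-4):
    """Softmax regression on frozen image embeddings (full-batch gradient descent)."""
    x = l2_normalize(image_emb)
    n, d = x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        w -= lr * (x.T @ grad + weight_decay * w)
        b -= lr * grad.sum(axis=0)
    return w, b

def probe_predict(image_emb, w, b):
    """Classify with the trained probe; the embeddings themselves stay frozen."""
    return (l2_normalize(image_emb) @ w + b).argmax(axis=1)

# Synthetic stand-in for frozen embeddings (assumption: NOT real CLIP features).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
class_means = rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=300)
image_emb = class_means[labels] + 0.3 * rng.normal(size=(300, dim))
# Hypothetical per-class "text" embeddings: noisy copies of the class means.
text_emb = class_means + 0.3 * rng.normal(size=(num_classes, dim))

fit, held_out = slice(0, 200), slice(200, 300)
w, b = train_linear_probe(image_emb[fit], labels[fit], num_classes)
probe_acc = (probe_predict(image_emb[held_out], w, b) == labels[held_out]).mean()
zero_shot_acc = (zero_shot_predict(image_emb[held_out], text_emb) == labels[held_out]).mean()
print(f"linear probe: {probe_acc:.2f}  zero-shot: {zero_shot_acc:.2f}")
```

The synthetic data contains no spurious correlations, so the printed accuracies say nothing about the paper's findings. The sketch only shows that the probe's decision rule is learned from visual features and task labels, whereas the zero-shot rule is fixed by the text embeddings.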

