CV · AI · May 11

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

arXiv:2605.11107 · 8.1
Predicted impact: top 74% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners deploying VLMs in real-world settings, this method mitigates background spurious correlations without requiring real-world debiased data, offering a practical solution to a known robustness bottleneck.

The paper addresses systematic background biases in vision-language models (VLMs) such as CLIP and SigLIP 2. By exploiting linear additivity in VLM embedding spaces, the authors propose a pre-training method that achieves over 90% worst-group accuracy on Waterbirds under perfect spurious correlation, with strong sim-to-real transfer.
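Worst-group accuracy, the metric quoted above, is the lowest accuracy over all (label, background) groups, so a model that leans on the background is penalized even when its average accuracy looks fine. The following is a minimal sketch of the metric on toy data, not the paper's evaluation code; the function name `worst_group_accuracy` and the integer encodings are assumptions made for illustration.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, background):
    """Return the lowest per-group accuracy and the per-group table.

    A group is a (label, background) pair, e.g. (waterbird, land).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    background = np.asarray(background)
    per_group = {}
    for label in np.unique(y_true):
        for bg in np.unique(background):
            mask = (y_true == label) & (background == bg)
            if mask.any():  # skip groups with no examples
                acc = (y_pred[mask] == y_true[mask]).mean()
                per_group[(int(label), int(bg))] = float(acc)
    return min(per_group.values()), per_group

# Toy data (assumed encoding): label 0 = landbird, 1 = waterbird;
# background 0 = land, 1 = water.
y_true     = [0, 0, 0, 0, 1, 1, 1, 1]
background = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred     = [0, 0, 0, 1, 1, 0, 1, 1]

wga, table = worst_group_accuracy(y_true, y_pred, background)
average = float((np.asarray(y_pred) == np.asarray(y_true)).mean())
print(f"average accuracy: {average:.2f}, worst-group accuracy: {wga:.2f}")
```

In this toy case the two groups where label and background disagree are the ones that drag the minimum below the average, which is the failure pattern the Waterbirds benchmark is designed to expose.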

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
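The linear-additivity idea in the abstract can be illustrated with synthetic vectors. The sketch below assumes a scene embedding is approximately the sum of a foreground component, a background component, and a small residual; it uses random vectors rather than a real CLIP or SigLIP 2 encoder, and it is not the paper's pre-training procedure. All names (`toy_scene_embedding`, `DIM`, the noise scale) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # assumed embedding width, for illustration only

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for encoder outputs (assumption: no real model is used).
foreground = unit(rng.normal(size=DIM))   # e.g. the bird
bg_land = unit(rng.normal(size=DIM))      # e.g. a forest background
bg_water = unit(rng.normal(size=DIM))     # e.g. a lake background

def toy_scene_embedding(fg, bg, noise_scale=0.1):
    """Additive toy model: scene = foreground + background + small residual."""
    noise = noise_scale * unit(rng.normal(size=DIM))
    return fg + bg + noise

scene_on_land = toy_scene_embedding(foreground, bg_land)
scene_on_water = toy_scene_embedding(foreground, bg_water)

# Same foreground, different backgrounds: the raw embeddings disagree.
sim_before = cosine(scene_on_land, scene_on_water)

# Subtracting the background component leaves an approximately
# background-invariant representation of the foreground.
sim_after = cosine(scene_on_land - bg_land, scene_on_water - bg_water)

print(f"cosine before removing background: {sim_before:.2f}")
print(f"cosine after removing background:  {sim_after:.2f}")
```

Under this toy model the two views of the same foreground become nearly identical once the background term is removed. With a real encoder the additivity is only approximate, and the background component would itself have to be estimated, for example from background-only images.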