Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family
This paper addresses a fundamental limitation of CLIP models in vision-language tasks, though the contribution is incremental in that it builds on known issues with fine-grained understanding.
The paper identifies a center bias in CLIP models where they disproportionately focus on central image regions, overlooking boundary objects, and shows this can be mitigated with training-free strategies like visual prompting and attention redistribution.
Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to focus disproportionately on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental: failing to recognize relevant objects makes it difficult to perform any more sophisticated task that depends on those objects. To understand the underlying causes of center bias, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model's final representation because information is lost during the aggregation of visual embeddings, particularly through the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution, which redirect the model's attention to off-center regions.
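As a rough illustration of the two ideas in the abstract (measuring how much attention falls on the image center, and redistributing it toward off-center regions), the sketch below works on a square grid of attention weights over image patches. It is not the paper's procedure: the function names, the central-square definition of "center", and the simple down-weighting rule with a `strength` parameter are all assumptions made for this example.

```python
import numpy as np


def center_mask(grid: int, frac: float = 0.5) -> np.ndarray:
    """Boolean (grid, grid) mask for a central square spanning `frac` of each side.

    Hypothetical definition of "center"; the paper may define it differently.
    """
    lo = int(round(grid * (1 - frac) / 2))
    hi = grid - lo
    mask = np.zeros((grid, grid), dtype=bool)
    mask[lo:hi, lo:hi] = True
    return mask


def center_attention_share(attn: np.ndarray, frac: float = 0.5) -> float:
    """Fraction of the total attention mass that lies in the central region."""
    mask = center_mask(attn.shape[0], frac)
    return float(attn[mask].sum() / attn.sum())


def redistribute(attn: np.ndarray, strength: float = 0.5, frac: float = 0.5) -> np.ndarray:
    """Down-weight central patches by (1 - strength), then renormalize to sum to 1.

    Assumed, simplified rule: strength=0 leaves the normalized map unchanged,
    strength=1 removes all attention from the central region.
    """
    mask = center_mask(attn.shape[0], frac)
    normalized = attn / attn.sum()
    scaled = np.where(mask, normalized * (1.0 - strength), normalized)
    return scaled / scaled.sum()


if __name__ == "__main__":
    # Synthetic center-peaked attention map over an 8x8 patch grid.
    ys, xs = np.mgrid[0:8, 0:8]
    attn = np.exp(-((ys - 3.5) ** 2 + (xs - 3.5) ** 2) / 8.0)
    print("center share before:", center_attention_share(attn))
    print("center share after: ", center_attention_share(redistribute(attn)))
```

With a real model, `attn` would be replaced by the class-token-to-patch attention weights of a vision encoder layer, reshaped to the patch grid. How and where the adjusted weights are fed back into the model is left open here.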