CV · MM · Feb 24

Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

arXiv:2602.21035v1 · 4 citations · h-index: 21
Originality: Incremental advance
AI Analysis

The paper addresses a critical limitation of vision-language models, their poor handling of negation, which matters for applications that require nuanced language understanding. The contribution is nonetheless an incremental improvement over existing methods.

The paper tackles CLIP's inability to understand negation in visual descriptions. It proposes CLIPGlasses, a plug-and-play framework that improves performance without fine-tuning, and reports competitive in-domain results and superior cross-domain generalization, especially in low-resource settings.

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
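
The abstract describes a modified similarity that penalizes alignment with negated semantics but does not give its formula. The sketch below is one plausible reading, not the paper's implementation: it assumes a simple subtractive form, in which the image's cosine similarity to the negated concept is scaled by a repulsion strength and subtracted from its similarity to the affirmed content. The function names, the toy two-dimensional embeddings, and the fixed repulsion value are all hypothetical. In CLIPGlasses the negated embedding would come from the Lens module and the repulsion strength from the Frame module; here both are supplied by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def negation_aware_similarity(image_emb, affirmed_emb, negated_emb, repulsion):
    """Hypothetical score: reward the affirmed content, penalize the negated concept.

    ASSUMPTION: the subtractive form is illustrative only; the abstract does
    not specify how the repulsion strength enters the similarity.
    """
    attraction = cosine(image_emb, affirmed_emb)
    penalty = repulsion * cosine(image_emb, negated_emb)
    return attraction - penalty


# Toy embedding space: axis 0 = "dog", axis 1 = "park".
# Caption: "a park with no dog".
affirmed = [0.0, 1.0]        # stand-in for the affirmed part ("a park")
negated = [1.0, 0.0]         # stand-in for the Lens output ("dog")
repulsion = 0.5              # stand-in for the Frame output

park_with_dog = [1.0, 1.0]
park_without_dog = [0.0, 1.0]

score_with_dog = negation_aware_similarity(park_with_dog, affirmed, negated, repulsion)
score_without_dog = negation_aware_similarity(park_without_dog, affirmed, negated, repulsion)

print(f"park with dog:    {score_with_dog:.3f}")
print(f"park without dog: {score_without_dog:.3f}")
```

With the penalty in place, the image without a dog scores higher for the caption "a park with no dog" than the image with a dog. That ranking is the behavior the abstract attributes to reducing false positive matches.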

