CV · CL · Jan 19, 2025

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

arXiv:2501.10913v3 · 20 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses a specific limitation of multimodal AI models, their handling of negation, which matters for applications requiring nuanced language understanding. The advance is incremental: it builds on existing CLIP architectures and adds new training data.

The paper tackles CLIP's inability to grasp negation, such as differentiating 'parking' from 'no parking'. It introduces data generation pipelines that use LLMs to produce negation-inclusive captions, then fine-tunes CLIP on that data to obtain NegationCLIP, which improves negation awareness while preserving generality and shows performance gains in tasks like text-to-image generation and referring image segmentation.

While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation, such as failing to differentiate concepts like "parking" from "no parking", poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
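
The "parking" versus "no parking" failure described in the abstract can be made concrete with a small check: given an image embedding, a matching caption embedding, and a negated (non-matching) caption embedding, a negation-aware model should score the matching caption higher. The following is a minimal sketch under stated assumptions, not the paper's evaluation protocol: the function names `cosine` and `negation_accuracy` are hypothetical, and the two-dimensional toy vectors stand in for real CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def negation_accuracy(triples):
    """Fraction of (image, caption, negated_caption) embedding triples
    for which the image is closer to the matching caption than to the
    negated one. Hypothetical metric for illustration only."""
    correct = sum(
        1 for image, caption, negated in triples
        if cosine(image, caption) > cosine(image, negated)
    )
    return correct / len(triples)

# Toy embeddings (assumed values, not produced by any real model).
toy_triples = [
    # Image sits near the matching caption: counted as correct.
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    # Image sits nearer the negated caption: counted as a failure.
    ([0.0, 1.0], [0.7, 0.7], [0.1, 0.9]),
]

accuracy = negation_accuracy(toy_triples)
```

With real models, the vectors would come from the image and text encoders of the CLIP variant under test; comparing this score before and after fine-tuning is one simple way to see whether negation handling changed.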

