CV · AI · May 24, 2025

TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP

arXiv:2505.18434v1 · 3 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses negation awareness in vision-language models, which matters for applications that require accurate semantic understanding. The advance is incremental because it builds on existing data generation methods.

The paper tackles CLIP's limited understanding of negation by introducing a training-time negation data generation pipeline that limits the extra training time to 2.5%. It reports state-of-the-art performance on diverse negation benchmarks covering image-to-text matching, text-to-image retrieval, and image generation.

Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale datasets of image captions containing negation, which are then used to fine-tune CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline in which negation captions are generated during the training stage, adding only 2.5% to the training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
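
The abstract does not describe how the negation captions are built, so the following is only a rough illustration of what "generating negation captions during training" could look like, not the authors' pipeline. It assumes a hypothetical helper, `make_negation_captions`, a hand-written concept list, and a fixed "without a ..." template; a real system would need a better way to decide which concepts are absent from an image.

```python
import random


def make_negation_captions(captions, vocabulary, seed=0):
    """Hypothetical sketch: append a clause negating a concept the caption
    does not mention. This is an assumption for illustration, not the
    method described in the paper."""
    rng = random.Random(seed)
    negated = []
    for caption in captions:
        # Words already present in the caption (lower-cased, period removed).
        words = set(caption.lower().replace(".", "").split())
        # Candidate concepts that the caption does not mention.
        absent = [concept for concept in vocabulary if concept not in words]
        if not absent:
            # Nothing to negate: keep the caption unchanged.
            negated.append(caption)
            continue
        concept = rng.choice(absent)
        negated.append(f"{caption.rstrip('.')} without a {concept}.")
    return negated


# Captions for one training batch (made-up examples).
batch = ["A dog runs on the beach.", "A cat sleeps on a sofa."]
concepts = ["dog", "cat", "car"]  # assumed concept vocabulary
negated = make_negation_captions(batch, concepts)
for original, with_negation in zip(batch, negated):
    print(original, "->", with_negation)
```

Because a step like this runs on captions already loaded for the current batch, it would need no separate offline LLM pass, which is the kind of saving the abstract's 2.5% overhead figure points to.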
