CL, AI, CV, CY, LG
May 28, 2025

NegVQA: Can Vision Language Models Understand Negation?

Stanford
arXiv:2505.22946v1 | 8 citations | h-index: 19 | ACL
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in VLM comprehension for high-stakes applications, but it is an incremental advance because it benchmarks the problem rather than solving it.

The paper addresses the difficulty that vision language models (VLMs) have with negation, a fundamental linguistic phenomenon. It introduces NegVQA, a benchmark of 7,379 two-choice questions, and finds that 20 state-of-the-art VLMs show a substantial performance drop on negated questions relative to the original questions. It also reports a U-shaped scaling trend: performance on NegVQA first degrades as model size increases, then improves.

Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
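The comparison described in the abstract (accuracy on original questions versus their negated versions) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The names `TwoChoiceItem`, `negate_item`, `accuracy`, and `negation_gap` are hypothetical, the questions are toy examples rather than benchmark data, and the sketch assumes that negating a two-choice question flips which choice is correct.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class TwoChoiceItem:
    """Hypothetical container for one two-choice VQA question."""
    question: str
    choices: Tuple[str, str]
    answer_index: int  # 0 or 1


# A predictor maps (question, choices) to the index of the chosen answer.
Predictor = Callable[[str, Tuple[str, str]], int]


def negate_item(item: TwoChoiceItem, negated_question: str) -> TwoChoiceItem:
    """Pair a negated question with the flipped answer.

    Assumption: with exactly two choices, negation swaps the correct one.
    """
    return replace(item, question=negated_question,
                   answer_index=1 - item.answer_index)


def accuracy(items: Sequence[TwoChoiceItem], predict: Predictor) -> float:
    """Fraction of items where the predictor picks the correct choice."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(predict(it.question, it.choices) == it.answer_index
                  for it in items)
    return correct / len(items)


def negation_gap(original: Sequence[TwoChoiceItem],
                 negated: Sequence[TwoChoiceItem],
                 predict: Predictor) -> float:
    """Accuracy on original questions minus accuracy on negated ones."""
    return accuracy(original, predict) - accuracy(negated, predict)


# Toy data (illustrative only, not taken from NegVQA).
original_items = [
    TwoChoiceItem("Is the cat on the sofa?", ("yes", "no"), 0),
    TwoChoiceItem("Which animal is shown?", ("dog", "bird"), 0),
]
negated_items = [
    negate_item(original_items[0], "Is the cat not on the sofa?"),
    negate_item(original_items[1], "Which animal is not shown?"),
]


def ignores_negation(question: str, choices: Tuple[str, str]) -> int:
    """Toy predictor that answers as if the question were never negated."""
    return 0


gap = negation_gap(original_items, negated_items, ignores_negation)
```

A predictor that ignores negation entirely answers every original toy question correctly and every negated one incorrectly, so the gap is as large as it can be. A real evaluation would replace `ignores_negation` with a call to a VLM that also receives the image.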
