CV LG Jun 5

Diagnosing Visual Ignorance in Vision-Language Models

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

arXiv:2606.06890 20.8

Originality: Incremental advance

AI Analysis

For researchers and practitioners evaluating VLMs, this work reveals that language-prior reliance is a systematic routing failure that undermines benchmark validity, highlighting the need for structurally isolated or counterfactual data to enforce genuine cross-modal grounding.

This paper investigates how Vision-Language Models (VLMs) rely on language priors rather than visual evidence. Using mechanistic analysis and a progressive visual decay metric, it shows that a substantial fraction of benchmark examples remain answerable even under severe visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance.

Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
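The progressive visual decay idea can be illustrated with a small sketch. The abstract does not give the metric's exact definition, so everything below is an assumption made for illustration: the blur schedule `sigmas`, the hypothetical `answer_fn(image, question)` callable standing in for a VLM, the grayscale list-of-lists image format, and the choice to score an example by the first blur level at which its answer changes.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]  # assumed format: grayscale pixels in [0, 1]


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image: Image, kernel: Sequence[float]) -> Image:
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image: Image) -> Image:
    return [list(col) for col in zip(*image)]


def gaussian_blur(image: Image, sigma: float) -> Image:
    """Separable Gaussian blur: rows first, then columns."""
    if sigma <= 0:
        return [row[:] for row in image]
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def decay_step(answer_fn: Callable[[Image, str], str],
               image: Image,
               question: str,
               reference: str,
               sigmas: Sequence[float]) -> int:
    """Index of the first blur level whose answer differs from `reference`.

    Returns len(sigmas) when the answer never changes, which under this
    sketch's assumption flags the example as answerable without the image.
    """
    for step, sigma in enumerate(sigmas):
        if answer_fn(gaussian_blur(image, sigma), question) != reference:
            return step
    return len(sigmas)
```

In this sketch, an example whose `decay_step` equals `len(sigmas)` keeps its answer at every blur level, so it would be a candidate for the "answerable under severe visual obfuscation" category the abstract describes. A real evaluation would replace `answer_fn` with an actual model call and would likely use an image library's blur rather than this pure-Python version.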

