CV · AI · Sep 3, 2025

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

arXiv:2509.03025v2 · 1 citation · h-index: 5 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a specific failure mode of LVLMs, and fixing it can improve reliability for users of multimodal AI applications. The advance is incremental because it builds on existing model architectures.

The study tackles a failure mode of large vision-language models (LVLMs): they treat text inputs that have no visual evidence as if they were part of the image, which leads to incorrect responses. The authors develop a detection module based on Visual Absence-aware (VA) neurons to mitigate the issue and show that it is effective across various LVLMs.

Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, we find that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text input, and that it generalizes across various LVLMs.
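The abstract does not spell out how the detection module is built, so the following is only an illustrative sketch of the general idea: pick the FFN neurons whose activations differ most between visually absent and visually grounded tokens, then fit a small classifier on those activations. Everything here is an assumption rather than the authors' implementation: the activations are synthetic, the mean-gap selection rule and the logistic-regression probe are stand-ins, and the names `select_neurons`, `fit_probe`, and `predict_absent` are hypothetical.

```python
import numpy as np


def select_neurons(acts, labels, k):
    """Return indices of the k neurons with the largest absolute gap in mean
    activation between absent (label 1) and grounded (label 0) tokens.
    This selection rule is an assumption, not the paper's criterion."""
    gap = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return np.argsort(-np.abs(gap))[:k]


def fit_probe(x, y, lr=0.5, steps=200):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(absent)
        grad = p - y                            # dLoss/dLogit per token
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict_absent(x, w, b):
    """True where the probe classifies a token as visually absent."""
    return (x @ w + b) > 0


# Synthetic stand-in for per-token FFN activations (NOT real model data).
rng = np.random.default_rng(0)
n_tokens, n_neurons = 400, 64
labels = rng.integers(0, 2, n_tokens)          # 1 = absent, 0 = grounded
acts = rng.normal(size=(n_tokens, n_neurons))
signal_neurons = np.array([3, 17, 42])         # hypothetical "VA neurons"
acts[:, signal_neurons] += 3.0 * labels[:, None]

idx = select_neurons(acts, labels, k=3)

# Standardize the selected activations so gradient descent is well behaved.
mu = acts[:, idx].mean(axis=0)
sd = acts[:, idx].std(axis=0)
feats = (acts[:, idx] - mu) / sd

w, b = fit_probe(feats, labels.astype(float))
accuracy = (predict_absent(feats, w, b) == labels.astype(bool)).mean()
print(sorted(idx.tolist()), accuracy)
```

In a real LVLM the `acts` matrix would come from recorded FFN activations for each input text token, and the probe would be evaluated on held-out tokens rather than on the data it was fitted to.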

