CV, AI, CL | Aug 22, 2025

Can VLMs Recall Factual Associations From Visual References?

Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason

arXiv:2508.18297v1 | 2 citations | h-index: 6 | EMNLP

Originality: Incremental advance

AI Analysis

The paper addresses a critical problem in multimodal AI for users who rely on VLMs for accurate visual understanding. The contribution is incremental because it focuses on detecting the deficiency rather than solving it.

The study finds that Vision Language Models (VLMs) are systematically worse at recalling factual knowledge when an entity is referenced visually rather than textually: forcing reliance on the image halves their recall. Probes on the models' internal states flag unreliable responses with over 92% accuracy. Used for selective prediction in visual question answering, the probes increase coverage by 7.87% and reduce the risk of error by 0.9% (both absolute).

Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
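To make the selective prediction metrics concrete, here is a minimal sketch of how coverage and risk are usually computed when a detector decides which questions a model should abstain on. It uses the standard definitions (coverage is the fraction of questions answered, risk is the error rate among answered questions). The function name, the threshold, and the toy scores are hypothetical illustrations, not the paper's probes or data.

```python
from typing import Sequence, Tuple


def coverage_and_risk(
    correct: Sequence[bool],
    unreliable_score: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, risk) for a selective predictor.

    The model answers only when the detector's "unreliable" score is
    below `threshold`; otherwise it abstains.
    coverage = answered questions / all questions
    risk     = wrong answers / answered questions
    """
    if len(correct) != len(unreliable_score):
        raise ValueError("inputs must have the same length")
    if not correct:
        return 0.0, 0.0

    # Keep the correctness flags of the questions the model chose to answer.
    answered = [c for c, s in zip(correct, unreliable_score) if s < threshold]
    coverage = len(answered) / len(correct)
    risk = sum(1 for c in answered if not c) / len(answered) if answered else 0.0
    return coverage, risk


# Hypothetical toy data: whether each answer was right, and a made-up
# detector score in [0, 1] where higher means "likely unreliable".
toy_correct = [True, True, False, True, False]
toy_scores = [0.1, 0.2, 0.9, 0.4, 0.7]

cov, risk = coverage_and_risk(toy_correct, toy_scores, threshold=0.5)
print(f"coverage={cov:.2f} risk={risk:.2f}")
```

A better detector lets the model answer more questions (higher coverage) at the same or lower error rate among those it answers (lower risk), which is the trade-off the reported absolute changes describe.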
