CV · Nov 28, 2025

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

arXiv:2511.22843v2
Originality: Incremental advance
AI Analysis

This work addresses a critical flaw in MKB-VQA evaluation that matters to researchers in the area. It is rated incremental because it focuses on benchmark improvement rather than a new paradigm.

The paper tackles visual shortcuts in Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks, where models exploit image cues to match the primary entity of the target document. It introduces the RETINA benchmark, which removes these shortcuts by pairing queries with images of secondary subjects. On RETINA, existing models' performance degrades significantly.

Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries that reference secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities. Unlike prior work, which uses only a single image per document, MIMIR handles RETINA effectively. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
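The contrast between single-image and multi-image document representations can be illustrated with a toy retrieval example. This is only a sketch under stated assumptions, not the MIMIR method: the abstract does not say how MIMIR combines multiple images, so the max-similarity scoring below is an illustrative stand-in, and the document names, 3-dimensional embeddings, and function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: each document has one image of its primary
# entity and one image of a related (secondary) entity.
DOCS = {
    "doc_a": {"primary": [1.0, 0.0, 0.0], "related": [[0.0, 1.0, 0.0]]},
    "doc_b": {"primary": [0.3, 0.2, 0.9], "related": [[0.0, 0.1, 1.0]]},
}

def score_single_image(query, doc):
    # Single image per document: only the primary entity is visible.
    return cosine(query, doc["primary"])

def score_multi_image(query, doc):
    # Assumed aggregation (not from the paper): best match over all images.
    images = [doc["primary"]] + doc["related"]
    return max(cosine(query, img) for img in images)

def top_document(query, docs, score_fn):
    """Return the name of the highest-scoring document."""
    return max(docs, key=lambda name: score_fn(query, docs[name]))

# RETINA-style query: the image shows doc_a's related entity, not its
# primary entity, so matching on the primary image no longer helps.
query = [0.1, 0.9, 0.0]

print(top_document(query, DOCS, score_single_image))
print(top_document(query, DOCS, score_multi_image))
```

In this toy setup the single-image scorer retrieves the wrong document because the query image resembles neither primary entity, while the multi-image scorer recovers the target through the related-entity image.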

