CV · AI · Feb 17

MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

arXiv:2602.15915v1 · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of integrating external and internal knowledge for more accurate and interpretable VQA. The advance is incremental: it adds a new filtering mechanism on top of existing MLLM backbones.

The paper tackles noisy and misaligned knowledge in knowledge-based visual question answering by proposing MaS-VQA, a framework that filters out irrelevant visual and textual knowledge before reasoning. It reports consistent performance gains on the Encyclopedic-VQA and InfoSeek benchmarks.

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
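
The Mask-and-Select step described above can be illustrated with a minimal sketch. It assumes the question, image regions, and knowledge fragments are already embedded as vectors in a shared space, and it uses cosine similarity against a fixed threshold as a stand-in relevance score. The abstract does not specify the scoring function, and the paper prunes regions and fragments jointly, whereas this toy version scores each one independently against the question. All names here (`cosine`, `mask_and_select`, `apply_mask`, `threshold`) are hypothetical and not taken from the paper.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mask_and_select(query, regions, fragments, threshold=0.5):
    """Build keep-masks for image regions and knowledge fragments.

    Assumption: relevance is cosine similarity to the question embedding,
    and an item is kept when its score reaches `threshold`. This is an
    illustrative simplification, not the scoring used by MaS-VQA.
    """
    region_mask = [cosine(query, r) >= threshold for r in regions]
    fragment_mask = [cosine(query, f) >= threshold for f in fragments]
    return region_mask, fragment_mask


def apply_mask(items, mask):
    """Keep only the items whose mask entry is True."""
    return [item for item, keep in zip(items, mask) if keep]


if __name__ == "__main__":
    # Toy 2-D embeddings; real systems would use encoder outputs.
    question = [1.0, 0.0]
    region_vecs = [[1.0, 0.0], [0.0, 1.0]]
    fragment_vecs = [[0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
    fragment_text = ["on-topic passage", "contradictory passage", "partly related passage"]

    r_mask, f_mask = mask_and_select(question, region_vecs, fragment_vecs)
    print("kept regions:", apply_mask(region_vecs, r_mask))
    print("kept fragments:", apply_mask(fragment_text, f_mask))
```

In this sketch the surviving regions and fragments would then be passed to the MLLM as compact context. Raising `threshold` trades recall for a cleaner context, which is the kind of noise reduction the ablations in the abstract refer to.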

