CV, IR, MM | Apr 5, 2025

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

arXiv:2504.04065v2 | 1 citation | h-index: 6
AI Analysis

This work addresses performance bottlenecks in KB-VQA systems for applications requiring precise multimodal understanding, representing an incremental advancement through integration of existing mechanisms.

The paper tackles the limited interaction between knowledge retrieval and answer generation in knowledge-based vision question answering (KB-VQA). It proposes a unified framework with collaborative parametric knowledge calibration and reports a 4.7% improvement in answering accuracy and an average 7.5% boost in the VQA performance of base multimodal large language models.

Knowledge-based Vision Question Answering (KB-VQA) systems address complex, visually grounded questions with knowledge retrieved from external knowledge bases. Knowledge retrieval and answer generation both require precise multimodal understanding of the question context and the external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing and ultimately leads to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
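
The abstract names a late interaction mechanism without giving its scoring rule. As a rough illustration only, the sketch below assumes the commonly used ColBERT-style MaxSim form: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The function names (`late_interaction_score`, `rank_documents`) and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim score (assumed form): sum over query tokens of the best
    cosine similarity against any document token."""
    if not query_tokens or not doc_tokens:
        return 0.0
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    score = 0.0
    for qv in q:
        # Token-level matching: keep only the closest document token.
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


def rank_documents(query_tokens, docs):
    """Return document ids sorted from highest to lowest MaxSim score."""
    scores = {doc_id: late_interaction_score(query_tokens, toks)
              for doc_id, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional token embeddings (hypothetical data).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[2.0, 0.0], [0.0, 3.0]],  # covers both query tokens
    "doc_b": [[1.0, 0.0]],              # covers only the first query token
}
ranking = rank_documents(query, docs)
```

Scoring at the token level, rather than comparing one pooled vector per question and per document, is what lets this kind of retriever reward documents that match several distinct parts of a question.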

