Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering
This work addresses the challenge of insufficient external knowledge retrieval for multimodal models in visual question answering, representing an incremental advancement over existing RAG methods.
The paper tackles the limited performance of multimodal models on knowledge-intensive visual question answering by proposing MI-RAG, a multimodal iterative RAG framework that enhances retrieval and knowledge synthesis. It reports significant improvements in retrieval recall and answer accuracy on benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA.
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved multimodal understanding and reasoning. However, the performance of MLLMs on knowledge-intensive visual questions, which require external knowledge beyond the visual content of an image, remains limited. While Retrieval-Augmented Generation (RAG) has become a promising way to provide models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and incorporates knowledge synthesis to refine its understanding. At each iteration, the model formulates a reasoning-guided multi-query to explore multiple facets of knowledge. These queries then drive a joint search across heterogeneous knowledge bases, retrieving diverse knowledge. The retrieved knowledge is synthesized to enrich the reasoning record, progressively deepening the model's understanding. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
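The iteration described above (multi-query formulation, joint search over several knowledge bases, synthesis into a reasoning record) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`iterative_rag`, `ReasoningRecord`, `make_queries`, `synthesize`, `max_iters`) is hypothetical, the model calls are replaced by plain Python callables, the image input is omitted, and the stopping rule (a fixed iteration budget, or no new passages retrieved) is an assumption, since the abstract does not specify one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interfaces; in a real system these would wrap an MLLM and retrievers.
QueryFn = Callable[[str, List[str]], List[str]]        # (question, notes) -> queries
SearchFn = Callable[[str], List[str]]                  # query -> passages
SynthFn = Callable[[str, List[str], List[str]], str]   # (question, notes, passages) -> note


@dataclass
class ReasoningRecord:
    """Accumulates one synthesized note per iteration."""
    question: str
    notes: List[str] = field(default_factory=list)


def iterative_rag(question: str,
                  make_queries: QueryFn,
                  knowledge_bases: Dict[str, SearchFn],
                  synthesize: SynthFn,
                  max_iters: int = 3) -> ReasoningRecord:
    record = ReasoningRecord(question)
    seen = set()  # passages already retrieved in earlier iterations
    for _ in range(max_iters):
        # 1. Formulate several queries, conditioned on the record so far.
        queries = make_queries(question, record.notes)
        # 2. Joint search: send every query to every knowledge base.
        new_passages = []
        for query in queries:
            for search in knowledge_bases.values():
                for passage in search(query):
                    if passage not in seen:
                        seen.add(passage)
                        new_passages.append(passage)
        # Assumed stopping rule: quit when nothing new comes back.
        if not new_passages:
            break
        # 3. Synthesize the new passages into the reasoning record.
        record.notes.append(synthesize(question, record.notes, new_passages))
    return record


# Toy stand-ins with placeholder data, so the loop can be run end to end.
def substring_search(corpus: List[str]) -> SearchFn:
    return lambda query: [p for p in corpus if query.lower() in p.lower()]


def toy_queries(question: str, notes: List[str]) -> List[str]:
    # First round asks about the visible entity; later rounds follow up.
    return ["tower-A"] if not notes else ["engineer-B"]


def toy_synthesize(question: str, notes: List[str], passages: List[str]) -> str:
    return " | ".join(passages)


knowledge_bases = {
    "text": substring_search(["tower-A was built by engineer-B",
                              "engineer-B was born in city-C"]),
    "image": substring_search(["caption: tower-A at night"]),
}

record = iterative_rag("Where was the builder of this tower born?",
                       toy_queries, knowledge_bases, toy_synthesize)
for step, note in enumerate(record.notes, start=1):
    print(step, note)
```

In this toy run the first iteration only surfaces who built the tower, and the second iteration uses that note to retrieve the birthplace. This is the kind of compositional, multi-hop question a single retrieval pass tends to miss.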