Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
This work addresses inefficiencies in Multimodal Retrieval-Augmented Generation (MRAG) for multimodal AI systems, offering incremental improvements in computational efficiency and response accuracy.
The paper tackles static retrieval strategies and suboptimal use of retrieved information in Multimodal Retrieval-Augmented Generation (MRAG). It introduces Windsock for adaptive retrieval and modality selection, and DANCE Instruction Tuning for better utilization of retrieved content. The reported results are a 17.07% improvement in generation quality and an 8.95% reduction in retrieval times.
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method for generating factual and up-to-date responses from Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, which raises three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module that decides whether retrieval is necessary and which modality to select, reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach that leverages knowledge within MLLMs to convert question-answering datasets into MRAG training datasets. Extensive experiments demonstrate that our proposed method improves generation quality by 17.07% while reducing retrieval times by 8.95%.
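As an illustration of the "when to retrieve" and "what modality" decisions described above, the following is a minimal sketch of a query-dependent router. It is not the paper's implementation: the three-way label set (`NONE`, `TEXT`, `IMAGE`), the keyword-based `toy_scorer`, and the `retrievers` and `generate` callables are all hypothetical stand-ins chosen for illustration. A real router would be a learned model.

```python
from enum import Enum
from typing import Callable, Dict, List


class Route(Enum):
    """Assumed label set: skip retrieval, or retrieve one modality."""
    NONE = "none"    # answer from the model's parametric knowledge only
    TEXT = "text"    # retrieve text passages
    IMAGE = "image"  # retrieve images


def toy_scorer(query: str) -> Dict[Route, float]:
    """Hypothetical stand-in for a learned router: keyword heuristics only."""
    q = query.lower()
    scores = {Route.NONE: 0.5, Route.TEXT: 0.0, Route.IMAGE: 0.0}
    if any(w in q for w in ("latest", "who", "when")):
        scores[Route.TEXT] = 1.0
    if any(w in q for w in ("look like", "color", "shape")):
        scores[Route.IMAGE] = 2.0
    return scores


def choose_route(scores: Dict[Route, float]) -> Route:
    """Pick the highest-scoring route."""
    return max(scores, key=lambda r: scores[r])


def answer(
    query: str,
    scorer: Callable[[str], Dict[Route, float]],
    retrievers: Dict[Route, Callable[[str], List[str]]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Retrieve only when the router asks for it, then generate."""
    route = choose_route(scorer(query))
    # Skipping the retriever call entirely is where the cost saving comes from.
    context = [] if route is Route.NONE else retrievers[route](query)
    return generate(query, context)
```

Making the decision per query means the retriever is never invoked for questions the model can already answer, which is the source of the overhead reduction the abstract describes.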