AI · Feb 24, 2025

Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts

Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun

arXiv:2502.17297v2 · 13 citations · h-index: 10 · Has Code · MM

Originality: Incremental advance

AI Analysis

This work aims to improve multi-modal RAG performance for researchers and practitioners. It is an incremental advance, since it builds on existing RAG and MLLM frameworks.

The paper tackles the underexplored potential of Multi-modal Large Language Models (MLLMs) to leverage multi-modal contextual information in Retrieval-Augmented Generation (RAG). It introduces the M²RAG benchmark and the MM-RAIT instruction tuning method, and reports gains of 34% and 33% over MiniCPM-V 2.6 and Qwen2-VL, respectively.

With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M²RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments demonstrate the effectiveness of MM-RAIT by significantly improving the quality of responses generated by different RAG models, outperforming MiniCPM-V 2.6 and Qwen2-VL with 34% and 33% gains, respectively. All data and code are available at https://github.com/NEUIR/M2RAG.
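
The open-domain setting described in the abstract (retrieve query-relevant documents from a multi-modal collection, then pass them to the model as context) can be illustrated with a minimal sketch. This is not the M²RAG code: the names `Doc`, `retrieve`, and `build_context`, the `<image:...>` placeholder format, and the toy two-dimensional embeddings are all assumptions made for illustration. A real system would obtain embeddings from a multi-modal encoder and pass the actual images to the MLLM.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical retrieval candidate: text, optional image, toy embedding."""
    text: str
    image_path: Optional[str]
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb: List[float], docs: List[Doc], k: int = 2) -> List[Doc]:
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query_emb, d.embedding), reverse=True)
    return ranked[:k]


def build_context(question: str, retrieved: List[Doc]) -> str:
    """Assemble retrieved documents and the question into one prompt string.

    The image placeholder is an assumed convention, not a real model format.
    """
    lines = []
    for i, d in enumerate(retrieved, 1):
        image = f"<image:{d.image_path}>" if d.image_path else "(no image)"
        lines.append(f"[{i}] {image} {d.text}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy collection with hand-written embeddings (assumed values).
docs = [
    Doc("A red bridge over a bay", "bridge.jpg", [1.0, 0.0]),
    Doc("A bowl of ramen", "ramen.jpg", [0.0, 1.0]),
    Doc("Fog around a suspension bridge", None, [0.9, 0.1]),
]
top = retrieve([1.0, 0.05], docs, k=2)
print(build_context("Which bridge is shown?", top))
```

The resulting string stands in for the contextual input that the generation step would consume. An instruction tuning method such as MM-RAIT would then train the model on inputs of this general kind, though the exact prompt format used in the paper is not given in the abstract.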
