Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
Visual-RAG addresses the problem of assessing how multimodal RAG systems integrate visual evidence, which is relevant to researchers and developers of such systems. The contribution is incremental, since it builds on existing RAG and MLLM benchmarks.
The authors tackle the lack of benchmarks for evaluating how multimodal large language models use retrieved images in retrieval-augmented generation. They introduce Visual-RAG, a benchmark of visually grounded, knowledge-intensive questions, and find that although images provide strong evidence, state-of-the-art models struggle to efficiently extract and utilize visual knowledge.
Retrieval-augmented generation (RAG) is a paradigm that augments large language models (LLMs) with external knowledge to tackle knowledge-intensive question answering. While several benchmarks evaluate Multimodal LLMs (MLLMs) under multimodal RAG settings, they predominantly retrieve from textual corpora and do not explicitly assess how models exploit visual evidence during generation. Consequently, there is still no benchmark that isolates and measures the contribution of retrieved images in RAG. We introduce Visual-RAG, a question-answering benchmark that targets visually grounded, knowledge-intensive questions. Unlike prior work, Visual-RAG requires text-to-image retrieval and the integration of retrieved clue images to extract visual evidence for answer generation. With Visual-RAG, we evaluate 5 open-source and 3 proprietary MLLMs, showing that images provide strong evidence in augmented generation. However, even state-of-the-art models struggle to efficiently extract and utilize visual knowledge. Our results highlight the need for improved visual retrieval, grounding, and attribution in multimodal RAG systems.
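The text-to-image retrieval step described above can be illustrated with a minimal sketch. It assumes the query and candidate images have already been embedded into a shared vector space, ranks images by cosine similarity, and checks whether any annotated clue image appears in the top k. The function names, the toy two-dimensional vectors, and the hit-at-k check are hypothetical and chosen for illustration only; they are not the retriever, data, or metric used by the authors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, image_vecs, k):
    """Return the ids of the k images most similar to the text query."""
    ranked = sorted(
        image_vecs,
        key=lambda img_id: cosine(query_vec, image_vecs[img_id]),
        reverse=True,
    )
    return ranked[:k]


def hit_at_k(retrieved_ids, clue_ids):
    """True if at least one retrieved image is an annotated clue image."""
    return any(img_id in clue_ids for img_id in retrieved_ids)


# Toy embeddings (assumed values, not real model outputs).
image_vecs = {
    "img_a": [1.0, 0.0],
    "img_b": [0.0, 1.0],
    "img_c": [0.7, 0.7],
}
query_vec = [0.9, 0.1]

top2 = retrieve_top_k(query_vec, image_vecs, k=2)
found_clue = hit_at_k(top2, clue_ids={"img_c"})
```

In a full pipeline of this kind, the retrieved images would then be passed to an MLLM together with the question, so the answer can be compared against a run without images to gauge how much the visual evidence contributes.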