CV · Oct 30, 2025

CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan

arXiv:2510.26160v1 · 6 citations · h-index: 28

Originality: Synthesis-oriented

AI Analysis

CRAG-MM addresses the problem of evaluating MM-RAG systems, which matters to researchers and developers working on wearable devices. The contribution is incremental in that it introduces a benchmark rather than a new method.

The authors address the lack of comprehensive benchmarks for multi-modal retrieval-augmented generation (MM-RAG) in wearable device scenarios by creating CRAG-MM, a benchmark with 6.5K image-question-answer triplets and 2K multi-turn conversations across 13 domains. Their evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on single- and multi-turn QA, respectively, and that state-of-the-art industry solutions perform similarly (32%/45%). Winning solutions in the KDD Cup 2025 competition hosted on the benchmark improved baseline performance by 28%.

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
