IR, CV | Feb 17, 2025

REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

Navve Wasserman, Roi Pony, Oshri Naparstek, Adi Raz Goldfarb, Eli Schwartz, Udi Barzelay, Leonid Karlinsky

arXiv:2502.12342v1 | 27.737 citations | h-index: 10 | ACL

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation and improvement of retrieval in multi-modal RAG systems, particularly for real-world applications. It is an incremental advance because it builds on existing RAG frameworks.

The authors address the evaluation of multi-modal document retrieval for RAG systems by introducing REAL-MM-RAG, a benchmark that captures real-world challenges such as multi-modal documents and query rephrasing. The benchmark reveals significant model weaknesses on table-heavy documents and in robustness to rephrasing. The authors then improve retrieval performance by curating training datasets and fine-tuning models on them, achieving state-of-the-art results on their benchmark.

Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.
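To make the idea of rephrasing-based difficulty levels concrete, here is a minimal sketch of how retrieval quality might be reported per rephrasing level. It is not the benchmark's evaluation code: the pages, queries, level numbering, the token-overlap retriever, and the function names (`rank_pages`, `recall_at_k_by_level`) are all hypothetical, and recall@k is used only as a simple stand-in metric. The toy retriever matches keywords only, so it illustrates the failure mode the abstract describes: a rephrased query that shares no vocabulary with the gold page is pulled toward a lexical distractor.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (toy assumption, no stemming)."""
    return set(text.lower().split())


def rank_pages(query, pages):
    """Rank page ids by token overlap with the query (purely lexical toy retriever)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), pid) for pid, text in pages.items()]
    # Sort by descending overlap, then by page id so ties are deterministic.
    return [pid for _, pid in sorted(scored, key=lambda s: (-s[0], s[1]))]


def recall_at_k_by_level(queries, pages, k=1):
    """Fraction of queries whose gold page is in the top k, grouped by rephrasing level."""
    hits = defaultdict(list)
    for level, query, gold in queries:
        hits[level].append(gold in rank_pages(query, pages)[:k])
    return {level: sum(h) / len(h) for level, h in hits.items()}


# Hypothetical corpus: p2 is the gold page, p3 is a lexical distractor.
PAGES = {
    "p1": "operating expenses increased in 2023 due to hiring",
    "p2": "total revenue increased in 2023 driven by cloud services",
    "p3": "sales headcount growth by region",
}

# (rephrasing level, query text, gold page id); level 0 is the original wording.
QUERIES = [
    (0, "why did total revenue increase in 2023", "p2"),
    (1, "which business line drove sales growth", "p2"),
]

if __name__ == "__main__":
    print(recall_at_k_by_level(QUERIES, PAGES, k=1))
```

In this toy setup the keyword retriever finds the gold page for the original wording but ranks the distractor first for the rephrased query, so recall@1 drops from 1.0 at level 0 to 0.0 at level 1. A retriever with real semantic understanding would be expected to show a smaller gap between levels.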
