IR AI CL CV | Jan 15, 2025

MMDocIR: Benchmarking Multimodal Retrieval for Long Documents

Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu

arXiv:2501.08828v330.441 citations | h-index: 12 | EMNLP

Originality: Incremental advance

AI Analysis

MMDocIR provides a standardized evaluation framework for researchers and practitioners in document AI, though the advance is incremental because it builds on existing multimodal retrieval concepts.

The paper tackles the lack of a comprehensive benchmark for multimodal document retrieval by introducing MMDocIR, which includes page-level and layout-level tasks, and shows that visual retrievers outperform text-based ones, with improvements up to 15% in retrieval accuracy.

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems on such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the ability to identify the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to any of a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval, and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text. Our dataset is available at https://mmdocrag.github.io/MMDocIR/.
