IR · CV · May 1, 2025

A Multi-Granularity Retrieval Framework for Visually-Rich Documents

arXiv:2505.01457v25 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the limitation of text-only retrieval in RAG systems when the source documents are multimodal. It is an incremental improvement that relies on a training-free approach.

The paper tackles the problem of retrieving information from visually-rich documents containing text, images, tables, and charts, proposing a multi-granularity multimodal retrieval framework that achieves a top performance score of 65.56 on benchmark tasks.

Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and VLM-based candidate verification significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.
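The abstract describes a training-free hybrid retrieval strategy followed by VLM-based candidate filtering, but gives no implementation details. The following is a minimal sketch of that general pattern, not the authors' method: it assumes each candidate already has a text-similarity score and a visual-similarity score, fuses them with a weighted sum after min-max normalization, and then applies a verifier callback. The names `Candidate`, `hybrid_rank`, `filter_candidates`, the weight `alpha`, and the `verifier` function (a stand-in for a call to a vision-language model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    doc_id: str
    text_score: float    # assumed: similarity from some text retriever
    visual_score: float  # assumed: similarity from some page-image retriever


def min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two modalities are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(cands: List[Candidate], alpha: float = 0.5,
                top_k: int = 3) -> List[Candidate]:
    """Fuse text and visual scores with a weighted sum; no training involved."""
    text = min_max([c.text_score for c in cands])
    visual = min_max([c.visual_score for c in cands])
    fused = [(alpha * t + (1 - alpha) * v, c)
             for t, v, c in zip(text, visual, cands)]
    fused.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in fused[:top_k]]


def filter_candidates(cands: List[Candidate],
                      verifier: Callable[[Candidate], bool]) -> List[Candidate]:
    """Keep candidates the verifier accepts (hypothetical stand-in for a VLM check)."""
    kept = [c for c in cands if verifier(c)]
    # Fall back to the unfiltered ranking if the verifier rejects everything.
    return kept or cands
```

In this sketch `alpha` controls how much weight the text modality receives. In a real system the verifier would prompt a vision-language model with the query and the candidate page and parse a yes/no answer.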

