CV | Oct 31, 2025

RegionRAG: Region-level Retrieval-Augmented Generation for Visually-Rich Documents

Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie

arXiv:2510.27261v1 | 11.83 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses inefficiencies in visually-rich document processing for AI applications, though it is an incremental advance over existing methods.

The paper tackles the problem of irrelevant visual content in multi-modal retrieval-augmented generation by shifting retrieval from document-level to region-level, achieving a 10.02% improvement in retrieval accuracy and a 3.56% increase in question answering accuracy while using fewer visual tokens.

Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods treat the entire document as the basic retrieval unit, which introduces substantial irrelevant visual content in two ways: 1) relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) retrieving multiple documents to increase recall brings in further redundant and irrelevant documents. This redundant context distracts the model's attention and degrades performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy that draws on both labeled and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise visual content relevant to the query, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance: it improves retrieval accuracy by 10.02% in R@1 on average and increases question answering accuracy by 3.56%, while using only 71.42% of the visual tokens used by prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
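The abstract describes grouping salient patches into regions at inference time but does not spell out the procedure. As a rough illustration only, the sketch below assumes a grid of per-patch relevance scores, a fixed threshold, and 4-connected grouping into bounding boxes. All three choices, and the names `group_salient_patches` and `token_fraction`, are hypothetical and are not taken from the RegionRAG code.

```python
from collections import deque
from typing import List, Tuple

# (row_min, col_min, row_max, col_max), all inclusive, in patch-grid coordinates
Box = Tuple[int, int, int, int]


def group_salient_patches(scores: List[List[float]], threshold: float) -> List[Box]:
    """Group patches scoring >= threshold into 4-connected regions.

    Assumes a non-empty rectangular grid. The fixed threshold and the
    4-connectivity rule are illustrative assumptions, not the paper's method.
    Returns one bounding box per connected region, in row-major scan order.
    """
    rows, cols = len(scores), len(scores[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or scores[r][c] < threshold:
                continue
            # Breadth-first search over neighbouring salient patches.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


def token_fraction(boxes: List[Box], rows: int, cols: int) -> float:
    """Fraction of grid patches covered by the union of the boxes."""
    kept = set()
    for r0, c0, r1, c1 in boxes:
        for y in range(r0, r1 + 1):
            for x in range(c0, c1 + 1):
                kept.add((y, x))
    return len(kept) / (rows * cols)


# Toy 4x4 relevance map with two separate salient areas.
example_scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.6],
    [0.0, 0.1, 0.7, 0.9],
]
example_boxes = group_salient_patches(example_scores, threshold=0.5)
example_fraction = token_fraction(example_boxes, rows=4, cols=4)
```

On this toy grid the two salient clusters become two boxes covering half of the patches, which shows how passing only the grouped regions to the generator can reduce the visual-token count relative to passing the whole page.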
