CV | Aug 26, 2025

Enhancing Document VQA Models via Retrieval-Augmented Generation

arXiv:2508.18984v2 | 1 citation | h-index: 31 | ICDAR
Originality: Incremental advance
AI Analysis

This work addresses the practical challenge of handling long documents in Document VQA for real-world applications, though it is incremental as it applies an existing RAG framework to this domain.

The paper tackles the memory inefficiency of multi-page Document VQA systems that either concatenate all pages or rely on large vision-language models. It uses Retrieval-Augmented Generation (RAG) to select evidence before answering, which improves on the "concatenate-all-pages" baseline by up to +22.5 ANLS with text-based retrieval and by +5.0 ANLS with visual retrieval on the MP-DocVQA, DUDE, and InfographicVQA benchmarks.
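For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for Document VQA: each prediction is scored by its best match among the accepted answers, and matches whose normalized edit distance reaches a threshold of 0.5 count as zero. The function names, the lowercasing and whitespace normalization, and the example strings are illustrative assumptions, not the paper's evaluation code. The function returns a value in [0, 1]; reported scores are usually this value multiplied by 100.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted answers, one list per question.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        pred_norm = pred.strip().lower()  # assumed normalization
        best = 0.0
        for gold in golds:
            gold_norm = gold.strip().lower()
            longest = max(len(pred_norm), len(gold_norm))
            nl = levenshtein(pred_norm, gold_norm) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

A near-miss such as `anls(["pariz"], [["paris"]])` receives partial credit, while an unrelated answer receives none, which is why the metric tolerates small OCR or spelling errors.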

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
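The retrieve-then-generate idea described in the abstract can be sketched as follows for the text-based variant. This is a toy illustration under stated assumptions, not the authors' pipeline: it scores each page's OCR text against the question with bag-of-words cosine similarity (a stand-in for whatever retriever and reranker the paper uses), keeps the top-k pages, and builds a prompt from only those pages instead of concatenating the whole document. The names `retrieve_pages` and `build_prompt`, the prompt layout, and the sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_pages(question, page_texts, top_k=2):
    """Return indices of the top_k pages most similar to the question."""
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(text))), idx)
              for idx, text in enumerate(page_texts)]
    # Highest score first; ties broken by page order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


def build_prompt(question, page_texts, top_k=2):
    """Assemble a prompt from the retrieved pages only (hypothetical layout)."""
    selected = sorted(retrieve_pages(question, page_texts, top_k))
    context = "\n\n".join(f"[page {i + 1}] {page_texts[i]}" for i in selected)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


pages = [
    "Invoice total: 420 USD due in March",
    "Shipping address and contact details",
    "Terms and conditions of the agreement",
]
prompt = build_prompt("What is the invoice total?", pages, top_k=1)
```

The resulting prompt contains one page rather than three, so the context passed to the answer generator grows with `top_k` rather than with document length.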

