CV CL Nov 14, 2025

DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon

arXiv:2511.11552v1 16.410 citations h-index: 12

Originality Highly original

AI Analysis

This work addresses the challenge of comprehending long visual documents in fields such as finance and research, where accurate information retrieval is critical. It represents a strong incremental improvement over existing methods.

The paper tackles evidence localization in long visual document understanding. It proposes DocLens, a tool-augmented multi-agent framework that navigates to relevant pages and visual elements, and it reports state-of-the-art performance on benchmarks such as MMLongBench-Doc and FinRAGBench-V, surpassing even human experts.

Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
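The two stages the abstract describes (narrowing from the full document to visual elements on relevant pages, then sampling several answers and adjudicating among them) can be illustrated with a minimal sketch. This is not the authors' implementation: every function and parameter name below is hypothetical, the four callables stand in for whatever agents and tools the paper actually uses, and `majority_vote` is a simple substitute adjudicator chosen only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Sequence


def answer_question(
    question: str,
    pages: Sequence[str],
    select_pages: Callable[[str, Sequence[str]], Sequence[str]],
    find_elements: Callable[[str, str], Sequence[str]],
    sampler: Callable[[str, Sequence[str]], str],
    adjudicator: Callable[[str, Sequence[str]], str],
    num_samples: int = 5,
) -> str:
    """Hypothetical two-stage pipeline: localize evidence, then sample and adjudicate."""
    # Stage 1: narrow the full document down to relevant pages,
    # then to the visual elements on those pages.
    relevant_pages = select_pages(question, pages)
    evidence = [
        element
        for page in relevant_pages
        for element in find_elements(question, page)
    ]

    # Stage 2: draw several candidate answers from the same evidence,
    # then let the adjudicator reduce them to a single answer.
    candidates = [sampler(question, evidence) for _ in range(num_samples)]
    return adjudicator(question, candidates)


def majority_vote(question: str, candidates: Sequence[str]) -> str:
    """Stand-in adjudicator (an assumption): return the most frequent candidate."""
    return Counter(candidates).most_common(1)[0][0]
```

In practice each callable would wrap a model or tool call; passing them in as arguments keeps the control flow visible without committing to any particular model API.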
