DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
This work addresses the challenge of comprehending long visual documents, which matters in fields such as finance and research where accurate information retrieval is critical. It represents a strong incremental improvement over existing methods.
The paper tackles evidence localization in long visual document understanding. It proposes DocLens, a tool-augmented multi-agent framework that navigates to relevant pages and visual elements, and it reports state-of-the-art performance on the MMLongBench-Doc and FinRAGBench-V benchmarks, surpassing even human experts.
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
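The two stages described in the abstract (narrowing the document down to relevant pages and visual elements, then sampling several candidate answers and adjudicating among them) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Page`, `navigate`, `sample_and_adjudicate`, `is_relevant`, and `answer_fn` are hypothetical, visual elements are represented as plain strings, the adjudicator is a simple majority vote swapped in as a stand-in, and returning "unanswerable" when no evidence is found is an illustrative assumption. In a real system the relevance check and the answer function would presumably be calls to a VLM.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    number: int
    text: str
    # Stand-ins for cropped visual elements (charts, tables, figures).
    elements: List[str] = field(default_factory=list)


def navigate(pages: List[Page], is_relevant: Callable[[str], bool]) -> List[Page]:
    """Stage 1 (sketch): keep relevant pages and, within them, relevant elements."""
    selected = []
    for page in pages:
        hits = [e for e in page.elements if is_relevant(e)]
        if hits or is_relevant(page.text):
            selected.append(Page(page.number, page.text, hits))
    return selected


def sample_and_adjudicate(
    evidence: List[Page],
    answer_fn: Callable[[List[Page], int], str],
    n_samples: int = 5,
) -> str:
    """Stage 2 (sketch): draw several candidate answers, then pick a single one.

    Majority voting is an assumed stand-in for the adjudication step.
    """
    if not evidence:
        # Assumption for illustration: no located evidence means no answer.
        return "unanswerable"
    candidates = [answer_fn(evidence, i) for i in range(n_samples)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```

A caller would first run `navigate` over the whole document and pass the result to `sample_and_adjudicate`. Keeping the two stages separate mirrors the abstract's point that answer quality depends on what evidence is located before any answer is generated.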