CL · Apr 15

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

arXiv:2604.13731 · 39.6 · h-index: 9 · Has Code
Predicted impact: top 19% in CL · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the scalability and precision trade-off in multi-page document VQA for practitioners needing efficient, accurate reasoning over long documents.

Doc-V* proposes an OCR-free agentic framework for multi-page document VQA that uses coarse-to-fine navigation and structured memory, achieving up to 47.9% improvement over RAG baselines on out-of-domain benchmarks.

Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline. Further results indicate that the gains come from effective evidence aggregation with selective attention, not from increasing the number of input pages.
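
The coarse-to-fine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: pages are plain strings standing in for page images, word overlap stands in for semantic retrieval, and a fixed rule stands in for the learned policy. All names here (`WorkingMemory`, `retrieve`, `gather_evidence`, `max_fetches`, `min_overlap`) are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(a: str, b: str) -> int:
    """Number of shared words between two strings."""
    return len(tokens(a) & tokens(b))


@dataclass
class WorkingMemory:
    """Hypothetical structured memory: page index -> evidence kept from that page."""
    notes: dict[int, str] = field(default_factory=dict)


def retrieve(summaries: list[str], query: str, k: int) -> list[int]:
    """Rank pages by overlap between their coarse summary and the query."""
    ranked = sorted(range(len(summaries)), key=lambda i: -overlap(summaries[i], query))
    return ranked[:k]


def gather_evidence(
    pages: list[str],
    question: str,
    max_fetches: int = 2,   # assumed budget on full-page fetches
    min_overlap: int = 2,   # assumed relevance threshold
) -> WorkingMemory:
    memory = WorkingMemory()
    # Coarse stage: a cheap overview of every page (stand-in for thumbnails).
    overview = [" ".join(page.split()[:4]) for page in pages]
    # Fine stage: fetch only the top-ranked pages in full and keep relevant ones.
    for idx in retrieve(overview, question, k=max_fetches):
        full_page = pages[idx]
        if overlap(full_page, question) >= min_overlap:
            memory.notes[idx] = full_page
    return memory


pages = [
    "Introduction to the annual report and company history",
    "Revenue in 2023 was 5 million dollars",
    "Appendix with contact details",
]
memory = gather_evidence(pages, "What was the revenue in 2023?")
print(sorted(memory.notes))
```

The point of the sketch is the budget: only `max_fetches` pages are ever read in full, and only pages that pass the relevance check enter memory, which mirrors the abstract's emphasis on selective attention over larger inputs.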

