AI | MM | Oct 29, 2025

ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

arXiv:2510.25668v1 | h-index: 14 | Has Code
Originality: Highly original
AI Analysis

This work addresses efficient and accurate long-document understanding for document-analysis applications, replacing passive reading pipelines with active navigation.

The paper targets a known weakness of vision-language models (VLMs): long, visually complex documents. It introduces ALDEN, a reinforcement learning framework that fine-tunes VLMs as interactive agents that actively navigate a document, and it reports state-of-the-art performance on five long-document benchmarks.

Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
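The difference between the two navigation actions named in the abstract can be illustrated with a toy sketch. This is not ALDEN's implementation: the `DocumentEnv` class, the 1-based page indexing, the plain-text pages, and the word-overlap ranking used for `search` are all assumptions made for illustration. The actual system operates on page images and presumably uses a learned retriever.

```python
from dataclasses import dataclass


@dataclass
class DocumentEnv:
    """Hypothetical stand-in for a long document: one text string per page."""
    pages: list[str]

    def fetch(self, index: int) -> str:
        """Fetch action: return a page directly by its (assumed 1-based) index."""
        if not 1 <= index <= len(self.pages):
            return f"[page {index} out of range 1..{len(self.pages)}]"
        return self.pages[index - 1]

    def search(self, query: str, top_k: int = 1) -> list[int]:
        """Search action: rank pages by word overlap with the query.

        Word overlap is a placeholder for a real retriever (assumption).
        """
        terms = set(query.lower().split())
        scored = []
        for i, text in enumerate(self.pages, start=1):
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, i))
        # Highest overlap first; ties broken by lower page number.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [i for _, i in scored[:top_k]]


env = DocumentEnv(pages=[
    "contents revenue table is on page 3",
    "introduction and company overview",
    "revenue table total revenue was 42 million",
])

# Turn 1: search lands on the contents page, which names the target page.
hits = env.search("where is the revenue table")
# Turn 2: fetch jumps straight to that page by index, using document structure.
answer_page = env.fetch(3)
print(hits, "->", answer_page)
```

In this toy run, search alone ranks the contents page above the page that holds the answer, while a follow-up fetch uses the page reference to reach the evidence directly. That is the kind of structural shortcut the abstract attributes to the fetch action.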

