CV | CL | Oct 22, 2025

SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

arXiv:2510.21850v1 | h-index: 25
Originality: Highly original
AI Analysis

The paper addresses inefficient document navigation by multimodal agents in applications such as GUI control and web navigation. It is assessed as a novel approach rather than an incremental improvement.

Vision-language models struggle to understand long-context visual information in document navigation tasks. The paper proposes SCoPE VLM, which uses a Chain of Scroll mechanism to process only the relevant document segments, reducing memory usage and modeling human-like reading behavior.
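The idea of processing only the relevant segments can be pictured as a loop that keeps a single page in view and decides at each step whether to scroll or answer. The sketch below is a toy illustration under stated assumptions, not the authors' implementation: `Step`, `chain_of_scroll`, and `keyword_policy` are hypothetical names, pages are plain strings rather than images, and a keyword matcher stands in for the model's decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One decision of the (hypothetical) navigation policy."""
    action: str                   # "scroll" or "answer"
    offset: int = 0               # signed number of pages to move when scrolling
    answer: Optional[str] = None  # filled in when action == "answer"
    note: str = ""                # short memory carried to the next step


# Assumed policy signature: (question, visible_page, notes) -> Step
Policy = Callable[[str, str, str], Step]


def chain_of_scroll(
    pages: List[str],
    question: str,
    policy: Policy,
    start: int = 0,
    max_steps: int = 10,
) -> Tuple[Optional[str], List[int]]:
    """Visit one page at a time until the policy answers or steps run out."""
    pos, notes, visited = start, "", []
    for _ in range(max_steps):
        visited.append(pos)
        # Only the current page and the running notes are passed to the policy.
        step = policy(question, pages[pos], notes)
        if step.action == "answer":
            return step.answer, visited
        notes = (notes + " " + step.note).strip()
        # Clamp the new position to the document bounds.
        pos = min(max(pos + step.offset, 0), len(pages) - 1)
    return None, visited


def keyword_policy(question: str, page: str, notes: str) -> Step:
    """Toy stand-in for a VLM: answer if the last question word is on the page."""
    keyword = question.split()[-1].rstrip("?").lower()
    if keyword in page.lower():
        return Step("answer", answer=page)
    return Step("scroll", offset=1, note="no match")


pages = ["Intro text", "Methods overview", "Total revenue was 42"]
answer, visited = chain_of_scroll(pages, "What was the revenue?", keyword_policy)
print(answer, visited)
```

Because the policy only ever receives the visible page plus short notes, the working context stays bounded regardless of document length, which is the property the summary attributes to selective processing.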

Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
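For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows the group-relative advantage normalization commonly used in standard GRPO: each sampled trajectory's reward is compared against the mean and standard deviation of its own group. This is an assumption-laden illustration of the base technique only. The abstract does not specify how the Episodic variant differs, so nothing here should be read as the paper's method, and `group_relative_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled trajectories.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same question, scored 1.0 if correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)
```

Trajectories that score above their group's mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.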

