CV | Apr 22

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

arXiv:2604.21079 | 66.6 | h-index: 11
AI Analysis

For practitioners deploying vision-language models, this method reduces compute overhead while maintaining or improving accuracy, though it is an incremental improvement over existing attention-based approaches.

Foveated Reasoner reduces visual-token count in vision-language models by selectively acquiring high-resolution evidence from image regions only when needed, achieving stronger accuracy under tight token budgets across multiple benchmarks.

Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
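
The control flow described in the abstract (coarse view first, foveate only when needed, inject the high-resolution evidence back into the same trajectory) can be illustrated with a toy sketch. Nothing below comes from the paper's code: `foveated_decode`, `toy_policy`, `Trajectory`, the one-token-per-cell accounting, and the hard `token_budget` check are all hypothetical stand-ins. In particular, the paper discourages "see-everything" behavior through reinforcement learning, whereas this sketch simply refuses crops that would exceed a fixed budget.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, height, width) in full-res pixels
FACTOR = 4  # downsampling factor for the coarse view (illustrative choice)


def downsample(image: Image, factor: int) -> Image:
    """Average-pool non-overlapping factor x factor blocks into a coarse view."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / (factor * factor)
            for c in range(0, w - factor + 1, factor)
        ]
        for r in range(0, h - factor + 1, factor)
    ]


def crop(image: Image, box: Box) -> Image:
    """Cut a full-resolution region out of the original image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


@dataclass
class Trajectory:
    """A single decoding trajectory holding visual evidence and the answer."""
    steps: List[Tuple[str, object]] = field(default_factory=list)
    visual_tokens: int = 0

    def add_visual(self, kind: str, patch: Image) -> None:
        self.steps.append((kind, patch))
        # Toy assumption: every cell of a patch costs one visual token.
        self.visual_tokens += len(patch) * len(patch[0])


def foveated_decode(image: Image, policy: Callable, factor: int = FACTOR,
                    token_budget: int = 64, max_steps: int = 8) -> Trajectory:
    traj = Trajectory()
    traj.add_visual("coarse", downsample(image, factor))  # start from low-res
    for _ in range(max_steps):
        action, payload = policy(traj)
        if action == "answer":
            traj.steps.append(("answer", payload))
            return traj
        if action == "foveate":
            patch = crop(image, payload)
            cost = len(patch) * len(patch[0])
            if traj.visual_tokens + cost > token_budget:
                # Assumption: a hard budget stands in for the learned penalty.
                traj.steps.append(("budget_exceeded", payload))
                continue
            traj.add_visual("fovea", patch)  # inject evidence into same trajectory
    traj.steps.append(("answer", None))  # ran out of steps without answering
    return traj


def toy_policy(traj: Trajectory):
    """Stand-in for the model: look once at the brightest coarse cell, then answer."""
    foveas = [p for kind, p in traj.steps if kind == "fovea"]
    if foveas:
        return "answer", max(max(row) for row in foveas[-1])
    coarse = traj.steps[0][1]
    cells = [(r, c) for r in range(len(coarse)) for c in range(len(coarse[0]))]
    r, c = max(cells, key=lambda rc: coarse[rc[0]][rc[1]])
    return "foveate", (r * FACTOR, c * FACTOR, FACTOR, FACTOR)


# A 16x16 image that is dark except for one bright pixel.
image = [[0.0] * 16 for _ in range(16)]
image[9][6] = 1.0
traj = foveated_decode(image, toy_policy)
print([kind for kind, _ in traj.steps], traj.visual_tokens)
```

In this toy run the trajectory holds 32 visual tokens (16 coarse plus 16 from one crop), compared with 256 if the whole 16x16 image were tokenized at full resolution under the same one-token-per-cell assumption.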

