CV · May 29

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

arXiv:2605.31457 · 89.8
AI Analysis

This work tackles the inference-time cost of large multimodal models (LMMs), a critical obstacle to their real-world deployment.

The paper addresses inference-time overhead in large multimodal models (LMMs) by showing that the critical visual evidence changes from one reasoning step to the next. It proposes VisionPulse, a framework that prunes visual tokens step by step, retaining only 5% of visual tokens per step and shortening reasoning traces by 11.2% with almost no change in accuracy.

With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming that the required visual evidence remains static during reasoning. However, we empirically show that visual evidence is strongly step-dependent: only a sparse subset of visual tokens is critical at each decoding step, and this critical set evolves over the course of reasoning. Furthermore, we identify a coupled bottleneck in which redundant visual context can steer the model toward query-irrelevant regions, lengthening the reasoning trace. Guided by these insights, we propose VisionPulse, a framework that prunes visual tokens step by step during reasoning. VisionPulse computes a lightweight visual attention mass and uses it to estimate the step-wise retention budget, exploiting its strong positive correlation with LMMs' effective visual token usage, and then retains only the most critical tokens under this budget. By enforcing visual sparsity during reasoning, VisionPulse filters out redundant visual context while preserving relevant visual evidence, which naturally shortens reasoning traces. Extensive experiments show that VisionPulse retains only 5% of visual tokens per step and shortens reasoning traces by 11.2%, while keeping accuracy almost unchanged.
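The step-wise procedure described in the abstract (measure visual attention mass, turn it into a retention budget, keep the top tokens under that budget) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `min_keep` and `max_ratio` parameters, and above all the rule `budget = ceil(mass * number_of_visual_tokens)` are assumptions made for illustration; the abstract does not say how attention mass maps to a budget or how token criticality is scored. Here a token's criticality is simply the attention it receives at the current step.

```python
import math


def visual_attention_mass(attn, visual_idx):
    """Fraction of this step's attention that lands on visual tokens."""
    total = sum(attn)
    if total <= 0:
        return 0.0
    return sum(attn[i] for i in visual_idx) / total


def stepwise_keep(attn, visual_idx, min_keep=1, max_ratio=1.0):
    """Return the visual token positions to keep for one decoding step.

    Hypothetical budget rule (an assumption, not taken from the paper):
    budget = ceil(mass * n_visual), clamped to [min_keep, max_ratio * n_visual].
    """
    n = len(visual_idx)
    if n == 0:
        return []
    mass = visual_attention_mass(attn, visual_idx)
    upper = max(min_keep, math.floor(max_ratio * n))
    budget = min(upper, max(min_keep, math.ceil(mass * n)))
    # Rank visual tokens by the attention they receive, highest first.
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:budget])


# Toy sequence: positions 0-4 are visual tokens, positions 5-6 are text tokens.
visual = [0, 1, 2, 3, 4]
step_1 = [0.10, 0.02, 0.03, 0.12, 0.03, 0.30, 0.40]
step_2 = [0.01, 0.20, 0.02, 0.01, 0.06, 0.30, 0.40]

keep_1 = stepwise_keep(step_1, visual)
keep_2 = stepwise_keep(step_2, visual)
print(keep_1, keep_2)
```

In this toy example both steps place about 30% of their attention on visual tokens, so the assumed rule gives the same budget of two tokens, but the tokens selected differ between the steps. That is the step-dependence the abstract argues a single prefill-time pruning pass cannot capture.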

