CV | AI | May 9

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

arXiv:2605.16359
AI Analysis

For practitioners deploying large multimodal models, this work provides a practical, training-free solution to reduce inference cost by pruning visual tokens, though the gains are incremental over existing methods.

The paper investigates how many visual tokens are needed for vision-language models under a fixed budget and proposes F^3A, a training-free pruning method that allocates tokens via question-conditioned cues and sparse sensing, achieving up to 2x speedup with minimal performance loss across model scales.

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.
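To make the idea of question-conditioned allocation under a fixed budget concrete, here is a minimal sketch in plain Python. It is not the F^3A algorithm, whose details are not given in the abstract. It is a simplified stand-in that assumes cosine similarity in place of the frozen sparse sensing heads, and a two-pass rule (one token per coarse grid region, then global top-k) in place of the localization, refinement, competition, and recovery stages. The names `cosine`, `relevance_scores`, `select_tokens`, and the `cell` parameter are hypothetical and introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def relevance_scores(tokens, cues):
    """Score each visual token by its best match against any question cue.

    Assumption: cues and tokens live in the same embedding space.
    """
    return [max(cosine(t, c) for c in cues) for t in tokens]

def select_tokens(scores, grid_h, grid_w, budget, cell=2):
    """Pick at most `budget` token indices from a row-major grid_h x grid_w grid.

    Pass 1 keeps the best token of each cell x cell region (coverage),
    visiting regions from most to least relevant. Pass 2 spends whatever
    budget is left on the highest-scoring tokens not yet kept.
    """
    if len(scores) != grid_h * grid_w:
        raise ValueError("scores must have grid_h * grid_w entries")
    budget = max(0, min(budget, len(scores)))

    # Best token index per coarse region.
    best = {}
    for idx, s in enumerate(scores):
        r, c = divmod(idx, grid_w)
        key = (r // cell, c // cell)
        if key not in best or s > scores[best[key]]:
            best[key] = idx

    # Pass 1: one representative per region, most relevant regions first.
    winners = sorted(best.values(), key=lambda i: scores[i], reverse=True)
    kept = winners[:budget]

    # Pass 2: fill the remaining budget with the best unselected tokens.
    chosen = set(kept)
    rest = sorted((i for i in range(len(scores)) if i not in chosen),
                  key=lambda i: scores[i], reverse=True)
    kept += rest[:budget - len(kept)]

    # Return indices in original order so positional structure is preserved.
    return sorted(kept)
```

With a 4 x 4 grid and a budget of 6, the first pass spends four slots on one token per 2 x 2 region, and the second pass spends the remaining two on the strongest leftover matches. A pure top-k rule would instead concentrate every slot in the single most relevant region, which is the failure mode that coverage-preserving allocation is meant to avoid.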

