CV · AI · Jan 24, 2025

Dynamic Token Reduction during Generation for Vision Language Models

arXiv:2501.14204v1 · 3 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses practical limitations in VLMs for multimodal tasks, offering an incremental improvement over existing token reduction methods.

The paper tackles the computational inefficiency of Vision-Language Models (VLMs) during generation by introducing Dynamic Rate (DyRate), a method that progressively prunes visual tokens based on attention distribution analysis, reducing computational demands while maintaining response quality.

Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods like FASTV and VTW have achieved notable results in reducing redundant visual tokens, but these approaches prune tokens in a single forward pass without systematically analyzing the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of the attention distribution reveals that the importance of visual tokens decreases throughout the generation process, which motivates a more aggressive compression rate as generation proceeds. By integrating a lightweight predictor that takes the attention distribution as input, our approach adjusts pruning rates flexibly at each stage of generation. Our experimental results demonstrate that our method not only reduces computational demands but also maintains the quality of responses.
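
To make the idea of a compression rate that tightens during generation concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: DyRate uses a lightweight learned predictor to choose the rate, whereas this sketch substitutes a hypothetical linear schedule. The names keep_ratio and prune_visual_tokens, the start and end ratios, and the example scores are all assumptions made for illustration.

```python
def keep_ratio(step, total_steps, start=1.0, end=0.25):
    """Hypothetical linear schedule (an assumption, not the paper's predictor).

    Returns the fraction of visual tokens to keep at a given decoding step,
    decaying from `start` at the first step to `end` at the last step.
    """
    if total_steps <= 1:
        return start
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def prune_visual_tokens(attn_scores, ratio):
    """Return the sorted indices of the visual tokens to keep.

    attn_scores: attention mass each visual token receives from the
    token currently being generated (illustrative values only).
    ratio: fraction of visual tokens to keep; at least one is always kept.
    """
    n_keep = max(1, round(len(attn_scores) * ratio))
    # Rank token indices from most to least attended.
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    # Restore positional order for the surviving tokens.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Made-up attention scores for eight visual tokens.
    scores = [0.30, 0.05, 0.20, 0.01, 0.25, 0.02, 0.10, 0.07]
    for step in (0, 5, 9):
        ratio = keep_ratio(step, total_steps=10)
        print(step, round(ratio, 2), prune_visual_tokens(scores, ratio))
```

In this sketch all eight tokens survive at the first step and only the two most attended ones remain at the last. A real decoder would also have to drop the pruned entries from its key-value cache, which the sketch leaves out.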
