CV | May 24, 2025

ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

arXiv:2505.18757v2 | 5 citations | h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for users of large vision-language models, reporting a 2.6x speed-up while retaining 95.0% of model performance. The advance is incremental, since it builds on existing token pruning methods.

The paper tackles the problem of inefficient inference in large vision-language models by proposing ToDRE, a two-stage training-free framework that prunes visual tokens based on diversity and task relevance, achieving a 2.6x speed-up while maintaining 95.0% model performance.

Visual token pruning, which compresses and removes redundant visual tokens, plays a critical role in efficient inference with large vision-language models (LVLMs). However, most existing work estimates visual redundancy using a single metric, such as cross-modal attention or visual token similarity. We show that visual token diversity and task-specific token relevance are two crucial yet orthogonal factors that complement each other in conveying useful information and should therefore be treated separately for more effective visual token pruning. Building upon this insight, we design ToDRE, a two-stage, training-free framework that incorporates Token Diversity and task RElevance for effective token compression and efficient LVLM inference. Instead of pruning redundant tokens, we introduce a greedy max-sum diversification algorithm that selects and retains a subset of diverse and representative visual tokens after the vision encoder. On top of that, ToDRE leverages an "information migration" mechanism to eliminate task-irrelevant visual tokens within certain decoder layers of the large language model (LLM), further improving token pruning and LVLM inference. Extensive experiments show that ToDRE prunes 90% of visual tokens after the vision encoder as well as all visual tokens in certain LLM decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.0% of model performance and excellent model compatibility.
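
The greedy max-sum diversification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `greedy_max_sum_select`, the use of cosine distance, and the choice of seed token are all assumptions made for illustration. The abstract states only that a diverse subset is selected greedily after the vision encoder.

```python
import numpy as np


def greedy_max_sum_select(tokens: np.ndarray, k: int) -> np.ndarray:
    """Greedily pick k token indices that approximately maximise the sum
    of pairwise distances among the selected tokens.

    tokens: (n, d) array of visual token embeddings.
    Returns the selected indices in ascending order.
    """
    n = tokens.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of tokens")

    # Assumption: cosine distance measures how different two tokens are.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, 0.0)

    # Assumption: seed with the token farthest from all others in total.
    first = int(np.argmax(dist.sum(axis=1)))
    selected = [first]

    # gain[i] = sum of distances from token i to every selected token.
    gain = dist[first].copy()
    gain[first] = -np.inf  # never pick the same token twice

    while len(selected) < k:
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        gain += dist[nxt]  # entries already at -inf stay at -inf
        gain[nxt] = -np.inf

    return np.sort(np.array(selected))
```

To mirror the 90% pruning ratio reported above, one would call it with `k = max(1, round(0.1 * n))` for `n` encoder output tokens. The second stage, which drops visual tokens inside LLM decoder layers, is not covered by this sketch.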
