LG · CV · May 18, 2025

STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference

arXiv:2505.12359v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users of large vision-language models, offering a plug-and-play solution that is incremental but effective in reducing computational overhead.

The paper tackles the computational overhead of visual tokens in large vision-language models by proposing STAR, a two-stage token pruning framework that reduces inference cost while maintaining or improving performance across multiple benchmarks.

Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing on either visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved, performance.
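The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean attention received per visual token), the keep ratios, and every function name below are assumptions chosen for illustration. For simplicity it also assumes both attention matrices are already available as nested lists, whereas in a real model the cross-modal attention would come from a later layer than the visual self-attention.

```python
def column_means(matrix):
    """Mean attention each column (token) receives across all rows (queries)."""
    rows = len(matrix)
    return [sum(row[j] for row in matrix) / rows for j in range(len(matrix[0]))]


def top_k_indices(scores, keep_ratio):
    """Indices of the highest-scoring tokens, returned in original token order."""
    keep = max(1, min(len(scores), round(keep_ratio * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def two_stage_prune(visual_self_attn, text_to_visual_attn, early_keep, late_keep):
    """Hypothetical two-stage pruning over N visual tokens.

    visual_self_attn:    N x N, entry [i][j] = attention from visual token i to j.
    text_to_visual_attn: T x N, entry [t][j] = attention from text token t to
                         visual token j.
    early_keep, late_keep: assumed keep ratios in (0, 1] for each stage.
    """
    # Stage 1 (assumed rule): keep the visual tokens that receive the most
    # attention from other visual tokens.
    stage1 = top_k_indices(column_means(visual_self_attn), early_keep)

    # Stage 2 (assumed rule): among the survivors, keep the tokens that the
    # text tokens attend to most.
    cross = [[row[j] for j in stage1] for row in text_to_visual_attn]
    local = top_k_indices(column_means(cross), late_keep)
    stage2 = [stage1[j] for j in local]
    return stage1, stage2


# Toy example: 4 visual tokens, 2 text tokens, keep half at each stage.
self_attn = [[0.4, 0.1, 0.3, 0.2]] * 4
cross_attn = [[0.1, 0.2, 0.6, 0.1],
              [0.2, 0.1, 0.5, 0.2]]
kept_early, kept_late = two_stage_prune(self_attn, cross_attn, 0.5, 0.5)
print(kept_early, kept_late)
```

In the toy example, stage 1 keeps the two visual tokens with the highest self-attention scores, and stage 2 then keeps whichever of those the text tokens attend to most. Applying the second criterion only to the survivors of the first is what distinguishes a staged scheme from a single-stage one.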

