CV · AI · Mar 6

Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

arXiv:2603.05950v1 · 6.92 citations · h-index: 14

Predicted impact: top 71% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in VLMs for researchers and practitioners, offering an incremental improvement over fixed-budget methods.

The paper tackles inefficient visual token pruning in Vision-Language Models by proposing E-AdaPrune, an adaptive framework that sets the token budget for each image according to its information density. Under matched average token budgets, it reports an average improvement of up to 0.6% and a +5.1% relative boost on the MMVet reasoning task.

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
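The abstract's budget rule can be illustrated with a minimal sketch. It is an assumption-laden reading of the abstract, not the authors' implementation: the function name `energy_token_budget`, the `energy_ratio` default of 0.95, and the choice to use the retained rank directly as the token count are all hypothetical. It also uses a full SVD from NumPy for simplicity, whereas the paper reports using a randomized SVD to keep latency low.

```python
import numpy as np


def energy_token_budget(features, energy_ratio=0.95, min_tokens=1):
    """Return a token budget from the singular value spectrum.

    features: array of shape (num_tokens, dim) holding visual token features.
    energy_ratio: fraction of spectral energy to preserve (hypothetical default).
    The budget is the smallest rank k whose top-k squared singular values
    reach `energy_ratio` of the total; mapping rank to token count one-to-one
    is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=np.float64)
    # Full SVD, singular values only (sorted in descending order).
    singular_values = np.linalg.svd(features, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0.0:
        # Degenerate input (all zeros): fall back to the minimum budget.
        return min_tokens
    cumulative = np.cumsum(energy) / total
    # First index where cumulative energy reaches the target, as a count.
    k = int(np.searchsorted(cumulative, energy_ratio)) + 1
    # Clamp to the valid range of ranks.
    k = min(k, singular_values.shape[0])
    return max(min_tokens, k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A redundant "image": 64 tokens that span only a 2-dimensional subspace.
    redundant = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
    # A dense "image": 64 tokens with independent random features.
    dense = rng.standard_normal((64, 32))
    print("redundant budget:", energy_token_budget(redundant))
    print("dense budget:", energy_token_budget(dense))
```

In this sketch a redundant input concentrates its energy in a few singular values and so receives a small budget, while a dense input needs more components to reach the same energy fraction. Which tokens are then kept is not covered here; the budget would presumably be handed to a separate token-selection step.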
