CV · AI · Mar 6

Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

arXiv:2603.05950v1 · 2 citations
Predicted impact: top 71% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in VLMs for researchers and practitioners, offering an incremental improvement over fixed-budget methods.

The paper addresses inefficient visual token pruning in Vision-Language Models by proposing E-AdaPrune, an adaptive framework that sets the token budget according to image information density. Under matched average token budgets, it reports an average improvement of up to 0.6% and a +5.1% relative boost on the MMVet reasoning task.

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
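
The energy-preservation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `energy_token_budget`, the `energy_ratio` and `min_tokens` parameters, and the choice of squared singular values as "spectral energy" are all assumptions made for illustration. It also uses NumPy's exact SVD rather than the randomized SVD the abstract mentions, purely to keep the example short.

```python
import numpy as np


def energy_token_budget(features, energy_ratio=0.95, min_tokens=1):
    """Return the smallest k whose top-k singular values retain
    `energy_ratio` of the total spectral energy.

    `features` is a (num_tokens, dim) matrix of visual token features.
    Energy is taken to be the squared singular values (an assumption).
    """
    # Singular values only, returned in descending order.
    s = np.linalg.svd(features, compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        # Degenerate all-zero input: fall back to the minimum budget.
        return min_tokens
    cumulative = np.cumsum(energy) / total
    # First index where the retained energy reaches the target, as a count.
    k = int(np.searchsorted(cumulative, energy_ratio)) + 1
    # Clamp in case rounding keeps the cumulative sum just below the target.
    return max(min_tokens, min(k, len(s)))
```

With this sketch, a rank-one (highly redundant) feature matrix gets a budget of a single token, while a matrix of independent random features needs many more to reach the same energy ratio, which mirrors the abstract's claim that dense scenes receive more tokens than redundant ones. How the budget is then used to select which tokens to keep is not described in the abstract and is left out here.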

