CV · May 27

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

arXiv:2605.28051 · 65.7 · h-index: 3
Predicted impact top 45% in CV · last 90 days
Originality Highly original
AI Analysis

This work addresses the unreliable learning of token importance in VLM pruning by providing a fully differentiable optimization path, improving efficiency without sacrificing accuracy.

DiffPrune reformulates visual token pruning as continuous control of token information, achieving 96.5% accuracy retention and 2.85x LLM prefill acceleration across ten VLM benchmarks.

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
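The abstract gives no formula for the Information Throttler, so the following is only a minimal sketch of one way variance-preserving, score-conditioned noise could look. The mixing rule `x' = sqrt(s) * x + sqrt(1 - s) * eps`, the function names `throttle` and `prune`, and the `threshold` value are all assumptions made for illustration and are not taken from the paper. Under this assumed rule, a score near 1 leaves the token almost intact, a score near 0 replaces it with noise, and a unit-variance input keeps unit variance because the squared coefficients sum to 1.

```python
import math
import random


def throttle(tokens, scores, rng):
    """Training-time sketch: mix each token with Gaussian noise.

    Assumed form (not from the source):
        x' = sqrt(s) * x + sqrt(1 - s) * eps,  eps ~ N(0, 1)
    where s in [0, 1] is the token's importance score.
    The output is a smooth function of s, so gradients can reach the
    scores without a surrogate such as Gumbel-Softmax.
    """
    out = []
    for x, s in zip(tokens, scores):
        keep = math.sqrt(s)         # weight on the original token
        noise = math.sqrt(1.0 - s)  # weight on the injected noise
        out.append([keep * v + noise * rng.gauss(0.0, 1.0) for v in x])
    return out


def prune(tokens, scores, threshold=0.5):
    """Inference-time sketch: hard-threshold the scores and drop the rest.

    The threshold of 0.5 is a hypothetical default.
    """
    return [x for x, s in zip(tokens, scores) if s >= threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[1.0, -2.0], [0.5, 0.25], [3.0, 0.0]]
    scores = [1.0, 0.1, 0.6]
    print(throttle(tokens, scores, rng))  # first token is unchanged (s = 1)
    print(prune(tokens, scores))          # keeps the first and third tokens
```

In a real training setup the same rule would be written in an autodiff framework so that the scores receive gradients; plain Python is used here only to keep the sketch dependency-free.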

