Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models
This work addresses the unreliable learning of token importance in VLM pruning by providing a fully differentiable optimization path, improving efficiency while retaining most of the full-model accuracy.
DiffPrune reformulates visual token pruning as continuous control of token information, achieving 96.5% accuracy retention and 2.85x LLM prefill acceleration across ten VLM benchmarks.
Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is then driven by surrogate gradients rather than by gradients of the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. Because this design operates directly on token representations, it naturally provides a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
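To make the throttling idea concrete, here is a minimal sketch of one way score-conditioned, variance-preserving noise and inference-time thresholding could look. The abstract does not give the exact formulation, so the mixing rule `sqrt(s) * x + sqrt(1 - s) * eps`, the function names `throttle` and `prune`, and the threshold value 0.5 are all assumptions made for illustration, not the authors' implementation. Under this assumed rule, the variance is preserved when token features have unit variance and the noise is independent of them.

```python
import math
import random


def throttle(tokens, scores, rng):
    """Training-time sketch (assumed form): mix each token with Gaussian noise.

    tokens: list of token vectors (lists of floats)
    scores: importance scores in [0, 1], one per token
    """
    out = []
    for vec, s in zip(tokens, scores):
        keep = math.sqrt(s)          # weight on the original token
        noise = math.sqrt(1.0 - s)   # weight on the noise; keep**2 + noise**2 == 1
        out.append([keep * v + noise * rng.gauss(0.0, 1.0) for v in vec])
    return out


def prune(tokens, scores, threshold=0.5):
    """Inference-time sketch: hard-threshold the scores (threshold is hypothetical)."""
    return [vec for vec, s in zip(tokens, scores) if s >= threshold]


rng = random.Random(0)
tokens = [[1.0, -1.0], [0.5, 2.0], [-0.3, 0.7]]
scores = [1.0, 0.0, 0.8]  # hand-picked here; learned in the actual method

noisy = throttle(tokens, scores, rng)  # score 1.0 passes through, score 0.0 becomes pure noise
kept = prune(tokens, scores)           # drops the token whose score is below the threshold
```

In this sketch the output is a smooth function of each score for scores strictly between 0 and 1, which is the property that would let gradients reach the scores directly instead of passing through a relaxed discrete sample. An autodiff framework would be needed to actually train the scores.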