CV · Oct 16, 2025

Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

arXiv:2510.14624v1 · 4 citations · h-index: 8
Originality: Incremental advance
AI Analysis

The work addresses scalability issues for researchers and practitioners who run vision-language models on long videos. The advance is incremental: it builds on existing methods and offers a plug-and-play approach.

The paper tackles the high computational cost and token-budget limits of video-language models by introducing Efficient Video Sampling (EVS), a method that prunes temporally redundant tokens and cuts LLM time-to-first-token (TTFT) by up to 4x with minimal accuracy loss.

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
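
The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_static_patches`, the patch size, the mean-absolute-difference metric, and the threshold value are all assumptions made for illustration. The sketch marks a patch as static when it barely changes relative to the same location in the previous frame, keeps every patch of the first frame, and returns the original (frame, row, column) index of each surviving patch so that positional identity is preserved.

```python
import numpy as np


def prune_static_patches(frames, patch=14, threshold=0.05):
    """Sketch of temporal patch pruning (illustrative, not the paper's code).

    frames:    float array of shape (T, H, W, C) with values in [0, 1];
               H and W must be multiples of `patch`.
    patch:     side length of a square patch in pixels (assumed value).
    threshold: minimum mean absolute change for a patch to count as
               non-static (assumed metric and value).

    Returns (keep, positions):
      keep      -- bool array (T, H // patch, W // patch), True = token kept
      positions -- int array (N, 3) of (frame, row, col) for kept patches
    """
    T, H, W, C = frames.shape
    gh, gw = H // patch, W // patch

    # View each frame as a (gh, gw) grid of patch x patch tiles.
    tiles = frames.reshape(T, gh, patch, gw, patch, C)

    # Mean absolute change of each tile relative to the previous frame.
    diff = np.abs(tiles[1:] - tiles[:-1]).mean(axis=(2, 4, 5))  # (T-1, gh, gw)

    keep = np.ones((T, gh, gw), dtype=bool)  # the first frame is kept in full
    keep[1:] = diff > threshold              # later frames keep changed tiles

    # Original grid coordinates of the kept patches; pruning removes tokens
    # but does not renumber the positions of the ones that remain.
    positions = np.argwhere(keep)
    return keep, positions
```

For a clip of shape `(T, H, W, C)`, `keep.mean()` gives the fraction of visual tokens that would still be passed to the language model, and `positions` can be used to look up the unchanged position embedding for each kept token.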

