CV · CL · May 25, 2025

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

arXiv:2505.19155v1 · 2 citations · h-index: 6 · ACL
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges for users of Video-LLMs processing long videos, though it is an incremental improvement over existing acceleration methods.

The paper tackles the high inference latency that long input sequences cause in video large language models (Video-LLMs). It introduces Sparse-to-Dense (StD), a decoding strategy that pairs a sparse attention module with a dense one to achieve up to a 1.94× walltime speedup without performance loss.

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
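
The draft-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention, the sine-based `embed` function, the sizes `VOCAB` and `DIM`, the draft length `gamma`, and greedy verification are all assumptions made for illustration. The point it shows is that when a top-K attention drafter proposes tokens and a dense-attention pass checks each one, the final output matches plain dense greedy decoding.

```python
import math

VOCAB, DIM = 8, 4  # toy sizes, chosen arbitrarily for this sketch


def embed(tok):
    # Deterministic stand-in for a learned embedding (hypothetical).
    return [math.sin((tok + 1) * (d + 1)) for d in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, top_k=None):
    """Single-head attention. top_k=None is dense; otherwise only the
    top_k highest-scoring keys take part in the softmax."""
    scores = [dot(query, k) for k in keys]
    idx = list(range(len(keys)))
    if top_k is not None and top_k < len(keys):
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in idx)  # subtract max for numerical stability
    w = [math.exp(scores[i] - m) for i in idx]
    z = sum(w)
    return [sum(wi * values[i][d] for wi, i in zip(w, idx)) / z
            for d in range(DIM)]


def next_token(tokens, top_k=None):
    # Greedy next token from a one-layer toy "model" (assumption).
    embs = [embed(t) for t in tokens]
    out = attend(embs[-1], embs, embs, top_k)
    logits = [dot(out, embed(v)) for v in range(VOCAB)]
    return max(range(VOCAB), key=lambda v: logits[v])


def dense_decode(prompt, n_new):
    # Baseline: ordinary greedy decoding with dense attention.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


def std_decode(prompt, n_new, gamma=4, top_k=2):
    """Draft gamma tokens with sparse attention, then verify with dense."""
    tokens, target = list(prompt), len(prompt) + n_new
    while len(tokens) < target:
        # 1) Fast path: the sparse model speculates gamma tokens.
        draft = []
        for _ in range(gamma):
            draft.append(next_token(tokens + draft, top_k))
        # 2) Slow path: the dense model checks every draft position.
        #    A real system would do this in one parallel forward pass;
        #    the loop here only mimics the accept/reject logic.
        base = list(tokens)
        for i in range(gamma + 1):
            dense_tok = next_token(base + draft[:i])
            tokens.append(dense_tok)  # always keep the dense model's token
            if i == gamma or dense_tok != draft[i]:
                break  # all drafts accepted (bonus token) or first mismatch
    return tokens[:target]


print(std_decode([1, 2, 3], 12) == dense_decode([1, 2, 3], 12))  # True
```

Because every emitted token is the dense model's own greedy choice, the sparse drafter affects only how many tokens are accepted per verification step, not what is generated. Any speedup in practice depends on the acceptance rate and on how much cheaper the sparse pass is than the dense one, which this toy does not measure.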

