OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
This work is aimed at researchers and practitioners in video processing and multimodal AI. It addresses two known bottlenecks of sparse attention, the training-inference gap and the lack of fine-grained token selection, and proposes a novel method rather than an incremental improvement.
The paper tackles the suboptimal performance and limited acceleration of sparse attention methods for long-video multimodal large language models (MLLMs). It introduces OmniSparse, a training-aware fine-grained sparse attention framework that matches full-attention performance while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
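The shared KV budget in mechanism (2) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "flatness" is measured by how many top-weighted keys a head needs before its cumulative attention mass reaches a recall threshold, so that the flattest head is the one needing the most keys. The function names, the `recall` parameter, and the toy attention weights are all hypothetical.

```python
from typing import List, Tuple


def keys_needed(probs: List[float], recall: float) -> int:
    """Smallest k such that the k largest weights sum to at least `recall`."""
    total = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= recall - 1e-12:  # small tolerance for float rounding
            return k
    return len(probs)


def select_kv(head_probs: List[List[float]],
              recall: float = 0.5) -> Tuple[int, List[List[int]]]:
    """Pick one shared budget from the flattest head, then keep the
    top-`budget` key indices in every head (assumed selection rule)."""
    # The flattest head needs the most keys to reach the recall target.
    budget = max(keys_needed(p, recall) for p in head_probs)
    selected = []
    for probs in head_probs:
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        selected.append(sorted(order[:budget]))
    return budget, selected


# Toy attention weights for two heads over eight keys (illustrative only).
heads = [
    [0.60, 0.10, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02],  # peaked head
    [0.10, 0.10, 0.15, 0.15, 0.15, 0.15, 0.10, 0.10],  # flatter head
]
budget, selected = select_kv(heads, recall=0.5)
print(budget, selected)
```

In this toy case the peaked head alone would need only one key, but the flatter head needs four, so under the stated assumption every head keeps four keys. Applying the larger budget uniformly costs some sparsity in peaked heads in exchange for meeting the recall target in every head.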