CVNov 15, 2025

OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

arXiv:2511.12201v26.21 citationsh-index: 8

Originality Highly original

AI Analysis

This work addresses the training-inference gap and fine-grained token selection for researchers and practitioners in video processing and multimodal AI, representing a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the problem of suboptimal performance and limited acceleration in sparse attention methods for long-video multimodal large language models by introducing OmniSparse, a training-aware fine-grained sparse attention framework, which matches full attention performance while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.

View on arXiv PDF

Similar