CV, AI, CL | Jan 29

Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

arXiv:2601.21444v1 | h-index: 39 | Has Code
Originality: Highly original
AI Analysis

Spava addresses slow long-video understanding for users of Large Multimodal Models (LMMs), offering significant acceleration over prior methods.

The paper tackles the bottleneck of long-video inference efficiency in LMMs by proposing Spava, a sequence-parallel framework that distributes approximate attention across multiple GPUs. It reports speedups of 12.72x over FlashAttn, 1.70x over ZigZagRing, and 1.18x over APB, without notable performance loss.

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
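
The abstract names load balancing as one of the system-level optimizations but does not describe the scheme. As a rough illustration of why load balancing matters in sequence-parallel causal attention, the sketch below compares a contiguous split of the token sequence with the generic "zigzag" split used by zigzag-style ring attention. This is not Spava's actual method. The function names (causal_cost, contiguous_split, zigzag_split, per_gpu_cost) are hypothetical, the cost model simply counts the query-key pairs kept by a causal mask, and the sequence length is assumed to be divisible by twice the number of GPUs.

```python
def causal_cost(tokens):
    """Count query-key pairs kept by a causal mask for these query positions.

    A query at position t attends to keys 0..t, i.e. t + 1 keys.
    """
    return sum(t + 1 for t in tokens)


def contiguous_split(seq_len, num_gpus):
    """Give each GPU one contiguous block of query positions."""
    block = seq_len // num_gpus
    return [list(range(r * block, (r + 1) * block)) for r in range(num_gpus)]


def zigzag_split(seq_len, num_gpus):
    """Cut the sequence into 2 * num_gpus chunks; GPU r takes chunk r and
    its mirror chunk (2 * num_gpus - 1 - r), pairing cheap early tokens
    with expensive late tokens.

    Assumes seq_len is divisible by 2 * num_gpus (illustrative only).
    """
    chunk = seq_len // (2 * num_gpus)
    parts = []
    for r in range(num_gpus):
        mirror = 2 * num_gpus - 1 - r
        head = range(r * chunk, (r + 1) * chunk)
        tail = range(mirror * chunk, (mirror + 1) * chunk)
        parts.append(list(head) + list(tail))
    return parts


def per_gpu_cost(parts):
    """Causal attention cost for each GPU's share of the queries."""
    return [causal_cost(p) for p in parts]


if __name__ == "__main__":
    print(per_gpu_cost(contiguous_split(16, 4)))  # [10, 26, 42, 58]
    print(per_gpu_cost(zigzag_split(16, 4)))      # [34, 34, 34, 34]
```

With 16 tokens on 4 GPUs, the contiguous split leaves the last GPU with almost six times the work of the first, while the zigzag split gives every GPU the same cost. In a sequence-parallel prefill the slowest GPU sets the latency, so an imbalance like this directly limits the achievable speedup.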

