CV · Oct 21, 2025

MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

arXiv:2510.18692v1 · 11 citations · h-index: 9
Originality: Highly original
AI Analysis

This work addresses the attention-efficiency problem facing researchers and practitioners in video generation by enabling longer sequences without blockwise constraints, though it is incremental in that it builds on existing sparse attention methods.

The paper tackles the quadratic scaling bottleneck of full attention in Diffusion Transformers for long video generation by introducing Mixture-of-Groups Attention (MoGA), which uses a learnable token router to enable efficient sparse attention and end-to-end generation of minute-level, 480p videos at 24 fps with a context length of approximately 580k.

Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
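The routing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear router, the hard argmax assignment, and the names `route_tokens`, `grouped_attention`, and `attended_pairs` are all assumptions made for illustration. Each token is assigned to one group, and dense attention then runs only among tokens that share a group, which is presumably what lets the per-group attention call be served by a standard kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(x, router_weight):
    # Hypothetical router: one linear layer, hard argmax over groups.
    # (Assumption: the paper's router may differ, and a hard argmax
    # is not trainable as written.)
    logits = x @ router_weight          # shape (N, G)
    return logits.argmax(axis=-1)       # shape (N,)

def grouped_attention(q, k, v, groups, num_groups):
    # Dense attention restricted to tokens that share a group.
    out = np.zeros_like(v)
    d = q.shape[-1]
    for g in range(num_groups):
        idx = np.nonzero(groups == g)[0]
        if idx.size == 0:
            continue
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        out[idx] = softmax(scores) @ v[idx]
    return out

def attended_pairs(groups, num_groups):
    # Number of query-key pairs actually scored: sum of squared group sizes.
    counts = np.bincount(groups, minlength=num_groups)
    return int((counts ** 2).sum())

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, D, G = 64, 16, 4
x = rng.standard_normal((N, D))
router_weight = rng.standard_normal((D, G))
wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))

groups = route_tokens(x, router_weight)
q, k, v = x @ wq, x @ wk, x @ wv
out = grouped_attention(q, k, v, groups, G)

print("pairs scored:", attended_pairs(groups, G), "of", N * N)
```

With perfectly balanced groups the number of scored pairs falls from N² to N²/G, which is where the savings over full attention come from in this sketch; unbalanced routing reduces that benefit.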

