CV · Mar 10

FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

arXiv:2603.09721v1 · 8.8 · h-index: 13
Predicted impact: top 62% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient spatio-temporal modeling in video generation for AI and multimedia applications, representing an incremental improvement over existing methods.

The paper tackles efficient high-fidelity video generation with diffusion models by proposing Matrix Attention, a frame-level temporal attention mechanism, and the FrameDiT architectures. It reports state-of-the-art results with improved temporal coherence and video quality at an efficiency comparable to Local Factorized Attention.

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
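To make the idea of attending across frames rather than tokens concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the abstract does not specify the "matrix-native operations", so the bilinear projection `left @ X @ right`, the Frobenius inner product used as the frame-to-frame score, the shared `left` matrix, and all function and variable names are assumptions chosen for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_attention(frames, left, right_q, right_k, right_v):
    """Attend over whole frames; each frame is an N x D matrix of tokens.

    ASSUMPTION: queries, keys and values are produced by a bilinear map
    left @ X @ right. The paper's actual operations may differ.
    """
    queries = [matmul(matmul(left, x), right_q) for x in frames]
    keys = [matmul(matmul(left, x), right_k) for x in frames]
    values = [matmul(matmul(left, x), right_v) for x in frames]

    # ASSUMPTION: one scalar score per frame pair, scaled by matrix size.
    scale = math.sqrt(len(queries[0]) * len(queries[0][0]))
    rows, cols = len(values[0]), len(values[0][0])

    outputs, weights = [], []
    for q in queries:
        w = softmax([frobenius_inner(q, k) / scale for k in keys])
        # Each output frame is a weighted sum of whole value matrices.
        out = [
            [sum(w[u] * values[u][i][j] for u in range(len(values))) for j in range(cols)]
            for i in range(rows)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_matrix(rows, cols, rng):
    """Small random matrix for the toy example."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, N, D = 4, 6, 8  # frames, tokens per frame, channels (toy sizes)
frames = [random_matrix(N, D, rng) for _ in range(T)]
left = random_matrix(N, N, rng)
right_q, right_k, right_v = (random_matrix(D, D, rng) for _ in range(3))
outputs, weights = frame_level_attention(frames, left, right_q, right_k, right_v)
```

In this sketch the attention map has only T x T entries, one per pair of frames, whereas attention over all T·N spatio-temporal tokens scores (T·N)² token pairs. That gap is one plausible reading of why a frame-level mechanism could stay close to factorized attention in cost while still mixing information across every frame.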

