CV · Dec 13, 2024

MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

arXiv:2412.09828v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work targets the computational cost of video diffusion models and is aimed at researchers and practitioners in generative AI. It appears incremental, since it builds on existing diffusion transformer methods.

The authors address the challenge of generating high-resolution videos with rich semantics and complex motion using diffusion transformers. They propose a Multi-Scale Spatio-Temporal Causal Attention framework which, according to their theoretical analysis, reduces computational complexity and supports autoregressive video generation without violating frame order.

Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Like language, video data are auto-regressive by nature, so it is counter-intuitive to use an attention mechanism with bi-directional dependencies in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high and low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates at different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.
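To make the contrast between bi-directional and causal attention over frames concrete, here is a minimal sketch of a frame-level causal mask. It is an illustration only, not the paper's MSC design: the function names `frame_causal_mask` and `attended_pairs`, the frame-by-frame token layout, and the choice to let tokens within one frame see each other are all assumptions made for this example. The multi-scale and frequency-based parts of the framework are not modeled here.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask over video tokens (hypothetical helper).

    Tokens are assumed to be laid out frame by frame. mask[q][k] is True
    when query token q may attend to key token k, i.e. when k belongs to
    the same frame as q or to an earlier frame.
    """
    n = num_frames * tokens_per_frame
    mask = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        # A key is visible if its frame index does not exceed the query's.
        row = [(k // tokens_per_frame) <= q_frame for k in range(n)]
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the query-key pairs the mask allows (a rough cost proxy)."""
    return sum(sum(row) for row in mask)


# Toy example: 3 frames with 2 tokens each gives 6 tokens in total.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
causal_pairs = attended_pairs(mask)          # pairs allowed under the causal mask
full_pairs = len(mask) * len(mask)           # pairs under bi-directional attention
print(causal_pairs, full_pairs)
```

In this toy setting the causal mask keeps 24 of the 36 query-key pairs. The savings the abstract refers to come from the multi-scale construction rather than from masking alone, so this sketch only shows how frame order can be enforced inside attention.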

