CV, Apr 16, 2025

Understanding Attention Mechanism in Video Diffusion Models

arXiv:2504.12027v2
4 citations
h-index: 7
Originality: Incremental advance
AI Analysis

For researchers and practitioners in video synthesis, this work addresses a gap in understanding how attention mechanisms behave in video diffusion models, and it offers practical improvements for video generation and editing.

The paper examines the unclear role of attention in text-to-video diffusion models through a perturbation analysis. It finds that high-entropy attention maps are often linked to superior video quality, while low-entropy maps are associated with intra-frame structure. Based on these findings, it proposes two lightweight methods that enhance video quality and enable text-guided editing.

Text-to-video (T2V) synthesis models, such as OpenAI's Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video's intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. Experiments across multiple datasets further validate the effectiveness of our methods.
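
The abstract distinguishes high-entropy from low-entropy attention maps without giving the formula here. As a rough illustration only, the sketch below assumes the usual reading: each row of a softmax attention map is a probability distribution over keys, and the map's entropy is the mean Shannon entropy of its rows. The function names, the toy score matrix, and the scaling factors are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over the rows of an attention map.

    Assumes each row of `attn` sums to 1, as it does after a softmax.
    This averaging is an illustrative choice, not the paper's definition.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())


# Toy query-key scores for 16 tokens (hypothetical data).
rng = np.random.default_rng(0)
num_tokens = 16
scores = rng.normal(size=(num_tokens, num_tokens))

# Scaling the scores down spreads attention out (higher entropy);
# scaling them up concentrates it on a few keys (lower entropy).
diffuse_map = softmax(0.1 * scores)
peaked_map = softmax(10.0 * scores)

print("diffuse map entropy:", attention_entropy(diffuse_map))
print("peaked map entropy: ", attention_entropy(peaked_map))
print("upper bound log(N): ", np.log(num_tokens))
```

Under this assumed definition, a map that spreads weight evenly across keys approaches the upper bound log(N), while a map that attends almost exclusively to one key per query approaches zero.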
