CV | AI | Mar 22, 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Microsoft | Peking U
arXiv:2303.12346v1 | 298 citations | h-index: 54
Originality: Incremental advance
AI Analysis

NUWA-XL addresses the inefficiency and quality issues of long video generation for applications such as entertainment or simulation, though it is an incremental advance because it builds on existing diffusion models.

The paper tackles extremely long video generation with NUWA-XL, a Diffusion over Diffusion architecture whose coarse-to-fine process generates video segments in parallel, reducing the training-inference gap and producing high-quality results with global and local coherence. Experiments show it cuts the average inference time from 7.55 minutes to 26 seconds (a 94.26% reduction) when generating 1024 frames.
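The reported reduction can be checked with a few lines of arithmetic. This is only a sanity check on the two timings quoted above; the variable names are illustrative and not taken from the paper.

```python
# Timings as quoted in the summary (average inference time for 1024 frames).
baseline_seconds = 7.55 * 60   # 7.55 min reported for the baseline
parallel_seconds = 26.0        # 26 s reported for NUWA-XL

# Relative reduction in inference time, and the equivalent speedup factor.
reduction_pct = (baseline_seconds - parallel_seconds) / baseline_seconds * 100
speedup = baseline_seconds / parallel_seconds

print(f"{reduction_pct:.2f}% less time, {speedup:.1f}x faster")
# 94.26% less time, 17.4x faster
```

The 94.26% figure therefore corresponds to roughly a 17x speedup on the same hardware.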

In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to a gap between training on short videos and inference on long videos, and the sequential generation is inefficient. Instead, our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build the FlintstonesHD dataset, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55 min to 26 s (by 94.26%) at the same hardware setting when generating 1024 frames. The homepage link is https://msra-nuwa.azurewebsites.net/
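The coarse-to-fine idea in the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation and runs no diffusion model: it only computes which frame ranges a global pass and the subsequent local passes would cover. The function names, the even keyframe spacing, and the `frames_per_model` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple


def keyframe_positions(start: int, end: int, num: int) -> List[int]:
    """Evenly spaced frame indices from start to end, inclusive (hypothetical helper)."""
    if end - start < num - 1:
        # Fewer frames than requested: every frame in the range is produced.
        return list(range(start, end + 1))
    step = (end - start) / (num - 1)
    return sorted({round(start + i * step) for i in range(num)})


def coarse_to_fine_schedule(
    total_frames: int, frames_per_model: int
) -> List[List[Tuple[int, int]]]:
    """Return, per level, the (start, end) ranges a model call would cover.

    Level 0 stands for the global pass over the whole time range; each later
    level stands for local passes between adjacent frames of the previous level.
    """
    if frames_per_model < 3:
        raise ValueError("frames_per_model must be at least 3")
    levels = []
    segments = [(0, total_frames - 1)]
    while segments:
        levels.append(segments)
        next_segments = []
        # Segments within one level do not depend on each other,
        # so a real system could process them in parallel.
        for start, end in segments:
            frames = keyframe_positions(start, end, frames_per_model)
            for a, b in zip(frames, frames[1:]):
                if b - a > 1:  # frames are still missing between a and b
                    next_segments.append((a, b))
        segments = next_segments
    return levels


levels = coarse_to_fine_schedule(total_frames=65, frames_per_model=5)
print([len(level) for level in levels])  # [1, 4, 16]
```

In this toy setting, 65 frames are covered by 21 model calls arranged in only 3 sequential levels, which is the property that makes parallel generation possible; a segment-by-segment approach would instead need one sequential step per segment.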
