CV · Jan 23, 2024

Lumiere: A Space-Time Diffusion Model for Video Generation

arXiv:2401.12945v2 · 491 citations · h-index: 33 · SIGGRAPH Asia
AI Analysis

This paper addresses the problem of global temporal consistency in video generation for content creation applications.

The authors tackle the challenge of generating realistic, diverse, and coherent motion in text-to-video synthesis by introducing Lumiere, a space-time diffusion model that generates an entire video in a single pass, and they report state-of-the-art text-to-video results.

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
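The effect of temporal down- and up-sampling described above can be illustrated with simple shape arithmetic. The sketch below is not the authors' implementation: the helper names, the factor-of-2 pooling, the two levels, and the 80-frame 128x128 clip are all assumptions chosen for illustration. It only tracks how the activation shape `(frames, height, width)` changes along a U-Net-style path, with and without temporal down-sampling.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width)


def downsample(shape: Shape, temporal: bool = True) -> Shape:
    """Halve height and width; also halve the frame count when temporal=True.

    The factor of 2 is an assumption made for this sketch.
    """
    t, h, w = shape
    return (t // 2 if temporal else t, h // 2, w // 2)


def unet_shapes(shape: Shape, levels: int, temporal: bool = True) -> List[Shape]:
    """Shapes along a down-then-up path; the final shape equals the input shape."""
    down = [shape]
    for _ in range(levels):
        down.append(downsample(down[-1], temporal))
    # Mirror the contracting path (excluding the bottleneck) for the expanding path.
    return down + down[-2::-1]


def volume(shape: Shape) -> int:
    """Number of space-time positions at a given level."""
    t, h, w = shape
    return t * h * w


# Hypothetical clip size, used only for illustration.
clip = (80, 128, 128)

space_time = unet_shapes(clip, levels=2)                  # pools frames, height, width
space_only = unet_shapes(clip, levels=2, temporal=False)  # pools height and width only

for st, so in zip(space_time, space_only):
    print(f"space-time {st}    space-only {so}")
```

Under these assumptions the full-frame-rate clip enters and leaves the network at its original length, while the coarsest level holds a quarter as many space-time positions as a spatial-only pyramid of the same depth. This is one way to read the abstract's phrase "processing it in multiple space-time scales".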
