CV · AI · LG · Dec 11, 2023

Photorealistic Video Generation with Diffusion Models

CMU · DeepMind
arXiv:2312.06662v1 · 326 citations · h-index: 47 · ECCV
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating realistic videos for applications in media and AI, and it represents a strong incremental advance in diffusion-based video synthesis.

The paper tackles photorealistic video generation by introducing W.A.L.T, a transformer-based diffusion model that achieves state-of-the-art performance on video and image generation benchmarks without classifier-free guidance. For text-to-video generation, a cascade of models produces videos at 512 × 896 resolution and 8 fps.

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
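
To make the efficiency argument for window attention concrete, the following is a minimal sketch that counts query-key pairs for full self-attention versus attention restricted to non-overlapping windows. The grid size, the window shapes, and the helper names (`attention_pairs`, `windowed_attention_pairs`) are illustrative assumptions and are not taken from the paper; the sketch only shows why restricting attention to local windows can reduce cost.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs evaluated by one full self-attention over num_tokens."""
    return num_tokens * num_tokens


def windowed_attention_pairs(t: int, h: int, w: int,
                             wt: int, wh: int, ww: int) -> int:
    """Query-key pairs when a (t, h, w) token grid is split into
    non-overlapping (wt, wh, ww) windows and attention runs inside each."""
    if t % wt or h % wh or w % ww:
        raise ValueError("window shape must divide the token grid evenly")
    num_windows = (t // wt) * (h // wh) * (w // ww)
    return num_windows * attention_pairs(wt * wh * ww)


# Hypothetical latent grid: 4 frames of 16 x 16 tokens (assumed, not the paper's sizes).
T, H, W = 4, 16, 16

# Full attention: every token attends to every other token in the clip.
full = attention_pairs(T * H * W)

# Assumed "spatial" windows: each window covers one whole frame.
spatial = windowed_attention_pairs(T, H, W, 1, H, W)

# Assumed "spatiotemporal" windows: all frames, but only an 8 x 8 spatial patch.
spatiotemporal = windowed_attention_pairs(T, H, W, T, 8, 8)

print(f"full attention pairs:           {full}")
print(f"spatial window pairs:           {spatial}")
print(f"spatiotemporal window pairs:    {spatiotemporal}")
```

Under these assumed sizes, each windowed variant evaluates a quarter of the pairs that full attention does; the actual savings depend on the latent resolution and window shapes used in practice.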

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
