Video Probabilistic Diffusion Models in Projected Latent Space
This work addresses the computational inefficiency of video generation for AI and media applications; it is an incremental advance that makes existing diffusion models more efficient.
The paper tackles the challenge of synthesizing high-resolution, temporally coherent videos by proposing a projected latent video diffusion model (PVDM) that learns video distributions in a low-dimensional latent space. PVDM achieves an FVD score (lower is better) of 639.7 on the UCF-101 long video (128 frames) generation benchmark, improving on the prior state-of-the-art score of 1773.4.
Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos remains a challenge due to their high dimensionality and complex temporal dynamics, along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation and memory inefficiency that limits their scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model that learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space, together with a training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM compared with previous video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, improving on the prior state-of-the-art score of 1773.4.
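To make the efficiency argument concrete, the sketch below compares the number of elements in a dense cubic latent with the number in a set of 2D-shaped latents. It is an illustration only, not the authors' implementation: the abstract does not state how many 2D latents are used or their shapes, so the layout here (one plane per pair of axes, three in total), the function names, and the example grid size are all assumptions.

```python
def cubic_latent_size(frames: int, height: int, width: int, channels: int = 1) -> int:
    """Number of elements in a dense 3D (frames x height x width) latent."""
    return channels * frames * height * width


def projected_latent_size(frames: int, height: int, width: int, channels: int = 1) -> int:
    """Number of elements in three 2D planes, one per pair of axes.

    Assumed layout (not taken from the abstract):
    height x width, frames x width, and frames x height.
    """
    return channels * (height * width + frames * width + frames * height)


if __name__ == "__main__":
    # Hypothetical latent grid: 16 frames at 32 x 32 spatial resolution.
    s, h, w = 16, 32, 32
    cubic = cubic_latent_size(s, h, w)
    projected = projected_latent_size(s, h, w)
    print(f"cubic latent elements:     {cubic}")
    print(f"projected latent elements: {projected}")
    print(f"reduction factor:          {cubic / projected:.1f}x")
```

Under these assumptions, the cubic latent grows with the product of the three axis lengths, while the projected latent grows only with the sum of their pairwise products. That gap is what lets a diffusion model operate on 2D-shaped inputs at a much lower memory cost.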