CV · LG · Mar 2, 2021

Predicting Video with VQVAE

arXiv:2103.01950v1 · 79 citations
Originality: Incremental advance
AI Analysis

This work addresses video prediction for unconstrained, large-scale datasets, representing an incremental improvement in resolution and scalability over prior methods.

The paper tackles video prediction by compressing high-resolution videos into discrete latent variables with VQ-VAE, which lets scalable autoregressive models forecast future frames in the compressed space. It predicts video at 256x256 resolution on diverse datasets such as Kinetics-600 and validates the results through human evaluation.

In recent years, the task of video prediction (forecasting future video given past video frames) has attracted attention in the research community. In this paper we propose a novel approach to this problem with Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video. In contrast to previous work that has largely emphasized highly constrained datasets, we focus on very diverse, large-scale datasets such as Kinetics-600. To our knowledge, we predict video on unconstrained videos at a higher resolution, 256x256, than any previous method. We further validate our approach against prior work via a crowdsourced human evaluation.
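
The quantization step at the core of a VQ-VAE can be illustrated with a minimal sketch: each continuous encoder output is replaced by the index of its nearest codebook vector, and those discrete indices are what an autoregressive model would then predict. The function name `quantize`, the toy 2-dimensional codebook, and the sample latents are assumptions made for illustration only; they are not the paper's architecture, codebook size, or hierarchy.

```python
def quantize(latents, codebook):
    """Nearest-neighbour vector quantization (illustrative sketch only).

    latents:  list of continuous vectors, standing in for encoder outputs
    codebook: list of embedding vectors with the same dimensionality
    Returns (indices, quantized): the discrete code chosen for each latent
    and the codebook vector that replaces it.
    """
    indices = []
    for z in latents:
        # Squared Euclidean distance from this latent to every codebook entry
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy values, assumed for illustration; real codebooks are learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 3, 2]
```

Each latent vector is reduced to a single integer here, which is the sense in which the discrete latent space is far smaller than the pixel space it encodes.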

Code Implementations: 1 repo
