LG · CV · Dec 20, 2014

Video (language) modeling: a baseline for generative models of natural videos

arXiv:1412.6604v5 · 482 citations
Originality: Incremental advance
AI Analysis

This work provides a baseline for generative models of natural videos, addressing the challenge of representing complex deformations and motion patterns in an unsupervised manner.

The paper tackles the problem of unsupervised feature learning from video data by adapting language modeling techniques to predict missing or future frames, demonstrating that the model can predict non-trivial motions in short video sequences after training on natural videos.

We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
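The quantize-then-predict idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes plain k-means for the patch dictionary and a first-order (bigram) count model over patch indices at each spatial location, standing in for the language-model-style predictors the abstract refers to. All function names, the patch size `p`, and the dictionary size `k` are hypothetical choices made for illustration, and the input is random data rather than natural video.

```python
import numpy as np

def extract_patches(frames, p):
    """Split a (T, H, W) video into non-overlapping p x p patches.
    Returns an array of shape (T, H//p, W//p, p*p)."""
    T, H, W = frames.shape
    gh, gw = H // p, W // p
    x = frames[:, :gh * p, :gw * p].reshape(T, gh, p, gw, p)
    return x.transpose(0, 1, 3, 2, 4).reshape(T, gh, gw, p * p)

def quantize(vectors, centroids):
    """Map each patch vector to the index of its nearest centroid."""
    d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def fit_codebook(vectors, k, iters=10, seed=0):
    """Build a k-entry patch dictionary with plain Lloyd's k-means
    (an assumed stand-in for whatever quantizer the paper uses)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vectors), size=k, replace=False)
    centroids = vectors[idx].astype(float)
    for _ in range(iters):
        ids = quantize(vectors, centroids)
        for j in range(k):
            members = vectors[ids == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_bigram(tokens, k):
    """Estimate P(token at t+1 | token at t) at the same grid location,
    with add-one smoothing. tokens has shape (T, gh, gw)."""
    counts = np.ones((k, k))
    np.add.at(counts, (tokens[:-1].ravel(), tokens[1:].ravel()), 1)
    return counts / counts.sum(axis=1, keepdims=True)

def predict_next_frame(last_tokens, probs, centroids, p):
    """Pick the most likely next token per location and decode it
    back to pixels by pasting the corresponding centroid patch."""
    gh, gw = last_tokens.shape
    nxt = probs[last_tokens].argmax(axis=-1)
    patches = centroids[nxt].reshape(gh, gw, p, p)
    return patches.transpose(0, 2, 1, 3).reshape(gh * p, gw * p)

# Toy usage on random data (a placeholder for real video frames).
rng = np.random.default_rng(0)
video = rng.random((6, 8, 8))          # 6 frames of 8 x 8 pixels
p, k = 4, 3                            # hypothetical patch and dictionary sizes
patches = extract_patches(video, p)    # shape (6, 2, 2, 16)
flat = patches.reshape(-1, p * p)
centroids = fit_codebook(flat, k)
tokens = quantize(flat, centroids).reshape(patches.shape[:3])
probs = fit_bigram(tokens, k)
pred = predict_next_frame(tokens[-1], probs, centroids, p)
```

Once frames are reduced to a grid of discrete patch indices, predicting the next frame becomes a classification problem over dictionary entries, which is what makes language-modeling machinery applicable. A filling variant would condition on the tokens on both sides of a missing frame instead of only the preceding one.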

Code Implementations: 1 repo
