CVApr 14, 2021

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

arXiv:2104.06697v129 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of generating coherent video frames over very long time horizons, which is incremental as it builds on hierarchical approaches but extends prediction time significantly.

The paper tackles the problem of long-term video prediction by revisiting hierarchical models, using a method that predicts semantic structures first and then translates them to pixels, achieving successful prediction over thousands of frames on datasets like car driving and human dancing.

Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to destruction in structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation. Despite the simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. Full videos and codes are available at https://1konny.github.io/HVP/.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes