Hierarchical Model for Long-term Video Prediction
This work addresses blur and disintegration issues in video prediction for computer vision applications, but it is largely adopted from prior work, making it incremental.
The paper tackles long-term video prediction by using a hierarchical approach that first estimates high-level structure and then recovers realistic images, demonstrating good results on the Penn Action dataset.
Video prediction has been an active research topic in the past few years. Many algorithms focus on pixel-level prediction, which generates results that blur and disintegrate within a few frames. In this project, we use a hierarchical approach for long-term video prediction. We first estimate high-level structure in the input frame, then predict how that structure evolves in the future. Finally, we use an image analogy network to recover a realistic image from the predicted structure. Our method is largely adopted from the work of Villegas et al. It combines LSTMs with analogy-based convolutional auto-encoder networks. To generate more realistic frame predictions, we also adopt an adversarial loss. We evaluate our method on the Penn Action dataset and demonstrate good results on high-level long-term structure prediction.
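The three stages described above (estimate structure, predict its future, recover an image by analogy) can be illustrated with a minimal data-flow sketch. It is not the authors' implementation and rests on several assumptions: the high-level structure is taken to be a list of 2D keypoints; a constant-velocity extrapolation stands in for the learned LSTM; the analogy is formed by additive arithmetic in feature space, which is one common formulation and is not confirmed by the abstract; and all names (`predict_poses`, `analogy_features`, `predict_frames`, `f_img`, `f_pose`, `decode`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pose = List[Tuple[float, float]]  # assumed structure: 2D keypoints
Vec = List[float]                 # toy feature vector


def predict_poses(history: Sequence[Pose], steps: int) -> List[Pose]:
    """Stand-in for the learned sequence model (an LSTM in the abstract).

    Assumption for illustration only: each keypoint keeps the velocity
    observed between the last two poses.
    """
    if len(history) < 2:
        raise ValueError("need at least two observed poses")
    prev, last = history[-2], history[-1]
    velocity = [(lx - px, ly - py) for (px, py), (lx, ly) in zip(prev, last)]
    return [
        [(lx + k * vx, ly + k * vy) for (lx, ly), (vx, vy) in zip(last, velocity)]
        for k in range(1, steps + 1)
    ]


def analogy_features(img_feat: Vec, pose_now: Vec, pose_future: Vec) -> Vec:
    """Feature-space analogy: pose_now is to pose_future as image_now is to ?.

    Assumed additive form: image features shifted by the change in pose features.
    """
    return [i + (f - n) for i, n, f in zip(img_feat, pose_now, pose_future)]


def predict_frames(
    last_frame: Vec,
    history: Sequence[Pose],
    steps: int,
    f_img: Callable[[Vec], Vec],    # hypothetical image encoder
    f_pose: Callable[[Pose], Vec],  # hypothetical structure encoder
    decode: Callable[[Vec], Vec],   # hypothetical image decoder
) -> List[Vec]:
    """Decode every future frame from the last observed frame and a predicted pose."""
    img_feat = f_img(last_frame)
    now_feat = f_pose(history[-1])
    return [
        decode(analogy_features(img_feat, now_feat, f_pose(pose)))
        for pose in predict_poses(history, steps)
    ]
```

In this sketch, every predicted frame is decoded from the last observed frame together with a predicted structure, so generated pixels are never fed back as input. Only the structure prediction is rolled forward in time.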