Video Ladder Networks
This work addresses video frame prediction for computer vision applications, but its contribution is incremental, since it builds on existing encoder-decoder and residual network architectures.
The authors tackle the problem of efficiently generating future video frames by introducing the Video Ladder Network (VLN), a neural encoder-decoder model with recurrent and feedforward lateral connections. It achieves competitive results on the Moving MNIST dataset with a simple structure and fast inference.
We present the Video Ladder Network (VLN) for efficiently generating future video frames. VLN is a neural encoder-decoder model augmented at all layers by both recurrent and feedforward lateral connections. At each layer, these connections form a lateral recurrent residual block, where the feedforward connection represents a skip connection and the recurrent connection represents the residual. Thanks to the recurrent connections, the decoder can exploit temporal summaries generated from all layers of the encoder. This way, the top layer is relieved from the pressure of modeling lower-level spatial and temporal details. Furthermore, we extend the basic version of VLN to incorporate ResNet-style residual blocks in the encoder and decoder, which help improve the prediction results. VLN is trained in a self-supervised regime on the Moving MNIST dataset, achieving competitive results while having a very simple structure and providing fast inference.
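
The lateral recurrent residual block described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain tanh recurrent cell on flat feature vectors in place of whatever recurrent unit the model actually uses, and it assumes the skip path and the recurrent path are merged by addition. The class name, weight names, and dimensions are all hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LateralRecurrentResidualBlock:
    """Toy lateral block for one layer: output_t = x_t (skip) + h_t (residual).

    Assumptions (not taken from the source): a simple tanh recurrent cell
    and an additive merge of the two lateral paths.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        # Hypothetical weights: input-to-hidden and hidden-to-hidden.
        self.W_in = [[rng.uniform(-scale, scale) for _ in range(dim)]
                     for _ in range(dim)]
        self.W_rec = [[rng.uniform(-scale, scale) for _ in range(dim)]
                      for _ in range(dim)]
        self.h = [0.0] * dim  # temporal summary carried across frames

    def reset(self):
        """Clear the temporal summary before a new video sequence."""
        self.h = [0.0] * len(self.h)

    def step(self, x):
        """Consume encoder features x for one frame; return decoder input."""
        pre = [a + b for a, b in zip(matvec(self.W_in, x),
                                     matvec(self.W_rec, self.h))]
        self.h = [math.tanh(p) for p in pre]            # recurrent path
        return [xi + hi for xi, hi in zip(x, self.h)]   # skip + residual


if __name__ == "__main__":
    block = LateralRecurrentResidualBlock(dim=4)
    frames = [[0.5, -0.2, 0.1, 0.3], [0.4, -0.1, 0.2, 0.3]]
    for t, features in enumerate(frames):
        print(t, block.step(features))
```

In a full ladder, one such block would sit at every encoder layer, so the decoder receives a temporal summary at each resolution rather than only from the top layer.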