CV | Mar 29, 2022

VPTR: Efficient Transformers for Video Prediction

arXiv:2203.15836v1 | 34 citations | Has Code
Originality: Incremental advance
AI Analysis

This work targets the efficiency and accuracy of video prediction for applications such as robotics and autonomous driving. It is an incremental advance, since it builds on existing Transformer methods.

The paper tackles future-frame video prediction by proposing an efficient Transformer block built on local spatial-temporal separation attention, reaching performance competitive with state-of-the-art models.

In this paper, we propose a new Transformer block for video future frames prediction based on an efficient local spatial-temporal separation attention mechanism. Based on this new Transformer block, a fully autoregressive video future frames prediction Transformer is proposed. In addition, a non-autoregressive video prediction Transformer is also proposed to increase the inference speed and reduce the accumulated inference errors of its autoregressive counterpart. In order to avoid the prediction of very similar future frames, a contrastive feature loss is applied to maximize the mutual information between predicted and ground-truth future frame features. This work is the first that makes a formal comparison of the two types of attention-based video future frames prediction models over different scenarios. The proposed models reach a performance competitive with more complex state-of-the-art models. The source code is available at https://github.com/XiYe20/VPTR.
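The abstract does not state the exact form of the contrastive feature loss. As a rough illustration only, the sketch below uses an InfoNCE-style loss, a common choice whose minimization maximizes a lower bound on the mutual information between paired features. The function name `info_nce_loss`, the `temperature` value, and the flat `(N, D)` feature shape are assumptions made for this example and are not taken from the paper or its repository.

```python
import numpy as np


def info_nce_loss(pred, target, temperature=0.1):
    """InfoNCE-style loss for two (N, D) feature arrays.

    Row i of `pred` and row i of `target` form a positive pair;
    every other row of `target` acts as a negative for row i.
    (Illustrative assumption, not the paper's exact loss.)
    """
    # L2-normalise so that dot products are cosine similarities.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = pred @ target.T / temperature  # shape (N, N)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


# Toy features standing in for predicted / ground-truth frame features.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
aligned = target + 0.01 * rng.normal(size=(8, 16))  # close to the ground truth
shuffled = np.roll(target, 1, axis=0)               # paired with the wrong frame

print("aligned loss :", info_nce_loss(aligned, target))
print("shuffled loss:", info_nce_loss(shuffled, target))
```

In this toy setup, predictions that match their own ground-truth features score a lower loss than predictions paired with the wrong frame, which is the behaviour a loss of this kind is meant to reward.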

Code Implementations: 1 repo