CV · Nov 25, 2022

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

arXiv:2211.14308v3 · 7 citations · h-index: 151
Originality: Highly original
AI Analysis

This paper addresses the problem of generating realistic future video frames, with applications in autonomous driving and motion analysis, and proposes a novel method rather than an incremental improvement.

The paper tackles future video frame prediction by decomposing images into object layers with a shared structure and predicting parametric geometric transformations for each layer, achieving state-of-the-art performance by significant margins on benchmarks such as Cityscapes and UCF-Sports.
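To make the idea of a per-layer parametric transformation concrete, here is a minimal sketch assuming each layer's motion is an affine transform fitted by least squares to that layer's control points. The constant-velocity `extrapolate` function is a hypothetical stand-in for WALDO's learned predictor, every function name is illustrative, and the paper's actual parameterization may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def _solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate (e.g. collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> Affine:
    """Least-squares affine map sending src onto dst (needs >= 3 non-collinear points)."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xtu = [0.0] * 3
    xtv = [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                xtx[i][j] += row[i] * row[j]
            xtu[i] += row[i] * u
            xtv[i] += row[i] * v
    # Normal equations (X^T X) p = X^T u, solved once per output coordinate.
    r0 = _solve3(xtx, xtu)
    r1 = _solve3(xtx, xtv)
    return ((r0[0], r0[1], r0[2]), (r1[0], r1[1], r1[2]))


def apply_affine(a: Affine, p: Point) -> Point:
    """Map a point through x' = M x + t."""
    (m00, m01, tx), (m10, m11, ty) = a
    x, y = p
    return (m00 * x + m01 * y + tx, m10 * x + m11 * y + ty)


def extrapolate(prev: Sequence[Point], curr: Sequence[Point]) -> List[Point]:
    """Hypothetical stand-in for a learned predictor: constant-velocity motion."""
    return [(2 * cx - px, 2 * cy - py) for (px, py), (cx, cy) in zip(prev, curr)]


def predict_layer_transform(prev: Sequence[Point], curr: Sequence[Point]) -> Affine:
    """Affine transform carrying a layer from the current frame to the predicted next one."""
    return fit_affine(curr, extrapolate(prev, curr))


if __name__ == "__main__":
    prev = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    curr = [(1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
    a = predict_layer_transform(prev, curr)
    print(apply_affine(a, (5.0, 5.0)))  # approximately (6.0, 5.0)
```

For a layer whose control points move one pixel to the right per frame, the fitted transform is (up to rounding) a pure one-pixel translation, so every pixel of that layer is predicted to keep moving right. Fitting a low-dimensional transform per layer, rather than a free per-pixel flow, is what keeps the motion of each object coherent.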

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into three steps: discovering the layers associated with past frames; predicting the corresponding transformations for upcoming frames and warping the associated object regions accordingly; and filling in the remaining image parts. Extensive experiments on multiple benchmarks, including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage https://16lemoing.github.io/waldo.
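The warping and filling steps can be illustrated with a toy, pure-Python sketch on a tiny grayscale image. It assumes layers are given front to back as `(mask, affine)` pairs, uses backward warping with nearest-neighbour sampling, and replaces the learned inpainting stage with a hypothetical neighbour-averaging `fill_holes`. None of these names come from the WALDO code; this is an illustration under stated assumptions, not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple

Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]
Image = List[List[float]]
Mask = List[List[bool]]
Frame = List[List[Optional[float]]]


def invert_affine(a: Affine) -> Affine:
    """Invert x' = M x + t, returning (M^-1, -M^-1 t)."""
    (p, q, tx), (r, s, ty) = a
    det = p * s - q * r
    if abs(det) < 1e-12:
        raise ValueError("affine transform is not invertible")
    ip, iq, ir, is_ = s / det, -q / det, -r / det, p / det
    return ((ip, iq, -(ip * tx + iq * ty)), (ir, is_, -(ir * tx + is_ * ty)))


def warp_layers(image: Image, layers: Sequence[Tuple[Mask, Affine]]) -> Frame:
    """Backward-warp each layer; earlier (front) layers occlude later ones. None marks a hole."""
    h, w = len(image), len(image[0])
    inverses = [(mask, invert_affine(a)) for mask, a in layers]
    out: Frame = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for mask, ((a, b, c), (d, e, f)) in inverses:
                # Which past-frame pixel lands on target (x, y) under this layer's motion?
                sx = math.floor(a * x + b * y + c + 0.5)
                sy = math.floor(d * x + e * y + f + 0.5)
                if 0 <= sx < w and 0 <= sy < h and mask[sy][sx]:
                    out[y][x] = image[sy][sx]
                    break  # front-most layer wins
    return out


def fill_holes(frame: Frame) -> Frame:
    """Placeholder for inpainting: average the known 4-neighbours of each hole."""
    h, w = len(frame), len(frame[0])
    filled = [row[:] for row in frame]
    for y in range(h):
        for x in range(w):
            if frame[y][x] is None:
                nbrs = [
                    frame[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < h and 0 <= nx < w and frame[ny][nx] is not None
                ]
                if nbrs:
                    filled[y][x] = sum(nbrs) / len(nbrs)
    return filled


if __name__ == "__main__":
    past = [[10.0, 20.0, 30.0, 40.0]]
    obj_mask = [[False, True, False, False]]  # object layer: pixel 1
    bg_mask = [[True, False, True, True]]     # background layer: everything else
    shift_right = ((1.0, 0.0, 1.0), (0.0, 1.0, 0.0))
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    warped = warp_layers(past, [(obj_mask, shift_right), (bg_mask, identity)])
    print(warped)              # [[10.0, None, 20.0, 40.0]]
    print(fill_holes(warped))  # [[10.0, 15.0, 20.0, 40.0]]
```

In the usage example, the object moves one pixel to the right and uncovers pixel 1, which no layer explains. That disocclusion hole is exactly the kind of region the final filling step has to synthesize.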

Code Implementations: 1 repo
