CV AI May 4, 2022

Video Extrapolation in Space and Time

arXiv:2205.02084v3, 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the disjoint treatment of spatial and temporal scene understanding in computer vision, offering a unified approach relevant to applications such as robotics and AR/VR. It is an incremental advance because it builds on the existing novel view synthesis (NVS) and video prediction (VP) tasks.

The paper combines novel view synthesis (NVS) and video prediction (VP) into a single task, Video Extrapolation in Space and Time (VEST), in which a model must both synthesize novel views and predict future frames. It reports performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.

Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals to obtain a scene representation, as viewpoint changes from spatial observations inform depth, and temporal observations inform the motion of cameras and individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and the complementary cues from both tasks, while existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.

