CV · Jun 22, 2018

Video Inpainting by Jointly Learning Temporal Structure and Spatial Details

arXiv:1806.08482v2 · 179 citations
AI Analysis

This paper addresses the problem of recovering missing regions in videos, for applications such as video editing. Its contribution is an incremental improvement in method design.

The paper tackles video inpainting by proposing a deep learning architecture with two sub-networks for temporal structure and spatial detail recovery, jointly trained end-to-end, and shows it outperforms previous learning-based methods on three datasets.

We present a new data-driven video inpainting method for recovering missing regions of video frames. A novel deep learning architecture is proposed which contains two sub-networks: a temporal structure inference network and a spatial detail recovering network. The temporal structure inference network is built upon a 3D fully convolutional architecture: it only learns to complete a low-resolution video volume, given the expensive computational cost of 3D convolution. The low-resolution result provides temporal guidance to the spatial detail recovering network, which performs image-based inpainting with a 2D fully convolutional network to produce recovered video frames in their original resolution. Such a two-step network design ensures both the spatial quality of each frame and the temporal coherence across frames. Our method jointly trains both sub-networks in an end-to-end manner. We provide qualitative and quantitative evaluation on three datasets, demonstrating that our method outperforms previous learning-based video inpainting methods.
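The two-stage data flow described in the abstract can be illustrated with a toy, pure-Python sketch. This is not the authors' implementation: the learned 3D and 2D networks are replaced by trivial stand-ins (a per-pixel temporal mean and a simple composite), and every function name here is hypothetical. It only shows how a low-resolution, temporally completed volume could guide full-resolution, per-frame filling.

```python
# Toy data-flow sketch. The stand-ins below are NOT the paper's networks.
# A video is a list of T frames; each frame is an H x W list of floats.
# A mask has the same shape, with 1 marking a missing pixel and 0 a known one.

def downsample(frame, r):
    """Average-pool an H x W grid by an integer factor r (H and W divisible by r)."""
    h, w = len(frame), len(frame[0])
    return [[sum(frame[y * r + dy][x * r + dx]
                 for dy in range(r) for dx in range(r)) / (r * r)
             for x in range(w // r)]
            for y in range(h // r)]

def upsample(frame, r):
    """Nearest-neighbour upsampling by an integer factor r."""
    return [[frame[y // r][x // r] for x in range(len(frame[0]) * r)]
            for y in range(len(frame) * r)]

def complete_low_res(low_video, low_masks):
    """Stand-in for the 3D temporal structure network (assumption):
    fill each missing low-res pixel with the mean of the known values
    at the same location in the other frames."""
    t, h, w = len(low_video), len(low_video[0]), len(low_video[0][0])
    out = [[row[:] for row in frame] for frame in low_video]
    for y in range(h):
        for x in range(w):
            known = [low_video[k][y][x] for k in range(t)
                     if low_masks[k][y][x] == 0]
            fill = sum(known) / len(known) if known else 0.0
            for k in range(t):
                if low_masks[k][y][x] > 0:  # any missing pixel in the block
                    out[k][y][x] = fill
    return out

def recover_frame(frame, mask, guidance):
    """Stand-in for the 2D spatial detail network (assumption):
    keep known pixels, copy the upsampled guidance into the hole."""
    return [[guidance[y][x] if mask[y][x] else frame[y][x]
             for x in range(len(frame[0]))]
            for y in range(len(frame))]

def inpaint(video, masks, r=2):
    """Stage 1 on a low-res volume, then stage 2 per full-res frame."""
    low_video = [downsample(f, r) for f in video]
    low_masks = [downsample(m, r) for m in masks]
    low_done = complete_low_res(low_video, low_masks)
    return [recover_frame(f, m, upsample(g, r))
            for f, m, g in zip(video, masks, low_done)]
```

In the paper both stages are learned and trained jointly end-to-end; the stand-ins above keep only the shape of the pipeline, in which the expensive temporal reasoning runs on the downsampled volume and the per-frame stage works at the original resolution.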

