CV | Jul 20, 2020

Learning Joint Spatial-Temporal Transformations for Video Inpainting

arXiv:2007.10247v1 | 346 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses inconsistent attention results and temporal artifacts in video inpainting, a problem that matters for video editing and restoration. It is an incremental improvement over existing attention-based methods.

The paper proposes a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. STTN fills missing regions in all input frames simultaneously using self-attention and is optimized with a spatial-temporal adversarial loss. The authors evaluate it quantitatively and qualitatively with standard stationary masks and more realistic moving object masks.

High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations by using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
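The core idea of attending over all frames at once, rather than completing a video frame by frame, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `joint_spatial_temporal_attention`, the toy patch features, and the use of identity query/key/value projections are all assumptions made for brevity. It only shows how patches from every frame can be flattened into one token sequence so that each patch attends to every other patch across space and time.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_spatial_temporal_attention(frames):
    """Attend over patches from ALL frames at once (illustrative only).

    frames: list of T frames, each a list of N patch feature vectors.
    Returns the same nested shape; each output patch is a weighted
    average of every patch in every frame.
    Assumption: identity query/key/value projections and equal patch
    counts per frame, neither of which is taken from the paper.
    """
    # Flatten (frame, patch) pairs into one token sequence of length T * N.
    tokens = [patch for frame in frames for patch in frame]
    dim = len(tokens[0])
    scale = math.sqrt(dim)

    outputs = []
    for query in tokens:
        # Scaled dot-product similarity between this patch and all patches.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in tokens
        ]
        weights = softmax(scores)
        # Weighted average of all patch features (values == tokens here).
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens))
             for i in range(dim)]
        )

    # Restore the per-frame structure.
    per_frame = len(frames[0])
    return [
        outputs[t * per_frame:(t + 1) * per_frame]
        for t in range(len(frames))
    ]


# Toy input: 2 frames, 2 patches per frame, 2-dimensional features.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
filled = joint_spatial_temporal_attention(frames)
```

Because the attention weights are computed over the flattened sequence, a patch in one frame can borrow content from any location in any other frame in a single step, which is the property the abstract contrasts with frame-by-frame completion.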

Code Implementations: 2 repos