CV · Sep 28, 2022

DeViT: Deformed Vision Transformers in Video Inpainting

Jiayin Cai, Changlin Li, Xin Tao, Chun Yuan, Yu-Wing Tai

arXiv:2209.13925v1 · 11.719 citations · h-index: 72

Originality: Incremental advance

AI Analysis

This paper addresses video inpainting for media editing applications. It is an incremental advance that introduces novel components.

The paper tackles video inpainting by introducing DeViT, which combines Deformed Patch-based Homography, Mask Pruning-based Patch Attention, and a Spatial-Temporal weighting Adaptor to improve feature alignment and matching. It claims state-of-the-art results, but the abstract does not state the quantitative gains.

This paper proposes a novel video inpainting method. We make three main contributions: First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH), which improves patch-level feature alignments without additional supervision and benefits challenging scenes with various deformations. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching by pruning out less essential features and using a saliency map. MPPA enhances matching accuracy between warped tokens with invalid pixels. Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens under the guidance of the Deformation Factor learned from DePtH, especially for videos with agile motions. Experimental results demonstrate that our method outperforms recent methods qualitatively and quantitatively and achieves a new state-of-the-art.
