Spatial-Temporal Residual Aggregation for High Resolution Video Inpainting
This work addresses video inpainting for high-resolution content, offering an incremental improvement by refining low-resolution results with aggregated residuals.
The paper tackles high-resolution video inpainting, which existing methods cannot handle because of memory constraints, and proposes STRA-Net to produce more temporally coherent and visually appealing results than state-of-the-art approaches.
Recent learning-based inpainting algorithms have achieved compelling results in completing the regions left missing after undesired objects are removed from videos. To maintain temporal consistency across frames, these deep networks often rely heavily on 3D spatial and temporal operations. As a result, they usually suffer from memory constraints and can only handle low-resolution videos. We propose STRA-Net, a novel spatial-temporal residual aggregation framework for high-resolution video inpainting. The key idea is to first learn and apply a spatial and temporal inpainting network on the downsampled low-resolution videos. We then refine the low-resolution results by aggregating the learned spatial and temporal image residuals (details) into the upsampled inpainted frames. Both quantitative and qualitative evaluations show that our method produces more temporally coherent and visually appealing results than state-of-the-art methods on high-resolution video inpainting.
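The two-stage idea described in the abstract (inpaint at low resolution, then add high-frequency residuals back to the upsampled result) can be illustrated with a minimal NumPy sketch. This is not the STRA-Net implementation: the abstract gives no architectural details, so the learned inpainting network and the learned residual aggregation are both replaced here by a simple temporal mean over frames in which the same location is known. All function names (`downsample`, `upsample`, `temporal_fill`, `inpaint_high_res`), the scale factor `s`, and the toy video are assumptions made for illustration only.

```python
import numpy as np

def downsample(x, s):
    """Average-pool a (T, H, W) stack by an integer factor s."""
    t, h, w = x.shape
    return x.reshape(t, h // s, s, w // s, s).mean(axis=(2, 4))

def upsample(x, s):
    """Nearest-neighbour upsampling of a (T, H, W) stack by factor s."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def temporal_fill(values, hole):
    """Stand-in for a learned module (assumption): replace hole entries with
    the mean over the frames in which the same location is known."""
    known = 1.0 - hole
    mean_known = (values * known).sum(axis=0) / np.maximum(known.sum(axis=0), 1.0)
    return values * known + mean_known[None] * hole

def inpaint_high_res(frames, mask, s=4):
    """frames, mask: (T, H, W) float arrays; mask is 1 where pixels are missing."""
    masked = frames * (1.0 - mask)                 # hide the unknown content
    hole_lr = (downsample(mask, s) > 0).astype(frames.dtype)
    hole_hr = upsample(hole_lr, s)                 # block-aligned hole region

    # Stage 1: inpaint the downsampled video, then upsample the result.
    low = temporal_fill(downsample(masked, s), hole_lr)
    coarse = upsample(low, s)

    # Stage 2: residual = detail lost by down/upsampling; aggregate it
    # across frames and add it back inside the hole region.
    residual = masked - upsample(downsample(masked, s), s)
    residual = temporal_fill(residual, hole_hr)
    refined = coarse + residual

    return masked * (1.0 - hole_hr) + refined * hole_hr, coarse

# Toy example: a static textured scene with a hole that moves over time.
rng = np.random.default_rng(0)
T, H, W, s = 4, 16, 16, 4
frame = rng.random((H, W))
frames = np.tile(frame, (T, 1, 1))
mask = np.zeros((T, H, W))
for t in range(T):
    mask[t, 4:8, 4 * t:4 * t + 4] = 1.0

out, coarse = inpaint_high_res(frames, mask, s)
print("max error, coarse only:    ", np.abs(coarse - frames).max())
print("max error, with residuals: ", np.abs(out - frames).max())
```

In this toy setting every hole location is visible in some other frame, so adding the aggregated residuals recovers the detail that the upsampled low-resolution result alone loses. The expensive spatial-temporal step runs only on the `H/s × W/s` video, which is the memory argument the abstract makes; a learned model would replace `temporal_fill` in both stages.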