CV · May 15, 2022

Video Frame Interpolation with Transformer

arXiv:2205.07230v1 · 8 citations · h-index: 106
Originality: Highly original
AI Analysis

This work improves video frame interpolation for applications like video editing and slow-motion generation, though it is incremental as it builds on existing deep learning methods.

The paper tackles video frame interpolation by addressing the limitation of convolutional networks in handling large motion, introducing a Transformer-based framework with a cross-scale window attention mechanism that achieves new state-of-the-art results on various benchmarks.

Video frame interpolation (VFI), which aims to synthesize intermediate frames of a video, has made remarkable progress with the development of deep convolutional networks over the past years. Existing methods built upon convolutional networks generally struggle to handle large motion because convolution operations are local. To overcome this limitation, we introduce a novel framework that uses a Transformer to model long-range pixel correlation among video frames. Further, our network is equipped with a novel cross-scale window-based attention mechanism, in which windows at different scales interact with each other. This design effectively enlarges the receptive field and aggregates multi-scale information. Extensive quantitative and qualitative experiments demonstrate that our method achieves new state-of-the-art results on various benchmarks.
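
To make the cross-scale window idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and every function name in it is hypothetical. It makes several simplifying assumptions: a single feature map instead of features from two frames, a single attention head, identity query/key/value projections, 2x average pooling for the coarse scale, and a fixed pairing of each fine window with the coarse window that spatially covers it. Tokens in each fine window attend to their own window and to the coarse window, so every query sees a region twice as wide at no extra window size.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) map into non-overlapping w x w windows.

    Returns an array of shape (H // w, W // w, w * w, C).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // w, W // w, w * w, C)

def avg_pool2(x):
    """Downsample an (H, W, C) map by 2 with average pooling."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_window_attention(x, w):
    """Illustrative cross-scale window attention (assumed design, see lead-in).

    x: (H, W, C) feature map with H and W divisible by 2 * w.
    Returns the attended map (H, W, C) and the attention weights.
    """
    H, W, C = x.shape
    fine = window_partition(x, w)               # (nh, nw, w*w, C)
    coarse = window_partition(avg_pool2(x), w)  # (nh//2, nw//2, w*w, C)
    nh, nw = fine.shape[:2]

    # Pair fine window (i, j) with coarse window (i // 2, j // 2),
    # which covers a 2w x 2w region of the original map around it.
    rows = np.arange(nh)[:, None] // 2
    cols = np.arange(nw)[None, :] // 2
    coarse_for_fine = coarse[rows, cols]        # (nh, nw, w*w, C)

    # Keys/values: tokens from both scales. Queries: fine tokens only.
    kv = np.concatenate([fine, coarse_for_fine], axis=2)  # (nh, nw, 2*w*w, C)
    scores = fine @ kv.transpose(0, 1, 3, 2) / np.sqrt(C)
    attn = softmax(scores, axis=-1)             # (nh, nw, w*w, 2*w*w)
    out = attn @ kv                             # (nh, nw, w*w, C)

    # Undo the window partition to recover the (H, W, C) layout.
    out = out.reshape(nh, nw, w, w, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C), attn
```

Calling `cross_scale_window_attention(x, 2)` on an `(8, 8, 4)` array returns an output of the same shape, with each query holding 8 attention weights: 4 for its own window and 4 for the coarse one. A real model would add learned projections, multiple heads, and cross-frame inputs.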

Code Implementations: 1 repo
