Cross-Attention Transformer for Video Interpolation
This work addresses video frame interpolation for applications such as video editing and compression. It presents an incremental improvement: a transformer module combined with an attention mechanism that refines interpolated frames efficiently, without flow estimation.
The authors tackle video frame interpolation by proposing TAIN, a residual neural network that uses a novel Cross Similarity transformer module and an Image Attention module to refine its predictions. TAIN performs comparably to flow-based methods while remaining computationally efficient in inference time on the Vimeo90k, UCF101, and SNU-FILM benchmarks.
We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation, which aims to interpolate an intermediate frame given the two consecutive frames surrounding it. We first present a novel vision transformer module, named Cross Similarity (CS), that globally aggregates input image features similar in appearance to those of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module that allows the network to focus on CS features from one frame over those of the other. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods while being computationally efficient in terms of inference time on the Vimeo90k, UCF101, and SNU-FILM benchmarks.
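
The CS and IA ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that CS behaves like standard scaled dot-product cross-attention (queries from the predicted frame's features, keys and values from one input frame's features) and that IA reduces to a per-position softmax over the two frames. The function names, the scoring rule inside `image_attention`, and the flattened `(N, C)` feature layout are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_similarity(pred_feat, in_feat):
    """Assumed form of CS: global cross-attention.

    pred_feat: (N, C) features of the predicted interpolated frame (queries).
    in_feat:   (N, C) features of one input frame (keys and values).
    Returns (N, C): for each predicted position, a similarity-weighted
    aggregate of input features over all N spatial positions.
    """
    c = pred_feat.shape[1]
    scores = pred_feat @ in_feat.T / np.sqrt(c)  # (N, N) similarities
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ in_feat

def image_attention(cs0, cs1, pred_feat):
    """Assumed form of IA: per-position weighting of the two frames.

    The score (dot product with the predicted feature) is a hypothetical
    stand-in; a learned scoring network would normally take its place.
    """
    s0 = (cs0 * pred_feat).sum(axis=1)
    s1 = (cs1 * pred_feat).sum(axis=1)
    w = softmax(np.stack([s0, s1], axis=1), axis=1)  # (N, 2)
    blended = w[:, :1] * cs0 + w[:, 1:] * cs1
    return blended, w

def refine(pred_feat, feat0, feat1):
    # Residual refinement: add the blended CS features to the prediction.
    cs0 = cross_similarity(pred_feat, feat0)
    cs1 = cross_similarity(pred_feat, feat1)
    blended, w = image_attention(cs0, cs1, pred_feat)
    return pred_feat + blended, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, c = 16, 8  # e.g. a 4x4 feature map flattened to 16 positions
    pred, f0, f1 = (rng.standard_normal((n, c)) for _ in range(3))
    refined, frame_weights = refine(pred, f0, f1)
    print(refined.shape, frame_weights.shape)
```

In this sketch, a position that is occluded in one input frame would ideally receive a low weight for that frame, so the refinement draws mostly on the other frame's CS features.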