CV · Dec 28, 2021

Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

Philipp Harzig, Moritz Einfalt, Rainer Lienhart

arXiv:2112.14088v1 · 2.64 citations

Originality: Incremental advance

AI Analysis

This work addresses video description generation, with applications such as assisting visually impaired people. The advance is incremental: it builds on existing Transformer methods and adds a new audio-video synchronization technique.

The paper tackles video-to-text translation by developing a Transformer architecture with a novel Fractional Positional Encoding (FPE) method that synchronizes audio and video features. It improves CIDEr and BLEU-4 scores by 37.13 and 12.83 points over a vanilla Transformer baseline and achieves state-of-the-art results on the MSR-VTT and MSVD datasets.

Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT is still lacking. Moreover, there is no comprehensive study of strategies and advice for video description generation that includes exploiting the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration that transfers to unseen datasets and helps describe short video clips in natural language. This configuration improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points compared to a vanilla Transformer network and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. FPE alone increases the CIDEr score by a relative 8.6%.
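
The abstract names FPE but does not give its formula. As a minimal sketch of one plausible reading, the Python below evaluates the standard sinusoidal positional encoding at non-integer positions, so that audio frames sampled at a different rate than the video frames land on the same time axis. The function names, the linear rescaling by `num_video / num_audio`, and the frame counts are illustrative assumptions, not the authors' implementation.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Standard Transformer sinusoidal encoding at one position.

    The position may be fractional; nothing in the formula requires an integer.
    """
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc[:d_model]


def fractional_positions(num_video, num_audio):
    """Map audio frame indices onto the video index axis.

    Assumption: both streams cover the same clip, so a linear rescaling
    by num_video / num_audio aligns them in time.
    """
    scale = num_video / num_audio
    return [j * scale for j in range(num_audio)]


# Hypothetical clip: 4 video frames and 8 audio frames over the same duration.
video_pos = list(range(4))
audio_pos = fractional_positions(num_video=4, num_audio=8)

video_enc = [sinusoidal_encoding(p, d_model=8) for p in video_pos]
audio_enc = [sinusoidal_encoding(p, d_model=8) for p in audio_pos]
```

Under this assumption, an audio frame and a video frame that cover the same moment receive identical encodings (here, audio index 2 and video index 1), which is the kind of alignment the abstract describes.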
