CV | Dec 28, 2021

Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg

arXiv:2112.14100v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

This work addresses video captioning for multimedia applications. Its contribution is incremental: it applies existing Transformer-based methods to the TRECVID-VTT task.

The researchers tackled the Video-to-Text (VTT) task by adapting Transformer-based architectures. They found that traditional image captioning pipelines performed poorly, that switching to Transformers significantly improved results, and that fine-tuning with self-critical sequence training boosted validation performance.

The Multimedia and Computer Vision Lab of the University of Augsburg participated in the VTT task only. We use the VATEX and TRECVID-VTT datasets for training our VTT models. We base our model on the Transformer approach for both of our submitted runs. For our second model, we adapt the X-Linear Attention Networks for Image Captioning, which does not yield the desired bump in scores. For both models, we pretrain on the complete VATEX dataset and 90% of the TRECVID-VTT dataset, using the remaining 10% for validation. We fine-tune both models with self-critical sequence training, which boosts the validation performance significantly. Overall, we find that training a Video-to-Text system on traditional Image Captioning pipelines delivers very poor performance. When switching to a Transformer-based architecture, our results greatly improve and the generated captions match the corresponding video better.
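As a rough illustration of the self-critical sequence training step mentioned in the abstract, the sketch below computes the usual self-critical loss: the reward of a sampled caption is compared against the reward of the greedily decoded caption for the same video, and the difference weights the log-probability of the sampled tokens. This is a minimal sketch under stated assumptions, not the team's implementation. The function name `scst_loss` and its arguments are hypothetical, and the rewards are assumed to be precomputed by a caption metric that the abstract does not name.

```python
from typing import Sequence


def scst_loss(
    sample_logprobs: Sequence[Sequence[float]],
    sample_rewards: Sequence[float],
    greedy_rewards: Sequence[float],
) -> float:
    """Mean self-critical loss over a batch (hypothetical helper).

    sample_logprobs: per-token log-probabilities of each sampled caption.
    sample_rewards:  metric score of each sampled caption (assumed precomputed).
    greedy_rewards:  metric score of the greedy caption, used as the baseline.
    """
    n = len(sample_rewards)
    if n == 0:
        raise ValueError("empty batch")
    if not (len(sample_logprobs) == n == len(greedy_rewards)):
        raise ValueError("batch sizes differ")

    total = 0.0
    for logprobs, r_sample, r_greedy in zip(
        sample_logprobs, sample_rewards, greedy_rewards
    ):
        # Positive advantage: the sampled caption beat the greedy baseline.
        advantage = r_sample - r_greedy
        # Minimizing this term raises the likelihood of better-than-baseline
        # samples and lowers it for worse-than-baseline ones.
        total += -advantage * sum(logprobs)
    return total / n


if __name__ == "__main__":
    # One sampled caption with two tokens, scoring above its greedy baseline.
    print(scst_loss([[-0.5, -0.5]], [1.0], [0.4]))
```

In an automatic-differentiation framework the advantage would be treated as a constant, so that gradients flow only through the log-probabilities.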

