CV · Jun 12, 2020

Video Understanding as Machine Translation

arXiv:2006.07203v2 · 29 citations
Originality: Incremental advance
AI Analysis

This work aims to simplify and improve video understanding by offering a single unified framework that avoids complex selection of positive and negative samples. The contribution is incremental, as it builds on existing multimodal learning approaches.

The paper tackles self-supervised video representation learning with a generative modeling approach that frames the objective as translation between modalities. This eliminates the need for negative sampling, and the authors report performance gains over state-of-the-art methods on video classification, question answering, captioning, and retrieval.

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
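
To make the contrast in the abstract concrete, here is a minimal sketch of the two kinds of objective. It is not the paper's implementation: the function names `contrastive_loss` and `translation_loss`, the InfoNCE-style form of the contrastive loss, and the toy inputs are all assumptions chosen for illustration. The point it shows is structural: the contrastive loss is computed over a batch and relies on the other items as negatives, while the translation loss is a token-level cross-entropy that depends only on a single (video, text) pair.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def contrastive_loss(similarity):
    """InfoNCE-style loss (illustrative assumption, not the paper's code).

    similarity[i][j] scores video i against text j. Diagonal entries are
    the positive pairs; every off-diagonal entry in a row is a negative.
    """
    n = len(similarity)
    return -sum(log_softmax(similarity[i])[i] for i in range(n)) / n


def translation_loss(token_logits, target_tokens):
    """Token-level cross-entropy for ONE (video, text) pair.

    token_logits[t] holds hypothetical decoder scores over the vocabulary
    at step t; target_tokens[t] is the index of the reference token.
    No other samples from the batch are involved.
    """
    total = 0.0
    for logits, tok in zip(token_logits, target_tokens):
        total -= log_softmax(logits)[tok]
    return total / len(target_tokens)


# With a single pair there are no negatives, so the contrastive loss is
# trivially zero and carries no learning signal.
single_pair = contrastive_loss([[5.0]])

# The translation loss is still informative for a single pair.
one_sequence = translation_loss([[0.0, 0.0, 0.0, 0.0]] * 2, [1, 3])
```

In this toy setting the contrastive loss collapses to zero when the batch holds a single pair, which is one way to see why such objectives lean on large batches of negatives, whereas the translation loss remains well defined for one pair on its own.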

