CV · Jul 29, 2020

Learning Video Representations from Textual Web Supervision

Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross

arXiv:2007.14937v2 · 17.951 citations

Originality: Incremental advance

AI Analysis

The paper provides a scalable method for video understanding by leveraging abundant web data, though the advance is incremental because text is already a known source of supervision.

The authors tackled the problem of learning video representations by using paired text from online videos as supervision, collecting 70M clips and training a model to align videos with text. This approach outperformed existing methods on action recognition tasks like Kinetics, HMDB-51, and UCF-101.

Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning.
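The abstract does not state the exact training objective, so the following is only a minimal sketch of what "pairing each video with its associated text" could look like, assuming a softmax cross-entropy over cosine similarities between already-computed video and text embeddings. The function names (`pairing_loss`, `match_accuracy`), the `temperature` parameter, and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairing_loss(video_embs, text_embs, temperature=0.1):
    """Mean cross-entropy of selecting the i-th text for the i-th video.

    Assumption: row i of video_embs and row i of text_embs come from the
    same web video; every other text in the batch acts as a negative.
    """
    total = 0.0
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(video_embs)


def match_accuracy(video_embs, text_embs):
    """Fraction of videos whose most similar text is their own."""
    hits = 0
    for i, v in enumerate(video_embs):
        sims = [cosine(v, t) for t in text_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(video_embs)


# Toy embeddings standing in for encoder outputs (hypothetical values).
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
```

In this toy setup the loss is small when each video's own text is the most similar one and large when the texts are swapped, which is the signal a video encoder would be trained on under this assumed objective.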
