CV | Nov 25, 2020

Can Temporal Information Help with Contrastive Self-Supervised Learning?

Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille

arXiv:2011.13046v1 | 17.741 citations

Originality Highly original

AI Analysis

This work provides a general paradigm to enhance unsupervised video representation learning, which is significant for researchers and practitioners developing video understanding models, offering consistent improvements over existing methods.

The authors investigated how to incorporate temporal information into contrastive self-supervised learning (CSL) for video understanding and found that directly applying temporal augmentations was ineffective. They developed TaCo, a paradigm that uses temporal transformations both as data augmentation and as a source of self-supervision. Their best model achieves 85.1% (UCF-101) and 51.6% (HMDB-51) top-1 accuracy, relative improvements of 3% and 2.4%, respectively, over the prior state of the art.

Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recently successful instance-discrimination-based contrastive self-supervised learning (CSL) framework remains unclear. We find that the intuitive solution, directly applying temporal augmentations, does not help and can even impair video CSL in general. This counter-intuitive observation motivates us to re-design existing video CSL frameworks for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo) as a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding. By jointly contrasting instances with enriched temporal transformations and learning these transformations as self-supervised signals, TaCo can significantly enhance unsupervised video representation learning. For instance, TaCo demonstrates consistent improvement in downstream classification tasks across a range of backbones and CSL approaches. Our best model achieves 85.1% (UCF-101) and 51.6% (HMDB-51) top-1 accuracy, which is a 3% and 2.4% relative improvement over the previous state-of-the-art.
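The joint objective described in the abstract (contrasting instances under temporal transformations while also predicting which transformation was applied) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the transformation set (`identity`, `reverse`, `speed_up`, `shuffle`), the InfoNCE form of the contrastive term, the cross-entropy task term, the `temperature` and `weight` values, and all function names.

```python
import math
import random

# Hypothetical transformation set; the abstract does not enumerate the real one.
TRANSFORMS = ["identity", "reverse", "speed_up", "shuffle"]


def apply_transform(frames, label, rng):
    """Apply the temporal transformation with index `label` to a list of frames."""
    frames = list(frames)
    name = TRANSFORMS[label]
    if name == "reverse":
        return frames[::-1]
    if name == "speed_up":
        return frames[::2]  # keep every second frame (2x playback)
    if name == "shuffle":
        rng.shuffle(frames)
        return frames
    return frames  # identity


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: the positive key sits at index 0 of the logits."""
    q = normalize(query)
    keys = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    return cross_entropy(logits, 0)


def joint_loss(query, positive, negatives, task_logits, task_label, weight=1.0):
    """Contrastive term plus a weighted transformation-prediction term (assumed form)."""
    contrastive = info_nce(query, positive, negatives)
    task = cross_entropy(task_logits, task_label)
    return contrastive + weight * task


if __name__ == "__main__":
    rng = random.Random(0)
    label = rng.randrange(len(TRANSFORMS))
    clip = apply_transform(range(8), label, rng)
    print(TRANSFORMS[label], clip)
    # Toy embeddings and task-head logits stand in for real encoder outputs.
    loss = joint_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]], [0.0] * 4, label)
    print(round(loss, 4))
```

In this sketch the transformation serves two roles at once: the transformed clip is the augmented view fed to the contrastive term, and its index is the label for the auxiliary prediction term. In a real system the embeddings and task logits would come from a video encoder and task heads rather than hand-written lists.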
