CV · Sep 2, 2022

Temporal Contrastive Learning with Curriculum

arXiv:2209.00760v1 · 2.64 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses video understanding for action recognition and retrieval, but it is an incremental advance because it builds on existing contrastive learning and curriculum learning techniques.

The authors tackle video representation learning with ConCur, a contrastive method that uses curriculum learning to sample positive clip pairs progressively from easy to hard, achieving state-of-the-art performance on the UCF101 and HMDB51 datasets for action recognition and video retrieval.
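The easy-to-hard positive sampling described above can be illustrated with a small, framework-free sketch. It assumes a linear curriculum in which the largest allowed temporal gap between the two positive clips grows from `min_gap` to `max_gap` frames over training; the paper's actual schedule, gap units, and sampling details are not given here, so `max_gap_at_epoch`, `sample_positive_pair`, and all parameter values are hypothetical.

```python
import random


def max_gap_at_epoch(epoch: int, total_epochs: int, min_gap: int, max_gap: int) -> int:
    """Hypothetical linear curriculum: the maximum start-to-start gap (in frames)
    between two positive clips grows from min_gap to max_gap over training."""
    if total_epochs <= 1:
        return max_gap
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return round(min_gap + progress * (max_gap - min_gap))


def sample_positive_pair(num_frames: int, clip_len: int, gap_limit: int,
                         rng: random.Random) -> tuple:
    """Return start frames (s1, s2) of two clips from the same video whose
    start-to-start distance is at most gap_limit (small limit = easy positives)."""
    max_start = num_frames - clip_len
    if max_start < 0:
        raise ValueError("video is shorter than one clip")
    s1 = rng.randint(0, max_start)           # inclusive bounds
    lo = max(0, s1 - gap_limit)
    hi = min(max_start, s1 + gap_limit)
    s2 = rng.randint(lo, hi)                 # second clip within the allowed span
    return s1, s2


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in (0, 50, 99):
        # Gap limit grows 0 -> 32 -> 64 frames with these settings.
        gap = max_gap_at_epoch(epoch, total_epochs=100, min_gap=0, max_gap=64)
        s1, s2 = sample_positive_pair(num_frames=300, clip_len=16, gap_limit=gap, rng=rng)
        print(epoch, gap, s1, s2, abs(s1 - s2))
```

With a gap limit of 0 the two "clips" share a start frame; in a typical contrastive setup they would still differ through independent augmentations. Here the schedule only bounds the gap from above, so later epochs can still draw some easy pairs; a variant that also raises a lower bound would force harder positives.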

We present ConCur, a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy on contrastive training. More specifically, ConCur starts contrastive training with easy positive samples (temporally close and semantically similar clips) and, as training progresses, increases the temporal span, effectively sampling hard positives (temporally distant and semantically dissimilar clips). To learn better context-aware representations, we also propose an auxiliary task of predicting the temporal distance between a positive pair of clips. We conduct extensive experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance on two benchmark tasks: video action recognition and video retrieval. We explore the impact of encoder backbones and pre-training strategies by using R(2+1)D and C3D encoders and pre-training on the Kinetics-400 and Kinetics-200 datasets. Moreover, a detailed ablation study shows the effectiveness of each component of our proposed method.
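How a contrastive objective and the auxiliary temporal-distance task might be combined is sketched below in plain Python. This is an illustration under stated assumptions, not ConCur's implementation: it uses the standard InfoNCE loss with in-batch negatives (the abstract does not name the exact contrastive loss), frames distance prediction as classification over quantized distance bins (it could equally be regression), and the names `info_nce`, `distance_bin`, `total_loss` and the weights `temperature` and `aux_weight` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def _log_sum_exp(xs: List[float]) -> float:
    m = max(xs)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in xs))


def info_nce(anchors, positives, temperature: float = 0.1) -> float:
    """InfoNCE with in-batch negatives: positives[i] is the positive for anchors[i];
    positives[j], j != i, act as negatives. Returns the mean loss over the batch."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [_cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        total += _log_sum_exp(logits) - logits[i]
    return total / n


def distance_bin(s1: int, s2: int, bin_size: int, num_bins: int) -> int:
    """Hypothetical auxiliary target: |s1 - s2| quantized into num_bins classes."""
    return min(abs(s1 - s2) // bin_size, num_bins - 1)


def cross_entropy(logits: List[float], target: int) -> float:
    return _log_sum_exp(logits) - logits[target]


def total_loss(anchors, positives, dist_logits: List[List[float]],
               starts: List[Tuple[int, int]], bin_size: int = 16, num_bins: int = 4,
               temperature: float = 0.1, aux_weight: float = 1.0) -> float:
    """Contrastive loss plus a weighted temporal-distance classification loss."""
    contrastive = info_nce(anchors, positives, temperature)
    aux = sum(cross_entropy(lg, distance_bin(s1, s2, bin_size, num_bins))
              for lg, (s1, s2) in zip(dist_logits, starts)) / len(starts)
    return contrastive + aux_weight * aux


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]
    # Uniform distance logits give an auxiliary loss of log(num_bins).
    print(total_loss(anchors, positives, [[0.0] * 4, [0.0] * 4], [(0, 40), (10, 10)]))
```

In practice the embeddings and distance logits would come from the video encoder and a small prediction head in a tensor framework; pure Python is used here only so the arithmetic is easy to follow.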

