Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
This work addresses the need for better video representation learning in computer vision, although the contribution is incremental because it builds on existing contrastive learning frameworks.
The paper tackles the problem of learning spatio-temporally fine-grained video representations via self-supervision, a setting in which existing methods are sub-optimal, and achieves competitive results on 6 datasets: Kinetics, UCF, HMDB, AVA-Kinetics, AVA, and OTB.
Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While very effective for learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task that requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB.
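To make the region-based pretext task more concrete, the following is a minimal sketch of one way such an objective could be set up. It is not the paper's implementation: the abstract does not specify the transformation module or the loss, so the single-head dot-product attention in `transform_with_context`, the InfoNCE-style loss in `region_info_nce`, the temperature value, and all function names are assumptions made for illustration. The idea shown is only that a region feature from one view is transformed, using context features from the other view, into a prediction that is then contrasted against the true region features of that other view.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(dot(v, v)) + eps
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transform_with_context(region, context):
    """Hypothetical stand-in for the context-guided transformation.

    Attends over the target view's context features (single-head
    dot-product attention) and adds the result to the source region
    feature as a residual. Assumed design, not taken from the paper.
    """
    scale = 1.0 / math.sqrt(len(region))
    weights = softmax([dot(region, c) * scale for c in context])
    attended = [
        sum(w * c[d] for w, c in zip(weights, context))
        for d in range(len(region))
    ]
    return [r + a for r, a in zip(region, attended)]


def region_info_nce(predicted, targets, temperature=0.1):
    """InfoNCE-style loss over regions (assumed loss form).

    predicted[i] should match targets[i]; every other target in the
    list acts as a negative.
    """
    preds = [l2_normalize(p) for p in predicted]
    tgts = [l2_normalize(t) for t in targets]
    loss = 0.0
    for i, p in enumerate(preds):
        logits = [dot(p, t) / temperature for t in tgts]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(preds)


# Toy usage with random features standing in for encoder outputs.
rng = random.Random(0)
dim, num_regions, num_context = 8, 4, 6
regions_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_regions)]
# View B: the same instances, slightly perturbed to mimic change over time.
regions_b = [[x + 0.05 * rng.gauss(0, 1) for x in r] for r in regions_a]
context_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_context)]

predicted_b = [transform_with_context(r, context_b) for r in regions_a]
loss = region_info_nce(predicted_b, regions_b)
print(f"region-level contrastive loss: {loss:.4f}")
```

In a real model the region and context features would come from a video encoder and the transformation would be learned; here they are random vectors and a fixed attention step, so the sketch only illustrates the data flow of the pretext task.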