CV · Dec 9, 2021

Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

arXiv:2112.05181v2 · 20 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for better video representation learning in computer vision. The advance is incremental because it builds on existing contrastive learning frameworks.

The paper tackles the problem of learning spatio-temporally fine-grained video representations via self-supervision, for which existing instance-level objectives are sub-optimal, and reports competitive results on 6 datasets: Kinetics, UCF, HMDB, AVA-Kinetics, AVA, and OTB.

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on 6 datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA and OTB.
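
The region-based pretext task described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture or the loss, so the weight-free cross-attention in `predict_regions`, the InfoNCE-style loss in `info_nce`, the temperature value, and all tensor shapes are assumptions chosen only to show the general idea of predicting a region's representation in a second view from context features and scoring it contrastively.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def predict_regions(src_regions, tgt_context):
    """Hypothetical stand-in for the view-to-view transform.

    Region features from the source view act as queries over context
    features from the target view (single-head attention, no learned
    weights). A real model would use learned projections.
    """
    d = src_regions.shape[-1]
    attn = softmax(src_regions @ tgt_context.T / np.sqrt(d), axis=-1)  # (N, M)
    return attn @ tgt_context  # (N, D)


def info_nce(pred, target, temperature=0.1):
    """InfoNCE-style loss (assumed): region i in `pred` should match
    region i in `target`; the other regions serve as negatives."""
    pred = l2_normalize(pred)
    target = l2_normalize(target)
    logits = pred @ target.T / temperature  # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


# Toy data: 4 regions and 10 context tokens, 16-dim features (all made up).
rng = np.random.default_rng(0)
src_regions = rng.normal(size=(4, 16))   # region features, view A
tgt_context = rng.normal(size=(10, 16))  # context features, view B
tgt_regions = rng.normal(size=(4, 16))   # region features, view B

pred = predict_regions(src_regions, tgt_context)
loss = info_nce(pred, tgt_regions)
```

In this sketch the loss is lowest when each predicted region feature is closer to its own target than to the targets of the other regions. Training would backpropagate that loss through a learned version of `predict_regions` and the feature extractor.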

Code Implementations: 2 repos
