CV · Apr 30, 2021

CoCon: Cooperative-Contrastive Learning

arXiv:2104.14764v1 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses self-supervised video representation learning, motivated by the impracticality of labeling videos at scale. It appears incremental, as it builds on existing contrastive learning frameworks.

The paper tackles self-supervised visual representation learning for videos, where standard contrastive learning can push apart instances that contain semantically similar events. It introduces a cooperative variant that leverages complementary information across views and reports competitive performance on the action recognition benchmarks UCF101, HMDB51, and Kinetics400.

Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to utilize complementary information across views and address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g., RGB) or inferred (e.g., flow, segmentation masks, poses). We are among the first to explore exploiting inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships.
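To make the issue raised in the abstract concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the paper's objective or its cooperative variant. The function `info_nce`, the temperature value, and the 2-D embeddings are all hypothetical and chosen only for illustration. The sketch shows that treating a clip with a semantically similar event as a negative raises the loss, which is the pressure that drives such instances apart.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (an assumption, not the paper's loss)."""
    anchor = normalize(anchor)
    # Cosine similarities scaled by the temperature; index 0 is the positive.
    logits = [dot(anchor, normalize(positive)) / temperature]
    logits += [dot(anchor, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings, for illustration only.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]          # another view of the same clip
unrelated = [-1.0, 0.2]        # a clip showing a different event
similar_event = [0.95, 0.05]   # a different clip showing a similar event

loss_unrelated = info_nce(anchor, positive, [unrelated])
loss_similar = info_nce(anchor, positive, [unrelated, similar_event])
print(loss_unrelated, loss_similar)
```

In this toy setup, `loss_similar` is larger than `loss_unrelated`, because the similar clip sits close to the anchor yet is counted as a negative. Minimizing the loss would therefore push the two similar instances apart, which is the behavior the cooperative variant is intended to mitigate.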

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
