CV, AI | Apr 10, 2022

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

arXiv:2204.04607v1 | 1 citations | h-index: 24
Originality: Incremental advance
AI Analysis

This work improves video representation learning for computer vision tasks by reducing reliance on background information, though it is incremental as it builds on existing contrastive learning and pretext task methods.

The paper addresses a weakness in self-supervised video representation learning: models tend to focus on background information that is unimportant for the task. It proposes the Motion-Contrastive Perception Network (MCPNet), which uses long-range residual frames to emphasize motion-specific features, and reports that it outperforms state-of-the-art visual-only self-supervised methods on the UCF-101 and HMDB-51 datasets.

Visual-only self-supervised learning has achieved significant improvement in video representation learning. Existing related methods encourage models to learn video representations by utilizing contrastive learning or designing specific pretext tasks. However, some models are likely to focus on the background, which is unimportant for learning video representations. To alleviate this problem, we propose a new view called the long-range residual frame to obtain more motion-specific information. Based on this, we propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP), to learn generic video representations by focusing on the changing areas in videos. Specifically, the MIP branch aims to learn fine-grained motion features, and the CIP branch performs contrastive learning to learn overall semantic information for each instance. Experiments on two benchmark datasets, UCF-101 and HMDB-51, show that our method outperforms current state-of-the-art visual-only self-supervised approaches.
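
To make the idea of a long-range residual frame concrete, here is a minimal sketch rather than the authors' implementation. It assumes a residual frame is the plain pixel-wise difference between two frames that are `gap` steps apart; the function name `long_range_residual`, the default `gap=4`, and the toy clip are all hypothetical choices, since the abstract does not specify how the residual is computed. The sketch only illustrates why such a view suppresses a static background while keeping the changing areas.

```python
import numpy as np


def long_range_residual(clip: np.ndarray, gap: int = 4) -> np.ndarray:
    """Difference between frames that are `gap` steps apart.

    clip: array of shape (T, H, W, C).
    Returns an array of shape (T - gap, H, W, C).
    The plain subtraction and the default gap are assumptions for illustration.
    """
    if clip.ndim != 4:
        raise ValueError("clip must have shape (T, H, W, C)")
    if not 0 < gap < clip.shape[0]:
        raise ValueError("gap must satisfy 0 < gap < number of frames")
    clip = clip.astype(np.float32)
    # Frame t + gap minus frame t, for every valid t.
    return clip[gap:] - clip[:-gap]


# Toy clip: a fixed random background with a bright square sliding to the right.
T, H, W = 8, 16, 16
rng = np.random.default_rng(0)
background = rng.random((H, W, 3)).astype(np.float32)
clip = np.repeat(background[None], T, axis=0)
for t in range(T):
    clip[t, 4:8, t:t + 4, :] = 1.0  # the square occupies rows 4..7 in every frame

residual = long_range_residual(clip, gap=4)

# Rows the square never touches cancel out; rows it moves through do not.
static_energy = float(np.abs(residual[:, 12:, :, :]).sum())
moving_energy = float(np.abs(residual[:, 4:8, :, :]).sum())
print(residual.shape, static_energy, moving_energy > 0)
```

In this toy setting the background contributes nothing to the residual, so a model fed the residual view sees only the regions that changed. Real videos with camera motion or lighting changes would not cancel so cleanly.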

