CV
Sep 26, 2021

Self-Supervised Video Representation Learning by Video Incoherence Detection

arXiv:2109.12493v1
3 citations
Originality: Incremental advance
AI Analysis

The paper addresses the problem of learning effective video representations without labeled data, which is crucial for computer vision applications. The contribution appears incremental, because it builds on existing coherence-based methods.

The paper tackles video representation learning by proposing a self-supervised method based on detecting incoherence in videos. It reports state-of-the-art performance among coherence-based methods in action recognition and video retrieval across various backbones and datasets.

This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed from multiple sub-clips hierarchically sampled from the same raw video, with various lengths of incoherence between them. The network is trained to learn high-level representations by predicting the location and length of incoherence, given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval using various backbone networks. Experiments show that our proposed method achieves state-of-the-art performance across different backbone networks and different datasets compared with previous coherence-based methods.
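
The construction of an incoherent clip can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified case with a single incoherence (one skipped span of frames) between two sub-clips, whereas the abstract describes multiple hierarchically sampled sub-clips. The function name `make_incoherent_clip` and the parameters `clip_len`, `min_gap`, and `max_gap` are hypothetical names chosen for illustration.

```python
import random


def make_incoherent_clip(num_frames, clip_len=16, min_gap=1, max_gap=8, rng=None):
    """Return (frame_indices, location, gap) for one incoherent clip.

    Assumption (not from the paper): exactly one incoherence, i.e. one
    skipped span of raw frames, separates two sub-clips.
    """
    rng = rng or random.Random()
    if num_frames < clip_len + max_gap:
        raise ValueError("video too short for the requested clip and gap")

    # Location label: position in the output clip where the second sub-clip starts.
    location = rng.randint(1, clip_len - 1)
    # Length label: number of raw frames skipped at that position.
    gap = rng.randint(min_gap, max_gap)

    # Choose a start frame so that clip_len + gap raw frames fit in the video.
    start = rng.randint(0, num_frames - clip_len - gap)

    first = list(range(start, start + location))
    second_start = start + location + gap
    second = list(range(second_start, second_start + clip_len - location))
    return first + second, location, gap


if __name__ == "__main__":
    indices, location, gap = make_incoherent_clip(100, rng=random.Random(0))
    print(indices, location, gap)
```

In a setup like this, `location` and `gap` would serve as the self-supervised targets that the network predicts from the frames selected by `frame_indices`. How the paper encodes these targets and combines them with the contrastive objective is not specified in the abstract.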

