Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos
This work addresses limited positive-pair diversity in self-supervised video learning, a concern for both researchers and practitioners, though the contribution is incremental because it builds on prior contrastive methods.
Existing video contrastive learning methods use only clips from the same video as positives. The paper addresses this limitation by adding nearest-neighbor videos from the global space as additional positive pairs, which improves performance on various video tasks.
Contrastive learning has recently narrowed the gap between self-supervised and supervised methods in the image and video domains. State-of-the-art video contrastive learning methods such as CVRL and $ρ$-MoCo spatiotemporally augment two clips from the same video as positives. By only sampling positive clips locally from a single video, these methods neglect other semantically related videos that can also be useful. To address this limitation, we leverage nearest-neighbor videos from the global space as additional positive pairs, thus improving positive key diversity and introducing a more relaxed notion of similarity that extends beyond video and even class boundaries. Our method, Inter-Intra Video Contrastive Learning (IIVCL), improves performance on a range of video tasks.
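To make the idea of combining intra-video and inter-video positives concrete, the following is a minimal sketch and not the paper's implementation. It assumes an InfoNCE-style loss, a queue of embeddings from other videos that serves both as the nearest-neighbor search space and as the negative set, and an equal-weighted sum of the two terms. The names `inter_intra_loss`, `queue`, `temperature`, and `inter_weight` are hypothetical and chosen for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def nearest_neighbor_index(keys, queue):
    """Index of the most similar queue entry for each key (rows L2-normalized)."""
    return np.argmax(keys @ queue.T, axis=1)


def info_nce(query, positive, negatives, temperature=0.1, exclude=None):
    """InfoNCE loss with one positive per query and a shared negative set.

    query, positive: (B, D); negatives: (K, D); all rows L2-normalized.
    exclude: optional (B,) indices into negatives to mask out per row,
    used here so a nearest-neighbor positive is not also a negative.
    """
    pos = np.sum(query * positive, axis=1, keepdims=True) / temperature  # (B, 1)
    neg = query @ negatives.T / temperature                              # (B, K)
    if exclude is not None:
        neg[np.arange(len(query)), exclude] = -np.inf
    logits = np.concatenate([pos, neg], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())


def inter_intra_loss(q, k, queue, temperature=0.1, inter_weight=1.0):
    """Sum of an intra-video term and an inter-video (nearest-neighbor) term.

    q, k: embeddings of two augmented clips from the same videos, shape (B, D).
    queue: embeddings of clips from other videos, shape (K, D).
    The weighting and the queue design are assumptions of this sketch.
    """
    q, k, queue = l2_normalize(q), l2_normalize(k), l2_normalize(queue)
    # Intra-video: the second clip of the same video is the positive.
    intra = info_nce(q, k, queue, temperature)
    # Inter-video: the key's nearest neighbor in the queue is the positive.
    nn_idx = nearest_neighbor_index(k, queue)
    inter = info_nce(q, queue[nn_idx], queue, temperature, exclude=nn_idx)
    return intra + inter_weight * inter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))
    k = rng.normal(size=(4, 8))
    queue = rng.normal(size=(16, 8))
    print(inter_intra_loss(q, k, queue))
```

In this sketch the intra term pulls together two clips of the same video, while the inter term pulls a clip toward a different video that is close in embedding space, which is one way to read the "more relaxed notion of similarity" described above.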