Pixel-level Correspondence for Self-Supervised Learning from Video
This work addresses the challenge of leveraging video for self-supervised learning in computer vision. It offers a method that improves dense prediction tasks, though the contribution appears incremental, since it builds on existing contrastive learning and optical flow techniques.
The paper tackles the problem of learning dense visual representations from video without labels. It proposes Pixel-level Correspondence (PiCo), which tracks points with optical flow to match local features across time. PiCo outperforms self-supervised baselines on multiple dense prediction tasks while maintaining image classification performance.
While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map which can be used to match local features at different points in time. We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks, without compromising performance on image classification.
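To make the correspondence idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dense flow field stored as per-pixel (dx, dy) displacements and an InfoNCE-style loss; the abstract only states that the method is contrastive and does not specify the loss. The names `track_points`, `dense_info_nce`, and `temperature` are hypothetical and chosen for illustration.

```python
import math


def track_points(flow, points):
    """Follow each (x, y) point along a dense flow field into the next frame.

    Assumption: flow[y][x] is a (dx, dy) displacement in pixels.
    Returns {point index: (x, y) in the next frame}, dropping points
    that land outside the frame.
    """
    height, width = len(flow), len(flow[0])
    matches = {}
    for i, (x, y) in enumerate(points):
        dx, dy = flow[y][x]
        tx, ty = round(x + dx), round(y + dy)  # snap to a nearby pixel
        if 0 <= tx < width and 0 <= ty < height:
            matches[i] = (tx, ty)
    return matches


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def dense_info_nce(feats_a, feats_b, points, matches, temperature=0.1):
    """InfoNCE-style loss over tracked points (an assumed loss form).

    feats_a[y][x] and feats_b[y][x] are local feature vectors from two
    frames. For each tracked point, the positive is the feature at its
    matched location in the second frame; the features at the other
    matched locations act as negatives.
    """
    if not matches:
        return 0.0
    targets = list(matches.values())
    losses = []
    for k, i in enumerate(matches):
        x, y = points[i]
        anchor = feats_a[y][x]
        logits = [cosine(anchor, feats_b[ty][tx]) / temperature
                  for tx, ty in targets]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        lse = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(lse - logits[k])  # negative log-softmax of the positive
    return sum(losses) / len(losses)
```

In this sketch the loss is small when features at tracked locations agree across frames and large when they do not. In practice the features would presumably come from an encoder's spatial feature map and the flow from an off-the-shelf estimator, neither of which is modeled here.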