Self-Supervised Image Representation Learning with Geometric Set Consistency
This work addresses the challenge of learning image representations without semantic labels. Aimed at computer vision researchers, it offers an incremental improvement by integrating geometric constraints into existing contrastive methods.
The paper tackles self-supervised image representation learning by incorporating 3D geometric consistency priors into a contrastive learning framework. After fine-tuning, the learned representations outperform state-of-the-art methods on downstream tasks such as semantic segmentation, object detection, and instance segmentation.
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency. Our intuition is that 3D geometric consistency priors, such as smooth regions and surface discontinuities, may imply consistent semantics or object boundaries, and can act as strong cues to guide the learning of 2D image representations without semantic labels. Specifically, we introduce 3D geometric consistency into a contrastive learning framework to enforce feature consistency within image views. We propose to use geometric consistency sets as constraints and adapt the InfoNCE loss accordingly. We show that the learned image representations are general: by fine-tuning our pre-trained representations on various 2D image-based downstream tasks, including semantic segmentation, object detection, and instance segmentation on real-world indoor scene datasets, we achieve superior performance compared with state-of-the-art methods.
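To make the idea of adapting InfoNCE to geometric consistency sets concrete, the following is a minimal sketch and not the exact formulation used in this work. It assumes that each pixel feature carries an integer set id derived from 3D geometry, that features are mean-pooled per set and L2-normalised, and that a set in one feature map is contrasted against the sets in a second feature map of the same scene. The function names `pool_by_set` and `set_info_nce`, the mean pooling, and the temperature value are all illustrative assumptions.

```python
import math
from collections import defaultdict


def pool_by_set(features, set_ids):
    """Mean-pool feature vectors that share a set id, then L2-normalise.

    Mean pooling is an assumption of this sketch, not a detail from the paper.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, sid in zip(features, set_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(feat)
        for k, value in enumerate(feat):
            sums[sid][k] += value
        counts[sid] += 1

    pooled = {}
    for sid, total in sums.items():
        mean = [v / counts[sid] for v in total]
        norm = math.sqrt(sum(v * v for v in mean)) or 1.0  # guard against zero vectors
        pooled[sid] = [v / norm for v in mean]
    return pooled


def set_info_nce(feats_a, ids_a, feats_b, ids_b, temperature=0.1):
    """InfoNCE over pooled set features.

    For each set present in both inputs, the same set in the other input is
    the positive and every other shared set is a negative.
    """
    pooled_a = pool_by_set(feats_a, ids_a)
    pooled_b = pool_by_set(feats_b, ids_b)
    shared = sorted(set(pooled_a) & set(pooled_b))
    if not shared:
        raise ValueError("the two inputs share no geometric consistency set")

    loss = 0.0
    for pos_index, i in enumerate(shared):
        # Cosine similarities (features are unit length) scaled by temperature.
        logits = [
            sum(x * y for x, y in zip(pooled_a[i], pooled_b[j])) / temperature
            for j in shared
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[pos_index]
    return loss / len(shared)


# Toy example: four pixel features grouped into two hypothetical sets.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = set_info_nce(feats, [0, 0, 1, 1], feats, [0, 0, 1, 1])
swapped = set_info_nce(feats, [0, 0, 1, 1], feats, [1, 1, 0, 0])
```

In the toy example, the loss is small when the set assignments of the two inputs agree and large when they are swapped, which is the behaviour a set-level consistency constraint is meant to reward. A practical implementation would run on batched tensors in a deep learning framework and would need a rule for sets that are visible in only one input; this sketch simply ignores such sets.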