Self-Supervised YOLO: Leveraging Contrastive Learning for Label-Efficient Object Detection
This work addresses label efficiency for real-time object detection, offering an incremental improvement by adapting existing self-supervised learning (SSL) methods to YOLO detectors.
The paper reduces the dependence of YOLO object detectors on labeled datasets by pretraining their backbones with contrastive self-supervised learning on unlabeled images. This yields higher mAP, faster convergence, and better precision-recall performance; a SimCLR-pretrained YOLOv8 reaches a mAP@50:95 of 0.7663.
One-stage object detectors such as the YOLO family achieve state-of-the-art performance in real-time vision applications but remain heavily reliant on large-scale labeled datasets for training. In this work, we present a systematic study of contrastive self-supervised learning (SSL) as a means to reduce this dependency by pretraining YOLOv5 and YOLOv8 backbones on unlabeled images using the SimCLR framework. Our approach introduces a simple yet effective pipeline that adapts YOLO's convolutional backbones as encoders, employs global pooling and projection heads, and optimizes a contrastive loss using augmentations of the COCO unlabeled dataset (120k images). The pretrained backbones are then fine-tuned on a cyclist detection task with limited labeled data. Experimental results show that SSL pretraining leads to consistently higher mAP, faster convergence, and improved precision-recall performance, especially in low-label regimes. For example, our SimCLR-pretrained YOLOv8 achieves a mAP@50:95 of 0.7663, outperforming its supervised counterpart despite using no annotations during pretraining. These findings establish a strong baseline for applying contrastive SSL to one-stage detectors and highlight the potential of unlabeled data as a scalable resource for label-efficient object detection.
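The contrastive objective named in the abstract can be illustrated with a minimal sketch of the NT-Xent loss used by the SimCLR framework. This is not the authors' implementation. It assumes the inputs are the projection-head outputs for two augmented views of each image, with the backbone, global pooling, and projection head left out. The function names and the temperature of 0.5 are assumptions chosen for illustration, since the abstract does not state them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss over a batch of paired projections.

    z_a[i] and z_b[i] are the projections of two augmented views of
    image i. The temperature default is an illustrative assumption.
    """
    n = len(z_a)
    z = [l2_normalize(v) for v in z_a + z_b]  # 2N unit vectors
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Cosine similarity of view i to every view, scaled by temperature.
        logits = [
            sum(p * q for p, q in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
        ]
        # The denominator covers all views except the anchor itself.
        denom = sum(math.exp(logits[k]) for k in range(2 * n) if k != i)
        total += math.log(denom) - logits[pos]
    return total / (2 * n)


# Two toy "images" whose views agree, versus views paired with the wrong image.
aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is lower when both views of an image map to nearby projections (`aligned`) than when they are paired with a different image (`shuffled`). That gap is the signal that drives the backbone to learn augmentation-invariant features without labels.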