Self-Supervised Contrastive Learning for Multi-Label Images
This work addresses the need for more efficient and more broadly applicable self-supervised learning methods for multi-label images, though the contribution is incremental because it tailors existing approaches.
The paper argues that mainstream self-supervised learning is inefficient and confined to single-label datasets such as ImageNet. It adapts the approach to multi-label images and reports competitive performance in linear fine-tuning and transfer learning with fewer samples.
Self-supervised learning (SSL) has demonstrated its effectiveness in learning representations through contrastive methods that align with human intuition. However, mainstream SSL methods rely heavily on large-scale single-label datasets such as ImageNet, which results in prohibitive pre-training overhead. In addition, the more general case of multi-label images is frequently overlooked in SSL, despite their richer semantic information and broader applicability in downstream scenarios. We therefore tailor the mainstream SSL approach to preserve strong representation learning capability while using fewer multi-label images. First, we propose a block-wise augmentation module that extracts additional potential positive view pairs from multi-label images. Second, we devise an image-aware contrastive loss that establishes connections between these views, thereby facilitating the extraction of semantically consistent representations. Comprehensive linear fine-tuning and transfer learning experiments validate the competitiveness of our approach despite the challenges posed by sample quality and quantity.
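
The abstract does not specify how the two components are implemented, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that block-wise augmentation can be approximated by cutting an image into a regular grid of crop boxes, and that an image-aware loss can be approximated by a supervised-contrastive-style objective in which views from the same source image count as positives. The names `block_views`, `image_aware_loss`, `grid`, and `temperature` are hypothetical.

```python
import math


def block_views(height, width, grid=2):
    """Return (top, left, bottom, right) crop boxes for a grid x grid split.

    Assumption: a regular grid stands in for the paper's block-wise
    augmentation, whose actual design is not given in the abstract.
    """
    block_h, block_w = height // grid, width // grid
    return [
        (r * block_h, c * block_w, (r + 1) * block_h, (c + 1) * block_w)
        for r in range(grid)
        for c in range(grid)
    ]


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def image_aware_loss(embeddings, image_ids, temperature=0.1):
    """Contrastive loss where views sharing an image id are positives.

    Assumption: this is a generic supervised-contrastive-style loss used
    for illustration, not the loss defined in the paper.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if image_ids[j] == image_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in others}
        # Log-sum-exp over all other views, shifted for numerical stability.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

In this sketch, each box from `block_views` would be cropped, augmented, and encoded, and every resulting embedding would carry the index of its source image as its entry in `image_ids`. The loss is lower when embeddings of blocks from the same image are close together than when they are not.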