On the Discriminability of Self-Supervised Representation Learning
This work addresses a key limitation of self-supervised learning for visual tasks, offering an incremental improvement that enhances discriminability and narrows the performance gap with supervised methods.
The paper tackles the discriminability gap between self-supervised and supervised learning. It identifies the 'crowding problem', in which classes are poorly separated and intra-class variance is high, and proposes the Dynamic Semantic Adjuster (DSA) to improve feature aggregation. DSA yields substantial performance gains that narrow the gap with supervised learning on benchmark datasets.
Self-supervised learning (SSL) has recently shown notable success in various visual tasks. However, in terms of discriminability, SSL is still not on par with supervised learning (SL). This paper identifies a key issue, the ``crowding problem,'' in which features from different classes are not well separated and intra-class variance is high. In contrast, SL ensures clear class separation. Our analysis reveals that SSL objectives do not adequately constrain the relationships between samples and their augmentations, leading to poorer performance on complex tasks. We further establish a theoretical framework that connects SSL objectives to cross-entropy risk bounds, explaining how reducing intra-class variance and increasing inter-class separation can improve generalization. To address this, we propose the Dynamic Semantic Adjuster (DSA), a learnable regulator that enhances feature aggregation and separation while being robust to outliers. Comprehensive experiments on diverse benchmark datasets show that DSA leads to substantial gains in SSL performance, narrowing the performance gap with SL.
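The two quantities behind the crowding problem, intra-class variance and inter-class separation, can be made concrete with a small diagnostic. The sketch below is illustrative only and is not the paper's metric or any part of DSA. The function name `crowding_stats` and the specific definitions are assumptions made for this example: intra-class variance is taken as the mean squared distance of samples to their class centroid, and separation as the mean pairwise distance between centroids.

```python
from itertools import combinations
from math import dist
from statistics import fmean


def crowding_stats(features, labels):
    """Return (intra_class_variance, inter_class_separation).

    Hypothetical diagnostic, not the paper's definition. Requires at
    least two distinct labels.
    """
    # Group feature vectors by class label.
    groups = {}
    for x, y in zip(features, labels):
        groups.setdefault(y, []).append(x)

    # Class centroid: per-dimension mean of the class's feature vectors.
    centroids = {y: [fmean(col) for col in zip(*xs)] for y, xs in groups.items()}

    # Intra-class variance: mean squared distance to the class centroid,
    # averaged over classes.
    intra = fmean(
        fmean(dist(x, centroids[y]) ** 2 for x in xs)
        for y, xs in groups.items()
    )

    # Inter-class separation: mean Euclidean distance between centroid pairs.
    inter = fmean(dist(a, b) for a, b in combinations(centroids.values(), 2))
    return intra, inter


# Toy example: two tight, well-separated classes in 2D.
feats = [[0, 0], [0, 2], [10, 0], [10, 2]]
labs = [0, 0, 1, 1]
print(crowding_stats(feats, labs))  # (1.0, 10.0)
```

Under these assumed definitions, a crowded representation shows a large first value relative to the second. Comparing the pair for SSL and SL features from the same labelled evaluation set is one simple way to inspect the gap described above.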