CV | Jul 13, 2022

Unsupervised Visual Representation Learning by Synchronous Momentum Grouping

Bo Pang, Yifan Zhang, Yaoyi Li, Jia Cai, Cewu Lu

arXiv:2207.06167v1 | 15.337 citations | h-index: 27

Originality: Highly original

AI Analysis

The paper addresses two problems in unsupervised visual representation learning: false negatives in instance-level contrastive methods and supervisory-signal hysteresis in clustering-based methods. It is assessed as a significant advance rather than an incremental improvement.

The paper tackles the problem of inefficient supervisory signals in unsupervised visual representation learning by proposing SMoG, a group-level contrastive method that integrates advantages from instance-level and clustering-based approaches, achieving linear evaluation performance on ImageNet that surpasses both current SOTA unsupervised methods and vanilla supervised learning.

In this paper, we propose a genuine group-level contrastive visual representation learning method whose linear evaluation performance on ImageNet surpasses that of vanilla supervised learning. The two mainstream unsupervised learning schemes are the instance-level contrastive framework and clustering-based schemes. The former adopts extremely fine-grained instance-level discrimination, whose supervisory signal is inefficient because of false negatives. Although the latter solve this problem, they commonly come with restrictions that affect performance. To integrate their advantages, we design the SMoG method. SMoG follows the framework of contrastive learning but replaces the contrastive unit, moving from instances to groups, mimicking clustering-based methods. To achieve this, we propose the momentum grouping scheme, which conducts feature grouping synchronously with representation learning. In this way, SMoG solves the problem of supervisory-signal hysteresis that clustering-based methods usually face, and reduces the false negatives of instance contrastive methods. We conduct exhaustive experiments to show that SMoG works well on both CNN and Transformer backbones. The results show that SMoG surpasses the current SOTA unsupervised representation learning methods. Moreover, its linear evaluation results surpass those obtained by vanilla supervised learning, and the representation transfers well to downstream tasks.
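The abstract names the idea (contrast against groups instead of instances, and update the groups in step with the features) but does not give the update rule or the loss. The following is therefore only a minimal sketch under stated assumptions, not the authors' implementation: it assumes unit-normalized features, nearest-group assignment by dot product, an exponential-moving-average group update with a hypothetical coefficient `beta`, and a softmax cross-entropy loss with a hypothetical `temperature`. All function names are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign_group(feature, groups):
    """Index of the group feature most similar to `feature` (assumed rule)."""
    sims = [dot(feature, g) for g in groups]
    return max(range(len(groups)), key=sims.__getitem__)

def momentum_update(groups, features, assignments, beta=0.99):
    """Move each group toward the mean of the features assigned to it.

    Groups with no assigned feature in this batch are left unchanged.
    """
    dim = len(groups[0])
    updated = []
    for k, g in enumerate(groups):
        members = [f for f, a in zip(features, assignments) if a == k]
        if not members:
            updated.append(g)
            continue
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        mixed = [beta * g[d] + (1 - beta) * mean[d] for d in range(dim)]
        updated.append(normalize(mixed))
    return updated

def group_contrastive_loss(feature, target_group, groups, temperature=0.1):
    """Cross-entropy of softmax(feature . groups / temperature) vs. the target."""
    logits = [dot(feature, g) / temperature for g in groups]
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[target_group]

# Toy example: two groups, two augmented views of one image.
groups = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
view_a = normalize([0.9, 0.1])
view_b = normalize([0.8, 0.3])

target = assign_group(view_b, groups)            # group of the other view
loss = group_contrastive_loss(view_a, target, groups)
groups = momentum_update(groups, [view_b], [target])  # same step as the loss
```

In this sketch the group features are refreshed in the same step that computes the loss, which is one way to read "synchronously"; a clustering-based pipeline that re-clusters only once per epoch would instead train against stale assignments. Because the negatives are other groups rather than other images, two images that fall into the same group are not pushed apart.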

