CV · Sep 7, 2022

MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

Qiang Zhou, Chaohui Yu, Hao Luo, Zhibin Wang, Hao Li

arXiv:2209.03063v214.129 citations · h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in self-supervised learning for computer vision, the weaker linear separability of MIM pre-trained representations relative to contrastive ones, and offers an incremental improvement over existing methods.

The paper aims to improve the linear separability of masked image modeling (MIM) pre-trained representations. It proposes MimCo, a framework that combines MIM with contrastive learning through two-stage pre-training and, with ViT-S and a pre-trained MoCov3-ViT-S teacher, reaches 82.53% top-1 fine-tuning accuracy on ImageNet-1K after 100 epochs of pre-training.

Masked image modeling (MIM), which requires the target model to recover the masked part of the input image, has recently received much attention in self-supervised learning (SSL). Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially compared to those from contrastive learning pre-training. This leads us to ask whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to use different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses. Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo needs only 100 epochs of pre-training to achieve 82.53% top-1 fine-tuning accuracy on ImageNet-1K, which outperforms state-of-the-art self-supervised learning counterparts.
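
The abstract names two learning targets, patch-level and image-level reconstruction losses against a pre-trained contrastive teacher, but does not give their form. The following is only a rough sketch of how such a combined objective could look, not the authors' implementation. It assumes that the frozen teacher encodes the unmasked image, the student encodes the masked image, and each loss is one minus cosine similarity; all of these choices, along with the function names, the mean pooling, and the loss weights, are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small epsilon avoids division by zero


def mean_pool(patches):
    """Average per-patch vectors into one image-level vector (assumed pooling)."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]


def patch_loss(student_patches, teacher_patches, mask):
    """Assumed patch-level target: 1 - cosine, averaged over masked patches only."""
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        raise ValueError("mask must select at least one patch")
    total = sum(1.0 - cosine(student_patches[i], teacher_patches[i]) for i in masked)
    return total / len(masked)


def image_loss(student_patches, teacher_patches):
    """Assumed image-level target: 1 - cosine between pooled features."""
    return 1.0 - cosine(mean_pool(student_patches), mean_pool(teacher_patches))


def mimco_style_loss(student_patches, teacher_patches, mask, w_patch=1.0, w_image=1.0):
    """Weighted sum of the two targets; the weights are hypothetical."""
    return (w_patch * patch_loss(student_patches, teacher_patches, mask)
            + w_image * image_loss(student_patches, teacher_patches))


if __name__ == "__main__":
    random.seed(0)
    num_patches, dim = 8, 4
    # Toy stand-ins for encoder outputs; a real setup would use ViT features.
    teacher = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    student = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    mask = [i % 2 == 0 for i in range(num_patches)]  # mask every other patch
    print("combined loss:", mimco_style_loss(student, teacher, mask))
```

In this sketch the teacher features are plain inputs, which mirrors the two-stage setup described in the abstract: the contrastive model is trained first and then only supplies targets while the student is pre-trained.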
