CV | Oct 19, 2022

A Unified View of Masked Image Modeling

Microsoft
arXiv:2210.10615v1 | 44 citations | h-index: 102
Originality: Incremental advance
AI Analysis

This work targets the label-hungry problem in computer vision, that is, the reliance of large vision models on extensive labeled data, and presents an incremental improvement in masked image modeling.

The paper tackles the problem of training large-scale vision Transformers without extensive labeled data by proposing MaskDistill, a masked image modeling method that reconstructs normalized semantic features from teacher models. It achieves 88.3% top-1 accuracy on ImageNet-1k and 58.8% mIoU on ADE20k, showing comparable or superior performance to state-of-the-art methods.

Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
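The objective described in the abstract (reconstructing normalized teacher features at the masked positions only) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the choice of per-token layer normalization, the smooth-L1 distance, and the names `layer_norm`, `smooth_l1`, and `masked_distill_loss` are assumptions made for illustration, and random arrays stand in for real teacher and student outputs.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's feature vector to zero mean, unit variance.
    Assumption: parameter-free layer norm is the normalization used."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def smooth_l1(diff, beta=1.0):
    """Element-wise smooth-L1 (Huber-style) distance. Assumed loss choice."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < beta,
                    0.5 * abs_diff ** 2 / beta,
                    abs_diff - 0.5 * beta)

def masked_distill_loss(student_pred, teacher_feat, mask):
    """Average distance between student predictions and normalized
    teacher features, computed over masked patches only.

    student_pred, teacher_feat: arrays of shape (num_patches, dim)
    mask: boolean array of shape (num_patches,), True = patch was masked
    """
    target = layer_norm(teacher_feat)
    per_token = smooth_l1(student_pred - target).mean(axis=-1)
    return per_token[mask].mean()

# Toy data: 196 patches (a 14x14 grid) with 8-dimensional features.
rng = np.random.default_rng(0)
num_patches, dim = 196, 8
teacher_feat = rng.normal(size=(num_patches, dim))  # stand-in teacher output
student_pred = rng.normal(size=(num_patches, dim))  # stand-in student output

# Mask a fixed number of patches (78 of 196, roughly 40%; ratio is illustrative).
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=78, replace=False)] = True

loss = masked_distill_loss(student_pred, teacher_feat, mask)
print(f"masked distillation loss: {loss:.4f}")
```

In this sketch the student would see only the corrupted image, while the teacher sees the full image; because the loss is indexed by `mask`, predictions at visible positions do not contribute to it.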

Code Implementations: 1 repo
