CV · May 8, 2022

ConvMAE: Masked Convolution Meets Masked Autoencoders

arXiv:2205.03892v2 · 159 citations · h-index: 82 · Has Code
Originality Highly original
AI Analysis

This work targets the efficiency and accuracy of masked-autoencoder pretraining for vision models and is aimed at researchers and practitioners in computer vision.

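The paper tackles the high computational cost and the pretraining-finetuning discrepancy that arise when the original masking strategy is applied directly to a multi-scale hybrid convolution-transformer encoder. It introduces ConvMAE, which combines masked convolution with a block-wise masking strategy. Compared with MAE-Base, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% and, on object detection, gains 2.9% box AP and 2.2% mask AP while finetuning for 25 epochs instead of 100.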

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly in order to strengthen them. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base fine-tuned for only 25 epochs surpasses MAE-Base fine-tuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
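The two ideas named in the abstract, block-wise masking and masked convolution, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `blockwise_masks` and `masked_conv2d`, the 14x14 coarse grid, the 75% mask ratio, and the upsampling factors of 4 and 2 are all assumptions chosen for illustration. The sketch samples a random token mask on the coarsest grid, repeats each token into a square block so the finer stages share the same masked regions, and zeroes masked positions before and after a convolution so that masked content cannot influence visible outputs.

```python
import numpy as np


def blockwise_masks(grid=14, mask_ratio=0.75, scales=(4, 2, 1), seed=0):
    """Sample a token mask on a coarse grid and upsample it block-wise.

    Hypothetical helper: returns {scale: boolean mask}, True = masked.
    Each coarse token becomes a scale x scale block at the finer stage.
    """
    rng = np.random.default_rng(seed)
    num_tokens = grid * grid
    num_masked = int(round(num_tokens * mask_ratio))
    flat = np.zeros(num_tokens, dtype=bool)
    flat[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    coarse = flat.reshape(grid, grid)
    return {s: coarse.repeat(s, axis=0).repeat(s, axis=1) for s in scales}


def masked_conv2d(x, mask, kernel):
    """Single-channel 'same' cross-correlation that ignores masked inputs.

    Assumes an odd-sized kernel. Masked positions are zeroed on the way in
    (no leakage into visible outputs) and on the way out (no output there).
    """
    keep = ~mask
    x = x * keep
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out * keep


if __name__ == "__main__":
    masks = blockwise_masks()
    features = np.random.default_rng(1).normal(size=masks[4].shape)
    output = masked_conv2d(features, masks[4], np.ones((3, 3)) / 9.0)
    print({s: m.shape for s, m in masks.items()}, output.shape)
```

Because every stage reuses the same coarse mask, the set of visible regions is identical across resolutions, which is what would let a transformer stage process only the visible tokens. The released code at the repository linked in the abstract is the authoritative reference for how ConvMAE actually does this.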

