Improving Masked Autoencoders by Learning Where to Mask
This work targets a limitation of masked image modeling in self-supervised computer vision: patches are usually masked at random. It replaces random masking with a learned masking strategy and reports incremental improvements over random masking methods.
The paper tackles the limitations of random masking in masked image modeling by proposing AutoMAE, a framework that learns adaptive masking strategies focused on patches with higher information density, yielding more effective pretrained models on self-supervised benchmarks and downstream tasks.
Masked image modeling is a promising self-supervised learning method for visual data. It is typically built upon image patches with random masks, which largely ignores how information density varies across patches. The question is: Is there a better masking strategy than random sampling, and how can we learn it? We empirically study this problem and find that introducing object-centric priors in mask sampling can significantly improve the learned representations. Inspired by this observation, we present AutoMAE, a fully differentiable framework that uses Gumbel-Softmax to interlink an adversarially trained mask generator and a mask-guided image modeling process. In this way, our approach can adaptively find patches with higher information density for different images, and further strike a balance between the information gain obtained from image reconstruction and its practical training difficulty. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
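
To make the role of Gumbel-Softmax in mask sampling more concrete, the following is a minimal sketch of how a mask could be drawn from per-patch scores. It is an illustration under stated assumptions, not the AutoMAE implementation: the function names `gumbel_softmax_weights` and `sample_mask`, the `mask_ratio` and `tau` parameters, the top-k selection step, and the example logits are all hypothetical. It covers only the forward sampling step in plain Python; in an automatic-differentiation framework, the relaxed weights would be the quantity through which gradients could flow back to a mask generator.

```python
import math
import random


def gumbel_softmax_weights(logits, tau=1.0, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mask(logits, mask_ratio=0.75, tau=1.0, rng=random):
    """Mask the patches with the largest perturbed weights (assumed top-k rule).

    Returns (mask, weights), where mask[i] == 1 means patch i is masked.
    """
    weights = gumbel_softmax_weights(logits, tau, rng)
    num_masked = int(round(mask_ratio * len(logits)))
    order = sorted(range(len(logits)), key=lambda i: weights[i], reverse=True)
    masked = set(order[:num_masked])
    mask = [1 if i in masked else 0 for i in range(len(logits))]
    return mask, weights


if __name__ == "__main__":
    # Hypothetical scores for a 4x4 grid of patches; higher means the
    # (assumed) mask generator considers the patch more informative.
    patch_logits = [2.0, 2.0, 0.0, 0.0] * 4
    mask, _ = sample_mask(patch_logits, mask_ratio=0.75, rng=random.Random(0))
    print(mask)
```

With all logits equal, this reduces to uniform random masking, so the random baseline is a special case of the sketch. Lowering `tau` makes the relaxed weights closer to one-hot, while the ranking used to pick masked patches is unaffected by `tau`.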