Improving Pixel-based MIM by Reducing Wasted Modeling Capability
This work targets a specific inefficiency in pixel-based Masked Image Modeling (MIM): modeling capability wasted on reconstructing high-frequency details. Reducing that waste improves performance on downstream tasks.
The paper addresses the bias of pixel-based Masked Image Modeling (MIM) toward high-frequency details. It proposes a method that uses low-level features from shallow layers to aid pixel reconstruction, which improves convergence and, for a smaller model such as ViT-S, yields gains of 1.2% on fine-tuning and 2.8% on linear probing.
There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain.
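To make the idea of multi-level feature fusion concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not state which fusion operator is used, so the softmax-weighted average here is an assumption, and the names `softmax`, `fuse_features`, `selected_layers`, and `fusion_logits` are hypothetical. The sketch relies on one property of isotropic architectures such as the standard ViT: every layer outputs the same number of tokens with the same feature dimension, so shallow and deep features can be combined element-wise without resizing.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(layer_outputs, selected_layers, fusion_logits):
    """Fuse token features from several encoder layers (assumed operator).

    layer_outputs:   one token matrix per layer, each shaped [num_tokens][dim]
    selected_layers: indices of the layers to fuse (e.g. a shallow and the last)
    fusion_logits:   one scalar per selected layer; softmax turns them into
                     weights that sum to 1 (learnable in a real model)
    """
    if len(selected_layers) != len(fusion_logits):
        raise ValueError("need exactly one fusion logit per selected layer")
    weights = softmax(fusion_logits)
    num_tokens = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, idx in zip(weights, selected_layers):
        feats = layer_outputs[idx]
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused


# Toy stand-in for a 4-layer isotropic encoder: layer i outputs a
# 3-token, 2-dim matrix filled with the value i + 1.
num_tokens, dim = 3, 2
layer_outputs = [
    [[float(i + 1)] * dim for _ in range(num_tokens)] for i in range(4)
]

# Fuse the shallowest layer (index 0) with the last layer (index 3)
# using equal logits, i.e. equal weights of 0.5 each.
fused = fuse_features(layer_outputs, selected_layers=[0, 3],
                      fusion_logits=[0.0, 0.0])
print(fused[0])
```

In a full pipeline, the fused features would replace the last-layer output as the input to the pixel-reconstruction decoder, giving the decoder direct access to low-level information from shallow layers.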