SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations
This work addresses scale invariance in computer vision, reporting incremental improvements in low-resolution recognition.
The paper tackles scale variance in images by using image scale as a self-supervised signal in Masked Image Modeling, achieving 82.1% accuracy on ImageNet-1K and surpassing DeriveNet by 1.3% on a very low-resolution recognition task.
Because scale variance is prevalent in natural images, we propose using image scale as a self-supervised signal for Masked Image Modeling (MIM). Our method selects random patches from the input image and downsamples them to low resolution. Our framework draws on recent advances in super-resolution (SR) to design the prediction head, which reconstructs the input from low-resolution clues and other patches. After 400 epochs of pre-training, our Super Resolution Masked Autoencoders (SRMAE) achieve an accuracy of 82.1% on the ImageNet-1K task. The image scale signal also allows SRMAE to capture scale-invariant representations. On the very low resolution (VLR) recognition task, our model achieves the best performance, surpassing DeriveNet by 1.3%. Our method also achieves an accuracy of 74.84% on low-resolution facial expression recognition, surpassing the current state-of-the-art FMD by 9.48%.
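The corruption step described above, in which random patches are replaced by low-resolution versions of themselves, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `downsample_random_patches`, the patch size of 16, the 75% ratio, the downsampling factor of 4, the use of average pooling, and the nearest-neighbour upsampling back to the original patch size (done here only so the array keeps its shape) are all assumptions made for illustration.

```python
import numpy as np


def downsample_random_patches(image, patch_size=16, ratio=0.75, factor=4, rng=None):
    """Replace a random subset of patches with low-resolution versions.

    Hypothetical helper, not taken from the paper.
    image: float array of shape (H, W, C); H and W must be divisible by
    patch_size, and patch_size must be divisible by factor.
    Returns the corrupted image and a boolean mask over the patch grid
    (True marks a patch that was downsampled).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    grid_h, grid_w = h // patch_size, w // patch_size
    num_patches = grid_h * grid_w
    num_selected = int(round(ratio * num_patches))

    # Choose which patches to degrade, without replacement.
    chosen = rng.choice(num_patches, size=num_selected, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(grid_h, grid_w)

    out = image.copy()
    low = patch_size // factor  # side length of the low-resolution patch
    for i, j in zip(*np.nonzero(mask)):
        ys, xs = i * patch_size, j * patch_size
        patch = image[ys:ys + patch_size, xs:xs + patch_size]
        # Average pooling over factor x factor blocks (assumed downsampler).
        pooled = patch.reshape(low, factor, low, factor, c).mean(axis=(1, 3))
        # Nearest-neighbour upsampling so the patch fits back in place.
        restored = pooled.repeat(factor, axis=0).repeat(factor, axis=1)
        out[ys:ys + patch_size, xs:xs + patch_size] = restored
    return out, mask
```

Calling `downsample_random_patches(img)` on a `(224, 224, 3)` array would return a corrupted copy of the same shape together with a `(14, 14)` mask. In a setup like the one the abstract describes, the corrupted image would be the encoder input and the original image the reconstruction target for the SR-style prediction head.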