CV | Apr 12, 2023

Hard Patches Mining for Masked Image Modeling

Haochen Wang, Kaiyou Song, Junsong Fan, Yuxi Wang, Jin Xie, Zhaoxiang Zhang

arXiv:2304.05919v1 | 25.090 citations | h-index: 59 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses how to choose mask strategies in MIM for scalable visual representation learning. The approach could benefit computer vision tasks, though it appears incremental because it builds on existing MIM paradigms.

The paper aims to improve masked image modeling (MIM) by proposing Hard Patches Mining (HPM), a framework in which the model acts as both student and teacher: it predicts patch-wise reconstruction losses and then masks the patches it expects to be hardest to reconstruct. The authors report that this improves representation learning and is effective at constructing masked images across various settings.

Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting specific contents of masked patches, and their performance depends heavily on pre-defined mask strategies. Intuitively, this procedure can be considered as training a student (the model) to solve given problems (predicting masked patches). However, we argue that the model should not only focus on solving given problems, but also stand in the shoes of a teacher and produce more challenging problems by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally serve as the metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor that first predicts patch-wise losses and then decides where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective leads to powerful representations, verifying the value of being aware of which regions are hard to reconstruct.
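
The two ideas in the abstract, masking the patches with the highest predicted loss and learning relative rather than exact loss values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the top-k selection rule, the 0.75 default mask ratio, and the pairwise binary cross-entropy formulation are all assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def hard_patch_mask(pred_loss, mask_ratio=0.75):
    """Boolean mask (True = masked) over the patches with the highest predicted loss.

    Assumption: plain top-k selection; the abstract does not give the exact rule.
    """
    n = pred_loss.shape[0]
    k = int(round(n * mask_ratio))
    order = np.argsort(-pred_loss, kind="stable")  # hardest patches first
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask


def relative_ranking_loss(pred_loss, true_loss):
    """Pairwise loss that depends only on the ordering of true_loss.

    Assumption: for each patch pair (i, j), apply binary cross-entropy between
    sigmoid(pred_i - pred_j) and the indicator [true_i > true_j].
    """
    diff = pred_loss[:, None] - pred_loss[None, :]
    target = (true_loss[:, None] > true_loss[None, :]).astype(float)
    valid = true_loss[:, None] != true_loss[None, :]  # skip ties and the diagonal
    if not valid.any():
        return 0.0
    # Numerically stable log(sigmoid(x)) = -logaddexp(0, -x)
    log_p = -np.logaddexp(0.0, -diff)
    log_not_p = -np.logaddexp(0.0, diff)
    bce = -(target * log_p + (1.0 - target) * log_not_p)
    return float(bce[valid].mean())


if __name__ == "__main__":
    predicted = np.array([0.1, 0.9, 0.5, 0.3])  # hypothetical predicted patch losses
    actual = np.array([1.0, 4.0, 3.0, 2.0])     # hypothetical reconstruction losses
    print(hard_patch_mask(predicted, mask_ratio=0.5))
    print(relative_ranking_loss(predicted, actual))
```

Because the target in this sketch is built only from comparisons between reconstruction losses, rescaling those losses by any positive factor leaves the objective unchanged, which is one way to read "prevent overfitting to exact reconstruction loss values".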
