CV, LG | May 28, 2022

Object-wise Masked Autoencoders for Fast Pre-training

arXiv:2205.14338v1 | 16 citations | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses slow self-supervised pre-training in computer vision. It offers a more efficient method that is an incremental optimization of existing approaches.

The paper tackles the high computational cost of masked image encoding models by introducing ObjMAE, which uses object selection and division to drop non-object patches, reducing compute cost by 72% while maintaining competitive performance on four datasets.

Self-supervised pre-training for images without labels has recently achieved promising performance in image classification. The success of the transformer-based methods ViT and MAE has drawn the community's attention to the design of backbone architectures and self-supervised tasks. In this work, we show that current masked image encoding models learn the underlying relationships between all objects in the whole scene, instead of a single-object representation. As a result, those methods spend considerable compute time on self-supervised pre-training. To solve this issue, we introduce a novel object selection and division strategy that drops non-object patches and learns object-wise representations through selective reconstruction with region-of-interest masks. We refer to this method as ObjMAE. Extensive experiments on four commonly used datasets demonstrate the effectiveness of our model in reducing compute cost by 72% while achieving competitive performance. Furthermore, we investigate inter-object and intra-object relationships and find that the latter are crucial for self-supervised pre-training.
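
The patch-dropping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `patch_object_flags` and `select_patches`, the toy binary mask, the rule "a patch counts as an object patch if any of its pixels lies on the object", and the 0.75 mask ratio are all assumptions made for illustration. The sketch only shows how discarding non-object patches before random masking shrinks the set of tokens an encoder would have to process.

```python
import random


def patch_object_flags(mask, patch):
    """Return one boolean per patch: True if any pixel in the patch is on the object.

    `mask` is a 2D list of 0/1 values (a hypothetical region mask); its height
    and width must be multiples of `patch`. Patches are listed in row-major order.
    """
    h, w = len(mask), len(mask[0])
    if h % patch or w % patch:
        raise ValueError("mask size must be a multiple of the patch size")
    flags = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            on_object = any(
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            )
            flags.append(on_object)
    return flags


def select_patches(flags, mask_ratio, rng):
    """Drop non-object patches, then hide a random share of the object patches.

    Returns (visible, hidden): patch indices an encoder would see, and patch
    indices left as reconstruction targets. Non-object patches appear in neither.
    """
    object_ids = [i for i, is_obj in enumerate(flags) if is_obj]
    n_hidden = int(len(object_ids) * mask_ratio)
    hidden = set(rng.sample(object_ids, n_hidden))
    visible = [i for i in object_ids if i not in hidden]
    return visible, sorted(hidden)


# Toy 8x8 mask with a 4x4 object in the top-left corner; 2x2 patches give 16 patches.
toy_mask = [[1 if y < 4 and x < 4 else 0 for x in range(8)] for y in range(8)]
flags = patch_object_flags(toy_mask, patch=2)
visible, hidden = select_patches(flags, mask_ratio=0.75, rng=random.Random(0))
print(len(flags), sum(flags), len(visible), len(hidden))  # prints: 16 4 1 3
```

In this toy case only 4 of the 16 patches overlap the object, so 12 patches are discarded outright and the random masking operates on the remaining 4. The numbers are specific to the toy mask and say nothing about the 72% figure reported in the paper.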

