Rethinking Patch Dependence for Masked Autoencoders
This work addresses the computational inefficiency of MAE-style visual pretraining, offering computer vision researchers a more efficient alternative to standard MAE.
The study investigates the role of inter-patch dependencies in masked autoencoders (MAE) for representation learning. It finds that reconstruction relies on the global representation learned by the encoder rather than on interactions between patches in the decoder. This motivates CrossMAE, which uses only cross-attention in the decoder and achieves comparable or superior performance to MAE while reducing computational requirements.
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
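To make the "independent readout" idea concrete, the following is a minimal NumPy sketch of a cross-attention-only decoding step. It is not the authors' implementation: it assumes a single head and a single layer, omits layer normalization and the MLP, and uses random weights. The function name `cross_attention_decode`, the dimensions, and all weight names are hypothetical. What it illustrates is that when mask-token queries attend only to visible-token encoder outputs, each masked patch is decoded independently, so decoding a subset of masked patches gives the same values as decoding all of them and selecting that subset.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_decode(queries, visible, w_q, w_k, w_v, w_out):
    """Decode masked patches using cross-attention only (illustrative).

    queries: (M, D) mask-token queries (e.g. mask token + positional embedding)
    visible: (N, D) encoder outputs for the visible patches
    Returns: (M, P) per-patch reconstructions.
    """
    q = queries @ w_q                        # (M, D)
    k = visible @ w_k                        # (N, D)
    v = visible @ w_v                        # (N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (M, N)
    attn = softmax(scores, axis=-1)          # each query attends to visible tokens only
    return (attn @ v) @ w_out                # (M, P)


# Hypothetical sizes: embedding dim D, pixels per patch P.
rng = np.random.default_rng(0)
D, P = 16, 12
n_visible, n_masked = 5, 15

visible = rng.normal(size=(n_visible, D))
queries = rng.normal(size=(n_masked, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
w_out = rng.normal(size=(D, P))

# Decode every masked patch, then only a small subset of them.
full = cross_attention_decode(queries, visible, w_q, w_k, w_v, w_out)
subset_idx = np.array([1, 4, 9])
partial = cross_attention_decode(queries[subset_idx], visible, w_q, w_k, w_v, w_out)

# No query depends on any other query, so the two results agree.
independent = np.allclose(full[subset_idx], partial)
print(independent)
```

Because there is no self-attention among mask tokens in this sketch, the attention matrix has shape (M, N) and shrinks in proportion to the number of masked patches actually decoded, which is one plausible reading of where the reduced compute comes from.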