CV | Oct 13, 2022

Exploring Long-Sequence Masked Autoencoders

Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen

arXiv:2210.07224v116.723 citations | h-index: 39 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses scaling challenges in computer vision pre-training, offering an incremental improvement for tasks like detection and segmentation.

The researchers investigated how input specifications affect Masked Autoencoder (MAE) pre-training and identified sequence length as a key scaling factor. They developed a long-sequence MAE variant that improves object detection and semantic segmentation performance without extra computational cost during transfer.

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to discrete tokens in natural languages, the input for image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage, and find that sequence length is a key axis that further scales MAE. Our study leads to a long-sequence version of MAE with minimal changes to the original recipe, by just decoupling the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all the experimental setups without extra computation cost during the transfer. While long-sequence pre-training proves most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and only increasing the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.
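To make "decoupling the mask size from the patch size" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name decoupled_mask and the default values (224-pixel image, 8-pixel patches, 16-pixel mask blocks, 75% mask ratio) are hypothetical choices made for this example. The idea shown is that masking decisions are sampled on a coarse grid of mask blocks, while the token sequence is built from a finer grid of patches, so the sequence gets longer without the masked regions getting smaller.

```python
import random


def decoupled_mask(image_size=224, patch_size=8, mask_size=16,
                   mask_ratio=0.75, seed=0):
    """Sample a mask on a coarse block grid and expand it to the patch grid.

    Returns a square list of lists of booleans over patches (True = masked).
    All default values are illustrative assumptions, not the paper's settings.
    """
    if image_size % mask_size != 0 or mask_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of mask_size, "
                         "and mask_size a multiple of patch_size")

    blocks_per_side = image_size // mask_size    # coarse grid for masking
    patches_per_block = mask_size // patch_size  # patches along one block edge
    num_blocks = blocks_per_side ** 2
    num_masked = round(mask_ratio * num_blocks)

    # Choose which coarse blocks are masked.
    rng = random.Random(seed)
    masked_blocks = set(rng.sample(range(num_blocks), num_masked))

    # Every patch inherits the decision of the block that contains it.
    grid = blocks_per_side * patches_per_block
    return [
        [
            (row // patches_per_block) * blocks_per_side
            + (col // patches_per_block) in masked_blocks
            for col in range(grid)
        ]
        for row in range(grid)
    ]


mask = decoupled_mask()
seq_len = len(mask) * len(mask[0])
num_masked_patches = sum(sum(row) for row in mask)
print(seq_len, num_masked_patches)  # prints: 784 588
```

With these assumed settings the sequence has 784 patch tokens instead of the 196 that 16-pixel patches would give, while the masked regions are still 16-pixel blocks: 147 of the 196 blocks are masked, and each covers 4 patches.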

