CV | Oct 13, 2022

Exploring Long-Sequence Masked Autoencoders

arXiv:2210.07224v1 | 23 citations | h-index: 39
Originality: Incremental advance
AI Analysis

This work addresses scaling challenges in computer vision pre-training, offering an incremental improvement for tasks like detection and segmentation.

The researchers investigated how input specifications affect Masked Autoencoder (MAE) pre-training and identified sequence length as a key scaling factor. They developed a long-sequence MAE variant, obtained by decoupling the mask size from the patch size, that improves object detection and semantic segmentation performance without extra computational cost during transfer.

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to discrete tokens in natural languages, the input for image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage, and find sequence length is a key axis that further scales MAE. Our study leads to a long-sequence version of MAE with minimal changes to the original recipe, by just decoupling the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all the experimental setups without extra computation cost during the transfer. While long-sequence pre-training is discerned most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and only increasing the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.
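The idea of decoupling the mask size from the patch size can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `block_mask`, its parameters, and the specific values used (224-pixel images, 16- and 8-pixel patches, a 0.75 mask ratio) are assumptions chosen for illustration. The sketch samples the mask on a coarse grid of mask-sized blocks and then expands it to the finer patch grid, so halving the patch size quadruples the sequence length while the masked regions keep the same pixel size.

```python
import random


def block_mask(image_size, patch_size, mask_size, mask_ratio, seed=0):
    """Sample a random mask in units of mask_size and expand it to patches.

    Returns a square list of lists of booleans over the patch grid, where
    True marks a masked patch. All names here are hypothetical.
    """
    if image_size % mask_size != 0 or mask_size % patch_size != 0:
        raise ValueError(
            "image_size must be a multiple of mask_size, "
            "and mask_size a multiple of patch_size"
        )

    blocks_per_side = image_size // mask_size      # coarse grid used for sampling
    patches_per_block = mask_size // patch_size    # patches per block, per side
    num_blocks = blocks_per_side ** 2
    num_masked = round(mask_ratio * num_blocks)

    # Choose which coarse blocks are masked, without replacement.
    rng = random.Random(seed)
    masked_blocks = set(rng.sample(range(num_blocks), num_masked))

    # Every patch inherits the decision made for the block that contains it.
    grid = blocks_per_side * patches_per_block
    return [
        [
            (row // patches_per_block) * blocks_per_side
            + (col // patches_per_block) in masked_blocks
            for col in range(grid)
        ]
        for row in range(grid)
    ]


# Mask size equal to patch size: a 14 x 14 grid, i.e. 196 tokens.
standard = block_mask(image_size=224, patch_size=16, mask_size=16, mask_ratio=0.75)

# Smaller patches, same mask size: a 28 x 28 grid, i.e. 784 tokens,
# with each 2 x 2 group of patches masked or kept together.
long_seq = block_mask(image_size=224, patch_size=8, mask_size=16, mask_ratio=0.75)
```

In both calls the image size and the masked fraction are unchanged; only the number of patches the encoder sees differs, which is the sense in which sequence length becomes its own scaling axis.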

Code Implementations: 1 repo
