CV · Jun 11, 2024

Autoregressive Pretraining with Mamba in Vision

arXiv:2406.07537v1 · 25 citations · Has Code
AI Analysis

This work addresses the need for more efficient and scalable vision models using Mamba, offering incremental improvements in performance for the computer vision community.

The paper introduces autoregressive pretraining for Mamba-based vision models, which improves accuracy and enables scaling to larger model sizes: a base-size model reaches 83.2% ImageNet accuracy (2.0% above supervised training) and a huge-size model reaches 85.0%.

The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, autoregressive pretraining capitalizes on Mamba's unidirectional recurrent structure, enabling faster overall training than other strategies such as masked modeling. Performance-wise, autoregressive pretraining gives the Mamba architecture markedly higher accuracy than its supervised-trained counterparts and, more importantly, unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.
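To make the link between autoregressive prediction and a unidirectional recurrence concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the scalar recurrence `h_t = a*h_{t-1} + b*x_t` is a toy stand-in for a state space layer, the raster patch order and the mean-squared-error objective are assumptions, and the names `causal_scan` and `next_patch_loss` are hypothetical.

```python
import random


def causal_scan(tokens, a=0.9, b=0.1):
    """Toy linear recurrence h_t = a*h_{t-1} + b*x_t, applied per feature.

    A stand-in for a unidirectional state space layer (NOT Mamba itself):
    the state after token t depends only on tokens 0..t.
    """
    h = [0.0] * len(tokens[0])
    states = []
    for x in tokens:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def next_patch_loss(patches, a=0.9, b=0.1):
    """Mean squared error between the state after patch t and patch t+1.

    One left-to-right pass yields a prediction target at every position,
    which is the property the abstract ties to training efficiency.
    """
    states = causal_scan(patches[:-1], a, b)
    targets = patches[1:]
    total, count = 0.0, 0
    for pred, tgt in zip(states, targets):
        for p, t in zip(pred, tgt):
            total += (p - t) ** 2
            count += 1
    return total / count


# Assumed toy input: a 3x3 grid of 4-dimensional patches in raster order.
random.seed(0)
patches = [[random.random() for _ in range(4)] for _ in range(9)]
loss = next_patch_loss(patches)

# Causality check: changing the last patch leaves all earlier states unchanged.
perturbed = [p[:] for p in patches]
perturbed[-1] = [0.0] * 4
earlier_states_unchanged = causal_scan(patches)[:-1] == causal_scan(perturbed)[:-1]
```

Because the scan never looks ahead, no attention mask or masked-token bookkeeping is needed in this sketch; a real model would replace the fixed coefficients `a` and `b` with learned, input-dependent parameters.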

Code Implementations: 1 repo