CV Aug 30, 2024

Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training

Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang

arXiv:2408.17081v22.0 h-index: 8 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the training challenges for Vision Mamba models, which are important for efficient visual data processing, but it is incremental as it focuses on a regularization technique rather than a fundamental breakthrough.

The paper tackles the under-explored training methodologies of Vision Mamba (Vim) models by proposing Stochastic Layer-Wise Shuffle (SLWS), a regularization method that improves Vim training without architectural changes, achieving leading performance on ImageNet-1K compared to similar models.

Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential remain insufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that effectively improves Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to achieve leading performance on ImageNet-1K compared with counterparts of a similar type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability.
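The four per-layer steps described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `layer_shuffle_prob` and `slws_layer`, the `max_prob` hyperparameter, and the exact form of the linear schedule (layer index starting at 0, deepest layer reaching `max_prob`) are assumptions made for illustration only. Tokens are modeled as a plain list, and `layer` stands in for any sequence-to-sequence block.

```python
import random
from typing import Callable, List, Optional, Sequence


def layer_shuffle_prob(layer_idx: int, num_layers: int, max_prob: float = 0.5) -> float:
    """Step 1, probability allocation: shuffle rate grows linearly with depth.

    Assumed schedule: the deepest layer gets max_prob (a hypothetical knob).
    """
    return max_prob * (layer_idx + 1) / num_layers


def slws_layer(
    tokens: Sequence,
    layer: Callable[[List], List],
    layer_idx: int,
    num_layers: int,
    training: bool = True,
    max_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List:
    """Apply `layer` to `tokens`, stochastically shuffling the token order."""
    rng = rng if rng is not None else random.Random()

    # Deactivated during inference: the layer sees the original order.
    if not training:
        return layer(list(tokens))

    # Step 2, operation sampling: one Bernoulli trial per layer call.
    p = layer_shuffle_prob(layer_idx, num_layers, max_prob)
    if rng.random() >= p:
        return layer(list(tokens))

    # Step 3, sequence shuffling: draw a random permutation of positions.
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]

    out = layer(shuffled)

    # Step 4, order restoration: out[pos] belongs to original position perm[pos].
    restored = [None] * len(tokens)
    for pos, src in enumerate(perm):
        restored[src] = out[pos]
    return restored
```

With a position-wise layer (one that treats each token independently), the restored output is identical to the unshuffled output, so the shuffle only affects order-sensitive blocks such as sequence scans. In a real model the same pattern would be applied to a tensor along its sequence dimension.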
