CV AI | Dec 17, 2024

Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training

Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You

arXiv:2412.12496v47.65 citations | h-index: 172 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency improvements for Vision Mamba models in computer vision, offering a fast recovery method that is incremental over existing token reduction techniques.

The paper tackles the performance degradation in Vision Mamba models when using token merging for efficiency by proposing a quick retraining method, achieving up to 35.9% accuracy recovery in minutes with minimal drops (e.g., 1.3% for Vim-S) and speedups up to 1.5x.

Vision Mamba has shown close to state-of-the-art performance on computer vision tasks, drawing much interest in increasing its efficiency. A promising approach is token reduction, which has been successfully applied to ViTs. However, pruning informative tokens in Mamba causes a large loss of key knowledge and degrades performance. The alternative, token merging, preserves more information than pruning but also suffers at large compression ratios. Our key insight is that a quick round of re-training after token merging yields robust results across various compression ratios. Empirically, once recovered by our proposed framework R-MeeTo, pruned Vims drop by only up to 0.9% accuracy on ImageNet-1K in our main evaluation. We show that this fast recovery is simple, effective, and achievable within minutes: in particular, 3 epochs of training yield a 35.9% accuracy spike on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2x (up to 1.5x) inference speedup.
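
To make the difference between merging and pruning concrete, here is a minimal pure-Python sketch of one token-merging step on a toy sequence. It assumes a bipartite matching scheme in the style of earlier ViT token-merging work, with cosine similarity and plain averaging. The abstract does not specify R-MeeTo's actual merging rule, so the names `cosine` and `merge_tokens`, the even/odd split, and the averaging are illustrative assumptions, not the authors' implementation. The re-training step that follows merging is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Hypothetical helper: merge up to r tokens into their most similar partner.

    Tokens at even positions (set A) are candidates to be merged away;
    tokens at odd positions (set B) are merge targets. The r most similar
    A->B pairs are averaged, and the surviving tokens keep their original
    order, which matters for a sequence model such as Mamba.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # For every A token, find its most similar B token.
    matches = []
    for i in a_idx:
        best_j = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        matches.append((cosine(tokens[i], tokens[best_j]), i, best_j))

    # Keep only the r highest-similarity pairs.
    matches.sort(key=lambda m: m[0], reverse=True)
    to_merge = matches[:min(r, len(matches))]

    # Accumulate sums and counts so each target becomes the mean of its group.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    merged_away = set()
    for _, i, j in to_merge:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1
        merged_away.add(i)

    # Rebuild the sequence in original order, skipping merged-away tokens.
    out = []
    for k in range(len(tokens)):
        if k in merged_away:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(len(toy), "->", len(merge_tokens(toy, 1)))
```

Unlike pruning, which would discard the merged-away token outright, averaging keeps part of its information in the surviving token. Per the abstract, this alone is not enough at large compression ratios, which is what the short re-training round is meant to address.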

