CV · Nov 24, 2024

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

arXiv:2411.15941v1 · 70 citations · h-index: 34 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, high-performance lightweight vision models for mobile or edge computing applications, representing an incremental improvement over existing Mamba-based approaches.

The paper tackles the problem of inefficient throughput in lightweight Mamba-based vision models by proposing MobileMamba, a framework that balances efficiency and performance through a three-stage network and a Multi-Receptive Field Feature Interaction module. It achieves up to 83.6% Top-1 accuracy and is up to 21 times faster than prior methods like LocalVim on GPU.
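As a rough illustration of what "multi-receptive field" can mean, the sketch below splits the channels of a 1-D feature map into three groups: one mixed globally, one filtered locally with several kernel sizes, and one passed through unchanged. Everything here is an assumption made for illustration: the function names, the equal three-way split, the kernel sizes, the box filter standing in for learned depthwise weights, and the mean pooling standing in for a global branch. It is not the paper's module, and the summary above does not say how the module's branches are arranged.

```python
def depthwise_conv1d(channel, kernel_size):
    """Zero-padded 1-D box filter applied to a single channel.

    A fixed averaging kernel stands in for learned depthwise weights
    (illustrative assumption).
    """
    half = kernel_size // 2
    n = len(channel)
    out = []
    for i in range(n):
        # Positions outside the signal contribute zero (zero padding).
        window = [channel[j] for j in range(i - half, i + half + 1) if 0 <= j < n]
        out.append(sum(window) / kernel_size)
    return out


def global_mix(channel):
    """Hypothetical global branch: every position sees the whole sequence.

    Mean pooling is only a placeholder for a long-range operator.
    """
    mean = sum(channel) / len(channel)
    return [mean for _ in channel]


def multi_receptive_field_block(features, kernel_sizes=(3, 5)):
    """Route channel groups through global, local, and identity branches.

    features: list of channels, each a list of floats of equal length.
    The equal three-way split is an assumption, not a published setting.
    """
    group = len(features) // 3
    global_part = features[:group]
    local_part = features[group:2 * group]
    identity_part = features[2 * group:]

    out = [global_mix(ch) for ch in global_part]
    for idx, ch in enumerate(local_part):
        # Cycle through kernel sizes so local channels see different extents.
        k = kernel_sizes[idx % len(kernel_sizes)]
        out.append(depthwise_conv1d(ch, k))
    # Identity channels are copied through with no computation.
    out.extend(list(ch) for ch in identity_part)
    return out


if __name__ == "__main__":
    feats = [[1.0, 2.0, 3.0, 6.0], [3.0, 0.0, 3.0, 0.0], [5.0, 6.0, 7.0, 8.0]]
    for row in multi_receptive_field_block(feats):
        print(row)
```

The identity group costs nothing at inference time, which is one reason a design of this kind can trade a small amount of capacity for throughput.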

Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, and is up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
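The contrast between quadratic and linear complexity drawn in the abstract can be made concrete with a small sketch. It counts pairwise token interactions for full self-attention and compares that with the number of steps in a scalar linear state-space recurrence, h_t = a·h_{t-1} + b·x_t, y_t = c·h_t. The function names, the patch size of 16, and the fixed scalar parameters are assumptions for illustration. Mamba itself uses input-dependent parameters and a hardware-aware scan, which this sketch does not attempt to model.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches (tokens) in a square image.

    The patch size of 16 is an illustrative assumption.
    """
    return (image_size // patch_size) ** 2


def attention_pair_count(num_tokens: int) -> int:
    """Pairwise token interactions in full self-attention: N * N."""
    return num_tokens * num_tokens


def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state-space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t
    Returns the outputs and the number of recurrence steps (one per token).
    """
    h = 0.0
    ys = []
    steps = 0
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
        steps += 1
    return ys, steps


if __name__ == "__main__":
    for size in (224, 512, 1024):
        n = num_patches(size)
        _, steps = ssm_scan([1.0] * n)
        # Attention cost grows with n squared; the scan grows with n.
        print(size, n, attention_pair_count(n), steps)
```

Doubling the image side length quadruples the token count, so the attention pair count grows sixteenfold while the scan length grows only fourfold. Operation counts alone do not determine speed, though, which is consistent with the abstract's point that low FLOPs have not translated into high throughput for existing lightweight Mamba models.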

Code Implementations: 1 repo