CV · Jul 10, 2024

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

arXiv:2407.08083v2 · 375 citations · h-index: 26 · Has Code
AI Analysis

This work addresses the need for efficient and high-performance vision models for computer vision applications, representing an incremental improvement by integrating existing Mamba and Transformer components.

The authors propose MambaVision, a hybrid Mamba-Transformer vision backbone that reports state-of-the-art Top-1 accuracy and throughput on ImageNet-1K and outperforms comparably sized backbones on downstream tasks such as object detection and segmentation.

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision
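
The layer arrangement described in the abstract (Mamba-style blocks first, self-attention blocks in the final layers of a stage) can be illustrated with a small sketch. This is not the authors' implementation: the function name `stage_layout`, the block labels, and the choice of how many blocks use attention are all hypothetical and exist only to make the ordering concrete.

```python
from typing import List


def stage_layout(depth: int, num_attention: int) -> List[str]:
    """Return the block types for one stage, in execution order.

    Hypothetical helper: the first (depth - num_attention) blocks are
    Mamba-style mixers and the final num_attention blocks are
    self-attention. The split is a free parameter here, not a value
    taken from the paper.
    """
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    mixers = ["mamba_mixer"] * (depth - num_attention)
    attention = ["self_attention"] * num_attention
    return mixers + attention


if __name__ == "__main__":
    # Example: an 8-block stage where the last 4 blocks use self-attention.
    for index, block in enumerate(stage_layout(depth=8, num_attention=4)):
        print(index, block)
```

Placing the attention blocks at the end of the stage mirrors the abstract's claim that self-attention in the final layers helps capture long-range spatial dependencies; the actual block definitions are in the linked repository.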

Code Implementations: 3 repos
