CV | AI | Sep 15, 2024

Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion

arXiv:2409.09808v3 | 15 citations | h-index: 16
AI Analysis

This work addresses efficiency bottlenecks for researchers and practitioners using Vision Mamba models, though it is incremental as it builds on existing token fusion methods.

The paper tackles the training efficiency of Vision Mamba models by introducing Famba-V, a cross-layer token fusion technique that selectively fuses similar tokens across layers. On CIFAR-100, this reduces training time and peak memory usage and improves the accuracy-efficiency trade-off.

Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on the Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique to enhance the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies, instead of simply applying token fusion uniformly across all layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V enhances the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. Taken together, these results demonstrate Famba-V as a promising efficiency enhancement technique for Vim models.
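To make the idea of fusing similar tokens only in selected layers more concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes cosine similarity as the similarity measure, averaging as the fusion rule, and a simple alternating split of tokens into two sets for matching; the abstract does not specify any of these. The function names (`fuse_tokens`, `select_layers`, `run_layers`) and the strategy labels (`"all"`, `"interleaved"`, `"lower"`, `"upper"`) are hypothetical and chosen only for illustration, and the Vim block itself is replaced by an identity placeholder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_tokens(tokens, r):
    """Fuse up to r tokens into their most similar partner by averaging.

    Assumption: tokens at even positions (set A) are matched against tokens
    at odd positions (set B); the r best-matching A tokens are merged into
    their B partner. The relative order of surviving tokens is preserved.
    """
    n = len(tokens)
    a_idx = list(range(0, n, 2))
    b_idx = list(range(1, n, 2))
    if r <= 0 or not b_idx:
        return [list(t) for t in tokens]

    # For each A token, find its most similar B token.
    matches = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))
    matches.sort(key=lambda m: m[0], reverse=True)

    # Keep only the r most similar pairs and group sources by destination.
    groups = {}
    removed = set()
    for _, i, j in matches[:r]:
        groups.setdefault(j, []).append(i)
        removed.add(i)

    fused = []
    for k in range(n):
        if k in removed:
            continue
        members = [tokens[k]] + [tokens[i] for i in groups.get(k, [])]
        dim = len(tokens[k])
        fused.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return fused


def select_layers(num_layers, strategy):
    """Return the set of layer indices where fusion is applied (hypothetical strategies)."""
    if strategy == "all":          # uniform fusion in every layer
        return set(range(num_layers))
    if strategy == "interleaved":  # every other layer
        return set(range(1, num_layers, 2))
    if strategy == "lower":        # first half of the layers only
        return set(range(num_layers // 2))
    if strategy == "upper":        # second half of the layers only
        return set(range(num_layers // 2, num_layers))
    raise ValueError(f"unknown strategy: {strategy}")


def run_layers(tokens, num_layers, strategy, r):
    """Pass tokens through placeholder layers, fusing only in the selected ones."""
    fuse_at = select_layers(num_layers, strategy)
    counts = []
    for layer in range(num_layers):
        # A real Vim block would transform the tokens here; this sketch uses identity.
        if layer in fuse_at:
            tokens = fuse_tokens(tokens, r)
        counts.append(len(tokens))
    return tokens, counts
```

In this sketch, a uniform strategy shrinks the token sequence in every layer, while a cross-layer strategy such as `"upper"` leaves the early layers at full length and only reduces tokens later. That difference in where tokens are removed is the kind of choice the abstract credits for the accuracy-efficiency trade-off.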

Code Implementations: 1 repo