LG AI | Aug 14, 2025

eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing

Jiyong Kim, Jaeho Lee, Jiahao Lin, Alish Kanani, Miao Sun, Umit Y. Ogras, Jaehyun Park

arXiv:2508.10370v1 | 3 citations | h-index: 23 | Has Code | ACM Trans Embed Comput Syst

Originality: Incremental advance

AI Analysis

This work addresses the problem of deploying efficient sequence models on resource-constrained edge devices, offering a domain-specific incremental improvement.

The paper tackles the lack of hardware acceleration frameworks for Mamba models on edge devices by introducing eMamba, an end-to-end framework that achieves comparable accuracy with 1.63-19.9× fewer parameters and demonstrates 4.95-5.62× lower latency, 2.22-9.95× higher throughput, and 48.6× lower energy consumption on FPGA and ASIC implementations.

State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9× fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and as an ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95-5.62× lower latency and 2.22-9.95× higher throughput, with 4.77× smaller area, 9.84× lower power, and 48.6× lower energy consumption than baseline solutions while maintaining competitive accuracy.
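The abstract does not say which approximations eMamba uses for SiLU or exponentiation. As a rough illustration of the general idea only, the sketch below compares exact SiLU against a well-known piecewise alternative, hard-swish, `x * clip(x + 3, 0, 6) / 6`, which avoids the exponential entirely. The function names `silu`, `hard_silu`, and `max_abs_error` are hypothetical, and the choice of hard-swish is an assumption made for this example, not the paper's method.

```python
import math


def silu(x: float) -> float:
    """Exact SiLU: x * sigmoid(x). Needs an exponential and a division."""
    return x / (1.0 + math.exp(-x))


def hard_silu(x: float) -> float:
    """Hard-swish stand-in for SiLU (an assumption, not eMamba's scheme).

    Uses only an add, a clamp, a multiply, and a constant scale,
    which map to cheap fixed-point hardware.
    """
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def max_abs_error(lo: float = -8.0, hi: float = 8.0, steps: int = 1601) -> float:
    """Largest |silu(x) - hard_silu(x)| over an evenly spaced grid on [lo, hi]."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(silu(x) - hard_silu(x)) for x in xs)


if __name__ == "__main__":
    print(f"max |SiLU - hard-swish| on [-8, 8]: {max_abs_error():.4f}")
```

A fixed approximation like this one introduces a bounded but non-zero error. That is presumably why the framework pairs its approximations with an approximation-aware search that tunes learnable parameters, rather than dropping an approximation into an already trained model.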
