LG · AI · CL · Jan 28, 2025

Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models

arXiv:2501.17088v1 · 14 citations · h-index: 10 · Has Code · NAACL
Originality: Synthesis-oriented
AI Analysis

This work targets efficiency improvements for SSM-based models; its contributions are incremental optimizations for sequence modeling tasks.

The paper tackles the inefficiency of Selective Structured State Space Models (SSMs) like Mamba by compressing them through removal of redundant components, achieving up to 1.4x inference speedup with minimal accuracy loss.

Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
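The abstract describes studying how sensitive a model is to the removal of selected components. As a rough illustration of that general idea, and not the authors' implementation, the sketch below greedily removes residual blocks from a toy scalar "model", each time dropping the block whose removal changes the outputs least on a small calibration set. All names here (`run`, `calibration_error`, `greedy_shed`, `toy_blocks`) are hypothetical, and the scalar residual stack is an assumption standing in for real Mamba or Transformer blocks.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model block: maps a scalar state to an update.
Block = Callable[[float], float]


def run(blocks: Sequence[Block], x: float) -> float:
    # Residual stack: each block adds its output to the running state.
    for block in blocks:
        x = x + block(x)
    return x


def calibration_error(
    candidate: Sequence[Block],
    reference: Sequence[Block],
    inputs: Sequence[float],
) -> float:
    # Mean absolute deviation of the pruned stack from the unpruned one.
    total = sum(abs(run(candidate, x) - run(reference, x)) for x in inputs)
    return total / len(inputs)


def greedy_shed(
    blocks: List[Block], inputs: Sequence[float], n_remove: int
) -> List[int]:
    """Return the original indices of the blocks that are kept."""
    kept = list(range(len(blocks)))
    for _ in range(n_remove):
        best_idx, best_err = None, float("inf")
        for i in kept:
            # Try the stack without block i and measure the damage.
            candidate = [blocks[j] for j in kept if j != i]
            err = calibration_error(candidate, blocks, inputs)
            if err < best_err:
                best_idx, best_err = i, err
        kept.remove(best_idx)
    return kept


# Toy stack (assumed values): blocks 1 and 3 contribute almost nothing.
toy_blocks: List[Block] = [
    lambda x: 0.5 * x,
    lambda x: 0.001 * x,
    lambda x: -0.3 * x,
    lambda x: 0.0,
]

kept = greedy_shed(toy_blocks, inputs=[1.0, 2.0, -1.5], n_remove=2)
print(kept)
```

On this toy stack the two near-zero blocks are the ones shed, leaving blocks 0 and 2. A real setting would swap the scalar deviation for a task metric on calibration data and would operate on actual model components at whatever granularity is being studied.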

Code Implementations: 1 repo