CL AI DC LG Feb 1, 2024

BlackMamba: Mixture of Experts for State-Space Models

arXiv:2402.01771v1, 40 citations, h-index: 25, Has Code
Originality: Highly original
AI Analysis

This work targets the compute cost of language modeling, offering AI researchers and practitioners an incremental hybrid architecture that pairs a state-space model with mixture-of-experts layers.

The paper tackles the efficiency of large-scale language models by combining state-space models (SSMs), whose time and memory cost is linear in sequence length, with mixture-of-experts (MoE) layers, which reduce the compute cost of inference. The resulting BlackMamba models perform competitively against Mamba and transformer baselines while outperforming them in inference and training FLOPs. The authors fully train models of up to 2.8B parameters on 300B tokens.

State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-experts (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms them in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines the benefits of both SSM and MoE architectures, pairing linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code open-source. Inference code is available at: https://github.com/Zyphra/BlackMamba
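
The MoE side of that trade-off (less compute per token, more memory) can be illustrated with a minimal top-k routing sketch. This is not the BlackMamba implementation: the function names, the toy experts, and the parameter counts are hypothetical, and the router logits are passed in directly rather than produced by a learned layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=1):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so that they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of all parameters that a single token actually uses."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Hypothetical toy experts: plain functions standing in for expert MLPs.
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
]
print(moe_forward([1.0, 2.0], toy_experts, router_logits=[0.0, 5.0], k=1))
print(active_fraction(num_experts=8, k=1, expert_params=100, shared_params=200))
```

With these made-up sizes, all 8 experts must be held in memory, but each token touches only the shared parameters plus one expert. That gap between active and total parameters is the general MoE trade-off the abstract describes.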

Code Implementations: 1 repo
