LG AI CL

Jan 8, 2024

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

arXiv:2401.04081v231.295 citations

h-index: 6

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the efficient scaling of sequence models by combining State Space Models with Mixture of Experts. The hybrid approach is an incremental advance, but one that meaningfully improves training efficiency.

The paper tackles the challenge of scaling State Space Models (SSMs) by combining them with Mixture of Experts (MoE). The resulting model, MoE-Mamba, outperforms both Mamba and a baseline Transformer-MoE, and reaches the same performance as Mamba in 2.35× fewer training steps.

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.
