CL · LG · Aug 22, 2024

Jamba-1.5: Hybrid Transformer-Mamba Models at Scale

arXiv:2408.12570v1 · 52 citations · h-index: 72 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficiency and scalability challenges for deploying large language models in long-context applications, though it is incremental as it builds on existing hybrid architectures.

The authors address the cost of serving large language models on long-context tasks with Jamba-1.5, a family of hybrid Transformer-Mamba mixture-of-experts models with an effective context length of 256K tokens. The models achieve high throughput and competitive benchmark performance, and the ExpertsInt8 quantization technique lets Jamba-1.5-Large process 256K-token contexts on a single machine with eight 80GB GPUs.

We present Jamba-1.5, new instruction-tuned large language models based on our Jamba architecture. Jamba is a hybrid Transformer-Mamba mixture of experts architecture, providing high throughput and low memory usage across context lengths, while retaining the same or better quality as Transformer models. We release two model sizes: Jamba-1.5-Large, with 94B active parameters, and Jamba-1.5-Mini, with 12B active parameters. Both models are fine-tuned for a variety of conversational and instruction-following capabilities, and have an effective context length of 256K tokens, the largest amongst open-weight models. To support cost-effective inference, we introduce ExpertsInt8, a novel quantization technique that allows fitting Jamba-1.5-Large on a machine with 8 80GB GPUs when processing 256K-token contexts without loss of quality. When evaluated on a battery of academic and chatbot benchmarks, Jamba-1.5 models achieve excellent results while providing high throughput and outperforming other open-weight models on long-context benchmarks. The model weights for both sizes are publicly available under the Jamba Open Model License and we release ExpertsInt8 as open source.
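The abstract does not describe how ExpertsInt8 works internally, so the following is only a generic sketch of symmetric int8 weight quantization, the general family of techniques the name suggests. It is not the authors' implementation. The function names `quantize_int8` and `dequantize`, the one-scale-per-row scheme, and the toy weight values are all assumptions made for illustration. The sketch shows why storing weights as int8 halves their memory footprint relative to 16-bit storage, at the price of a bounded rounding error.

```python
import random


def quantize_int8(row):
    """Symmetric int8 quantization of one weight row (illustrative only).

    Returns (codes, scale), where codes are integers in [-127, 127] and
    each original weight is approximated by code * scale.
    """
    max_abs = max(abs(w) for w in row)
    # Guard against an all-zero row so we never divide by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Toy stand-in for one row of an expert weight matrix (hypothetical values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12

# Storage cost: 2 bytes per weight at 16-bit precision, 1 byte at int8
# (plus one scale value per row, which is negligible for wide rows).
bytes_16bit = 2 * len(weights)
bytes_int8 = 1 * len(codes)
print(bytes_16bit, bytes_int8, max_err)
```

In a mixture-of-experts model most parameters sit in the expert layers, which is presumably why quantizing those weights is what brings the memory footprint down enough to fit a fixed GPU budget. The exact layers covered and the kernel-level details are in the paper and the released code, not in this sketch.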

Code Implementations: 2 repos
