LG · AI · Feb 20, 2025

Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

arXiv:2502.14458v2 · 24 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the need for more accessible and efficient language models for resource-constrained devices like smartphones and edge platforms, offering a practical alternative to Transformers.

The paper tackles inefficient inference in large language models by introducing Llamba, a family of recurrent models distilled from Llama-3.x into the Mamba architecture. Llamba achieves higher inference throughput and handles larger batch sizes than Transformers while maintaining comparable benchmark performance, using less than 0.1% of the training data typically used for models of similar size.

We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
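The batch-size claim above can be illustrated with simple memory arithmetic: a Transformer caches keys and values for every past token, so its inference memory grows with sequence length, while a recurrent model keeps one fixed-size state per sequence. The following is a minimal sketch of that comparison, not the paper's implementation. The function names, the layer and head counts, and the state size are all hypothetical values chosen for illustration and are not taken from Llamba or Llama-3.x.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for a Transformer KV cache: keys and values (factor 2)
    are stored for every token, in every layer."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(batch, n_layers, state_elems_per_layer, bytes_per_elem=2):
    """Memory for a recurrent model's state: one fixed-size state per
    sequence and layer, independent of how many tokens were processed."""
    return batch * n_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical configuration (illustrative only, not from the paper).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
STATE_ELEMS = 32 * 64 * 128  # assumed per-layer recurrent state size

if __name__ == "__main__":
    for seq_len in (1024, 8192, 65536):
        kv = kv_cache_bytes(1, seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
        rs = recurrent_state_bytes(1, N_LAYERS, STATE_ELEMS)
        print(f"seq_len={seq_len:>6}  KV cache: {kv / 2**20:8.1f} MiB  "
              f"recurrent state: {rs / 2**20:6.1f} MiB")
```

Under these assumed numbers the per-sequence KV cache grows linearly with context length while the recurrent state stays constant, which is why a fixed memory budget can fit more concurrent sequences for the recurrent model at long contexts.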

