CL | AI | LG | Nov 20, 2024

Hymba: A Hybrid-head Architecture for Small Language Models

arXiv:2411.13676v1 | 89 citations | h-index: 45 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient, higher-performing small language models, particularly for resource-constrained applications. It appears incremental, however, because it builds on existing transformer and SSM methods.

The paper tackles efficiency and performance in small language models by proposing Hymba, a hybrid-head architecture that integrates transformer attention with state space models. It reports state-of-the-art results for small LMs, including 1.32% higher average accuracy than Llama-3.2-3B, an 11.67x cache size reduction, and 3.49x throughput.

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
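The hybrid-head idea in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes a single attention head and a single diagonal linear-recurrence head standing in for the SSM, fuses the two by averaging their RMS-normalized outputs, and keeps meta tokens visible outside the sliding window. All names here (`hybrid_head`, `make_params`, `always_visible`) and all sizes are hypothetical, and the meta tokens are fixed random vectors rather than learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv, window=None, always_visible=0):
    """Causal single-head attention with an optional sliding window.
    Assumption: the first `always_visible` positions (meta tokens)
    are never dropped by the window."""
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    mask = j > i                # causal: no attending to the future
    if window is not None:
        too_old = ((i - j) >= window) & (j >= always_visible)
        mask = mask | too_old
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

def ssm_head(x, A, B, C):
    """Toy diagonal linear recurrence: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    A stand-in for an SSM head, not a Mamba implementation."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A * h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

def rms_norm(y, eps=1e-6):
    return y / np.sqrt((y ** 2).mean(axis=-1, keepdims=True) + eps)

def hybrid_head(tokens, meta, params, window=None):
    """Run both heads in parallel on [meta; tokens] and fuse their outputs."""
    n_meta = meta.shape[0]
    x = np.concatenate([meta, tokens], axis=0)  # prepend meta tokens
    y_attn = attention_head(x, *params["attn"], window=window,
                            always_visible=n_meta)
    y_ssm = ssm_head(x, *params["ssm"])
    # Assumed fusion rule: average of the normalized head outputs.
    y = 0.5 * (rms_norm(y_attn) + rms_norm(y_ssm))
    return y[n_meta:]  # drop the meta-token positions from the output

def make_params(rng, d=8, d_head=4, n_state=6):
    """Random illustrative weights; sizes are arbitrary."""
    attn = tuple(rng.normal(size=(d, d_head)) / np.sqrt(d) for _ in range(3))
    ssm = (rng.uniform(0.5, 0.99, size=n_state),            # A: per-state decay
           rng.normal(size=(n_state, d)) / np.sqrt(d),       # B: input projection
           rng.normal(size=(d_head, n_state)) / np.sqrt(n_state))  # C: readout
    return {"attn": attn, "ssm": ssm}

rng = np.random.default_rng(0)
params = make_params(rng)
meta = rng.normal(size=(2, 8))    # 2 meta tokens (learned in practice)
tokens = rng.normal(size=(5, 8))  # 5 prompt tokens
out = hybrid_head(tokens, meta, params, window=3)
print(out.shape)
```

In this sketch the attention path sees only recent tokens plus the meta tokens, while the recurrent path carries a fixed-size summary of the whole prefix, which mirrors the recall-versus-summarization split described in the abstract.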

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
