LG · CL · Feb 29, 2024

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

DeepMind
arXiv:2402.19427v1 · 239 citations · h-index: 79
Originality: Highly original
AI Analysis

This work targets the cost of training and running inference on large language models, a problem of direct interest to AI practitioners. Its hybrid architecture is a novel approach rather than an incremental refinement of existing designs.

The authors propose Griffin, a hybrid model that mixes gated linear recurrences with local attention. Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens, and during inference it achieves lower latency and higher throughput.

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
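The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. This is a generic diagonal gated linear recurrence and a generic sliding-window causal attention, written under assumptions of our own. It is not the paper's exact layer definitions, and it does not show how Griffin interleaves the two. All function names, parameter names, and the specific gating form are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_linear_recurrence(x, w_gate, b_gate, decay_logit):
    """Generic diagonal gated linear recurrence (illustrative assumption,
    not the paper's exact recurrent layer).

    x: (seq_len, dim) inputs.
    w_gate: (dim, dim) and b_gate: (dim,) parameterize an input-dependent gate.
    decay_logit: (dim,) logits of a per-channel base decay.
    Returns the hidden states, shape (seq_len, dim).
    """
    seq_len, dim = x.shape
    base_decay = sigmoid(decay_logit)            # per-channel decay in (0, 1)
    h = np.zeros(dim)                            # fixed-size state, independent of seq_len
    states = np.empty_like(x, dtype=float)
    for t in range(seq_len):
        gate = sigmoid(x[t] @ w_gate + b_gate)   # input-dependent gate in (0, 1)
        a_t = base_decay ** gate                 # effective decay for this step
        # The update is linear in h: no nonlinearity is applied to the state.
        h = a_t * h + (1.0 - a_t) * x[t]
        states[t] = h
    return states

def local_causal_mask(seq_len, window):
    """Position i may attend to position j iff i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def local_attention(q, k, v, window):
    """Single-head causal attention restricted to a sliding window."""
    seq_len, dim = q.shape
    scores = (q @ k.T) / np.sqrt(dim)
    scores = np.where(local_causal_mask(seq_len, window), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8))
    h = gated_linear_recurrence(x, rng.normal(size=(8, 8)), np.zeros(8), np.zeros(8))
    y = local_attention(x, x, x, window=4)
    print(h.shape, y.shape)
```

In this sketch the recurrence carries a state of fixed size regardless of sequence length, and the attention mask lets each position see at most `window` recent positions. Both properties fit the abstract's claims about fast inference on long sequences, though the sketch makes no claim about the paper's measured efficiency.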

Code Implementations: 4 repos
