LG Jun 13, 2024

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji

arXiv:2406.09297v3 23.520 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses memory efficiency when deploying large transformer models at scale, and is an incremental improvement over existing methods such as MQA and GQA.

The paper tackles the memory bottleneck in transformer decoding by introducing Multi-Layer Key-Value (MLKV) sharing, which reduces KV cache size by up to 6x compared to Multi-Query Attention with minimal performance loss.

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6 compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
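As a rough illustration of how sharing KV heads across layers shrinks the cache, the sketch below counts cached key and value entries for a model with Pythia-160M's shape (12 layers, 12 attention heads, head dimension 64). The function name, the 2-byte (fp16) element size, the example batch and sequence sizes, and the choice of 2 total KV heads for the MLKV variant are illustrative assumptions, not details stated in the abstract.

```python
def kv_cache_bytes(total_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    total_kv_heads is the number of distinct KV heads summed over ALL layers.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * total_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Shape of Pythia-160M: 12 layers, 12 heads, hidden size 768 (head_dim = 64).
LAYERS, HEADS, HEAD_DIM = 12, 12, 64

# Total KV heads across the whole model for each sharing scheme.
total_kv_heads = {
    "MHA": LAYERS * HEADS,  # every query head has its own KV head
    "MQA": LAYERS * 1,      # one KV head per layer
    "MLKV-2": 2,            # assumption: only 2 KV heads, shared across layers
}

# Hypothetical workload, chosen only for illustration.
SEQ_LEN, BATCH = 2048, 8

sizes = {
    name: kv_cache_bytes(n, HEAD_DIM, SEQ_LEN, BATCH)
    for name, n in total_kv_heads.items()
}

for name, size in sizes.items():
    print(f"{name:8s} {size / 2**20:8.1f} MiB")

print("MQA / MLKV-2 ratio:", sizes["MQA"] / sizes["MLKV-2"])
```

Under these assumptions the cache size scales linearly with the total number of KV heads, so going from 12 heads (MQA, one per layer) to 2 heads shared across layers gives the sixfold reduction the abstract mentions.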
