CL · Mar 14, 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

arXiv:2403.09636v2 · 114 citations · ICML
Originality: Incremental advance
AI Analysis

The paper addresses the memory bottleneck of LLM inference, enabling longer contexts and larger batches, though the advance is incremental because it retrofits existing models.

The paper tackles the cost of storing key-value caches during large language model inference by proposing Dynamic Memory Compression (DMC), which compresses the cache online. DMC achieves up to a 7x throughput increase on an NVIDIA H100 GPU and preserves downstream performance at up to 4x cache compression.

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.
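The linear scaling of the cache with sequence length and batch size can be made concrete with a rough estimate. The sketch below is not from the paper: the helper `kv_cache_bytes` is hypothetical, the shape values (32 layers, 32 KV heads, head dimension 128) are assumed from the public Llama 2 7B configuration, and 16-bit storage is assumed. It also treats compression as one uniform average ratio, which simplifies DMC, where the ratio is learned separately per head and layer.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2, compression_ratio=1.0):
    """Rough KV cache size in bytes (hypothetical helper, not from the paper).

    The leading factor 2 counts one key and one value vector per token.
    compression_ratio is an assumed uniform average; DMC itself learns
    different ratios in different heads and layers.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / compression_ratio


# Assumed Llama 2 7B-like shapes, stored in 16-bit precision.
LLAMA2_7B = dict(n_layers=32, n_kv_heads=32, head_dim=128)
GIB = 1024 ** 3

baseline = kv_cache_bytes(**LLAMA2_7B, seq_len=4096, batch_size=8)
compressed = kv_cache_bytes(**LLAMA2_7B, seq_len=4096, batch_size=8,
                            compression_ratio=4.0)

print(f"uncompressed cache: {baseline / GIB:.1f} GiB")    # 16.0 GiB
print(f"4x compressed cache: {compressed / GIB:.1f} GiB")  # 4.0 GiB
```

Under these assumptions, a 4x average compression frees 12 GiB at batch size 8 and 4096 tokens, which could instead hold a batch roughly four times larger or a context roughly four times longer within the same budget.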

Code Implementations: 1 repo