CL AI

Mar 13, 2025

KV-Distill: Nearly Lossless Learnable Context Compression for LLMs

Vivek Chari, Guanghui Qin, Benjamin Van Durme

Microsoft

arXiv:2503.10337v1

17.612 citations

h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses memory efficiency for large language models, enabling longer contexts at reduced computational cost. It is an incremental advance, since it builds on existing compression and distillation techniques.

The paper tackles the problem of high GPU memory usage from KV caches in Transformers during long-context generation by introducing KV-Distill, a framework that compresses these caches into shorter representations, achieving up to 99% length reduction while preserving performance in tasks like question answering and summarization.
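To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how KV cache size grows with context length. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values, 8192 tokens) are illustrative assumptions, not figures from the paper. The sketch also assumes that a compressed cache stores each retained position in the same layout as an uncompressed one.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing one key and one value
    vector per token, per head, per layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model configuration (an assumption, not taken from the paper).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2)

context_len = 8192
full = kv_cache_bytes(context_len, **config)

# A 99% length reduction keeps roughly 1% of the cached positions.
kept_tokens = round(context_len * 0.01)
compressed = kv_cache_bytes(kept_tokens, **config)

print(f"full cache:       {full / 2**20:.0f} MiB for {context_len} tokens")
print(f"compressed cache: {compressed / 2**20:.0f} MiB for {kept_tokens} tokens")
```

Because the estimate is linear in `num_tokens`, shortening the cached sequence reduces memory by the same proportion, which is why length reduction translates directly into GPU memory savings.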

Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations (stored in the so-called KV cache) account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
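The student-teacher matching described in the abstract can be illustrated with a minimal sketch. The abstract only says "a KL-type divergence", so the forward KL used here, the averaging over positions, and all function names are assumptions for illustration rather than the paper's actual objective. The teacher logits stand in for next-token predictions made with the uncompressed cache, and the student logits for predictions made with the compressed cache.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over output positions.

    Each argument is a list of per-position logit vectors over the vocabulary.
    Assumption: forward KL averaged over positions; the paper's exact
    divergence may differ.
    """
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy example: two output positions, vocabulary of three tokens.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student = [[1.5, 0.2, -1.0], [0.0, 1.0, 0.0]]

loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher, student)
print(loss_same, loss_diff)
```

The loss is zero when the compressed cache yields the same output distributions as the uncompressed one and grows as they diverge, so minimizing it pushes the compressed representation to preserve the model's predictions.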
