CL · AI · Oct 20, 2024

Lossless KV Cache Compression to 2%

Tencent
arXiv:2410.15252v1 · 5 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses a bottleneck in efficient inference for large language models. It offers a practical solution whose advance is incremental, since it integrates existing techniques rather than introducing new ones.

The paper tackles the problem of large key-value (KV) cache memory in large language models by introducing Cross-Layer Latent Attention (CLLA), which compresses the KV cache to less than 2% of its original size while maintaining comparable performance levels.

Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.
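To make the "less than 2%" figure concrete, the sketch below shows how the three kinds of reduction named in the abstract (head/dimension reduction, layer sharing, and quantization) multiply together. Every number in it is a hypothetical assumption chosen for illustration, not the configuration reported in the paper, and the helper `kv_cache_bytes` is an invented name. It also ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, kv_width, seq_len, bits_per_value, layer_share=1):
    """Return the cache size in bytes for one sequence.

    kv_width: values cached per token per layer (keys and values combined).
    layer_share: how many adjacent layers reuse a single cached entry.
    """
    cached_layers = num_layers // layer_share
    total_bits = cached_layers * kv_width * seq_len * bits_per_value
    return total_bits // 8


# Hypothetical baseline: 32 layers, 32 heads of dimension 128,
# separate keys and values (factor 2), stored at 16 bits.
baseline = kv_cache_bytes(
    num_layers=32, kv_width=2 * 32 * 128, seq_len=4096, bits_per_value=16
)

# Hypothetical compressed setup: one 512-wide latent per token instead of
# full keys and values, shared across pairs of layers, stored at 4 bits.
compressed = kv_cache_bytes(
    num_layers=32, kv_width=512, seq_len=4096, bits_per_value=4, layer_share=2
)

ratio = compressed / baseline
print(f"baseline:   {baseline / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB")
print(f"ratio:      {ratio:.2%}")
```

Under these assumed settings the width shrinks by 16x, layer sharing halves the number of cached layers, and 4-bit storage cuts the bits per value by 4x, for a combined 128x reduction. No single factor reaches 2% on its own, which is why combining the techniques matters.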
