CL · Dec 17, 2024

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Peking U
arXiv:2412.12706v2 · 7 citations · h-index: 15 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the inference memory bottleneck for LLM users by proposing an incremental optimization strategy for KV cache compression.

The paper tackles the memory bottleneck of KV cache in large language models by exploring the trade-off between token count and precision in compression, finding that storing more tokens with lower precision (quantized pruning) significantly enhances long-context performance, with substantial improvements in retrieval-related tasks and consistent effectiveness across various conditions.

As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. However, these works leave the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache with lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning demonstrates notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
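The token-precision trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the importance scores, the min-max quantizer, and the figure of 256 values per token are all assumptions made for illustration. Under a fixed memory budget, halving the bit width doubles the number of tokens that fit, and quantized pruning then keeps the highest-scoring tokens and stores them at the lower precision.

```python
def tokens_within_budget(budget_bits, bits_per_value, values_per_token):
    """Tokens that fit in the budget (ignores scale/zero-point overhead)."""
    return budget_bits // (bits_per_value * values_per_token)


def quantize(values, bits):
    """Uniform min-max quantization of one vector to integer codes.

    A generic quantizer chosen for illustration, not the paper's scheme.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


def quantized_pruning(tokens, scores, keep, bits):
    """Keep the `keep` highest-scoring tokens, then quantize each one.

    `scores` is a hypothetical per-token importance signal; the source
    does not specify how pruning methods compute it.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original token order
    return {i: quantize(tokens[i], bits) for i in kept}


# Hypothetical budget: 512 tokens at 16 bits, 256 K/V values per token.
VALUES_PER_TOKEN = 256
BUDGET_BITS = 512 * 16 * VALUES_PER_TOKEN
for bits in (16, 8, 4):
    print(bits, tokens_within_budget(BUDGET_BITS, bits, VALUES_PER_TOKEN))
```

With these assumed numbers, the same budget holds 512 tokens at 16 bits, 1024 at 8 bits, and 2048 at 4 bits. The sketch ignores the storage needed for per-vector scales and offsets, which a real quantized cache would have to account for.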

