CL, Dec 17, 2024

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li

Peking U

arXiv:2412.12706v26.17 citations, h-index: 15, Has Code, EMNLP

Originality Incremental advance

AI Analysis

This work addresses the inference memory bottleneck for LLM users by proposing an incremental optimization strategy for KV cache compression.

The paper tackles the memory bottleneck of KV cache in large language models by exploring the trade-off between token count and precision in compression, finding that storing more tokens with lower precision (quantized pruning) significantly enhances long-context performance, with substantial improvements in retrieval-related tasks and consistent effectiveness across various conditions.

As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. However, these works leave the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache with lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning demonstrates notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
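
The token-precision trade-off described in the abstract can be illustrated with simple budget arithmetic. The sketch below is not taken from the paper or its repository: the function names and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and it ignores the extra storage that quantization needs for scales and zero-points. Under those assumptions, it shows how many tokens fit in a fixed KV cache budget at different bit widths.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bits: int) -> int:
    """Bytes needed to cache one token's keys and values across all layers.

    Assumption: no overhead for quantization scales or zero-points.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim  # factor 2: keys and values
    return elements * bits // 8


def tokens_for_budget(budget_bytes: int, num_layers: int, num_kv_heads: int,
                      head_dim: int, bits: int) -> int:
    """Number of whole tokens that fit in the budget at the given precision."""
    return budget_bytes // kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits)


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Fix the budget to what 512 tokens cost at 16-bit precision.
budget = 512 * kv_bytes_per_token(bits=16, **SHAPE)

for bits in (16, 8, 4):
    kept = tokens_for_budget(budget, bits=bits, **SHAPE)
    print(f"{bits:>2}-bit: {kept} tokens within {budget} bytes")
```

In this idealized accounting, halving the precision doubles the number of tokens that can be kept, which is the sense in which the two dimensions trade off against each other. Whether the extra tokens outweigh the precision loss is the empirical question the paper investigates.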
