LG AI | Mar 31, 2025

Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

arXiv:2503.24000v1 | 15.76 citations | h-index: 13 | Has Code | MLSys

Originality: Synthesis-oriented

AI Analysis

This work addresses practical deployment challenges for LLM serving systems, but it is incremental as it reviews and benchmarks existing methods rather than proposing a new solution.

The paper tackles the problem of optimizing Large Language Model (LLM) serving by re-evaluating Key-Value cache compression techniques, finding that current implementations lead to suboptimal throughput and increased latency despite reducing memory consumption.

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream KV cache compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for KV cache compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative KV cache compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing KV cache can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing KV cache may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in KV cache compression when handling specific LLM tasks. Third, we provide tools to shed light on future KV cache compression studies and facilitate their practical deployment in production. They are open-sourced at https://github.com/LLMkvsys/rethink-kv-compression.
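The two effects named in the abstract (lower memory consumption, but possibly longer outputs) can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper or its repository: the function names, the model shape (32 layers, 32 KV heads, head dimension 128), and the timing numbers are all hypothetical, and the size formula ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Approximate KV cache size in bytes.

    One key and one value vector are stored per layer, per KV head,
    per token, hence the leading factor of 2.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


def end_to_end_latency(prefill_s, per_token_s, output_tokens):
    """Simple latency model: prefill time plus one decode step per output token."""
    return prefill_s + per_token_s * output_tokens


# Hypothetical model shape and a 4096-token context (assumptions, not from the paper).
fp16_bytes = kv_cache_bytes(32, 32, 128, seq_len=4096, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 32, 128, seq_len=4096, bits_per_value=4)
print(f"FP16 KV cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit KV cache: {int4_bytes / 2**30:.2f} GiB")

# Hypothetical timings: compression makes each decode step slightly faster,
# but the model produces a longer answer, so total latency still goes up.
baseline = end_to_end_latency(prefill_s=0.5, per_token_s=0.020, output_tokens=200)
compressed = end_to_end_latency(prefill_s=0.5, per_token_s=0.018, output_tokens=260)
print(f"baseline: {baseline:.2f} s, compressed: {compressed:.2f} s")
```

Under these assumed numbers the cache shrinks by 4x while end-to-end latency rises, which is the kind of trade-off the abstract says per-token or memory-only measurements can miss.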

