LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
LeanK targets the GPU memory and speed bottlenecks of LLM decoding for users handling long contexts. It is an incremental improvement that relies on specific hardware optimizations.
The paper tackles the efficiency challenge of large language models (LLMs) in long-context tasks by proposing LeanK, a learning-based method that prunes unimportant key (K) cache channels, resulting in up to 70% K cache and 16%-18% V cache memory reduction and a 1.3x speedup in attention computation.
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specified sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory usage and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
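To make the idea of a static channel mask concrete, the following is a minimal NumPy sketch, not the authors' implementation or their custom kernel. It assumes a per-channel importance vector is already available (here a random stand-in for learned scores), keeps the most important channels of one attention head, and rounds the kept count up to a multiple of 8 as an assumed example of a hardware alignment constraint. The names `build_static_mask` and `pruned_scores`, the alignment value, and the keep ratio are all hypothetical.

```python
import numpy as np

def build_static_mask(importance, keep_ratio, align=8):
    """Return sorted indices of the channels to keep.

    The kept count is rounded up to a multiple of `align`
    (an assumed alignment constraint, for illustration only).
    """
    d = importance.shape[0]
    k = int(np.ceil(keep_ratio * d / align) * align)
    k = min(max(k, align), d)
    # Take the k highest-importance channels, then sort the indices
    # so the pruned cache keeps the original channel order.
    return np.sort(np.argsort(importance)[-k:])

def pruned_scores(q, k_cache_pruned, keep, head_dim):
    """Attention logits using only the retained K channels.

    q: (head_dim,), k_cache_pruned: (seq_len, len(keep)).
    The query is sliced with the same static indices at decode time.
    """
    return k_cache_pruned @ q[keep] / np.sqrt(head_dim)

rng = np.random.default_rng(0)
head_dim, seq_len = 128, 1024

# Stand-in for learned channel importance (hypothetical values).
importance = rng.random(head_dim)
keep = build_static_mask(importance, keep_ratio=0.3, align=8)

k_full = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
k_pruned = k_full[:, keep]  # only these channels are stored in the cache
q = rng.standard_normal(head_dim).astype(np.float32)

scores = pruned_scores(q, k_pruned, keep, head_dim)
saved = 1 - k_pruned.nbytes / k_full.nbytes  # fraction of K cache memory saved
```

Because the mask is static, the index set is computed once and reused for every decoding step, so the pruned channels never need to be written to the cache. In this sketch, requesting 30% of 128 channels with an alignment of 8 retains 40 channels, so the K cache for this head shrinks by about 69%.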