LG · AI · CL · Aug 4, 2025

LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

Microsoft
arXiv:2508.02215v1 · 4 citations · h-index: 18 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

LeanK addresses the GPU memory and speed bottlenecks of LLM decoding over long contexts. It is an incremental improvement whose gains depend on hardware-specific optimizations, such as alignment-aware pruning and a custom decoding kernel.

The paper tackles the efficiency challenge of large language models (LLMs) in long-context tasks by proposing LeanK, a learning-based method that prunes unimportant key (K) cache channels, resulting in up to 70% K cache and 16%-18% V cache memory reduction and a 1.3x speedup in attention computation.

Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specified sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
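To make the idea of a static channel mask concrete, here is a minimal NumPy sketch for a single attention head. It is not the paper's implementation: the function names (`build_static_mask`, `pruned_scores`), the random `importance` scores standing in for learned ones, the `keep_ratio` of 0.3, and the alignment of 8 channels are all assumptions chosen for illustration. The sketch only shows why a fixed mask lets the K cache store fewer channels: the query is sliced with the same indices, so the dot product over retained channels equals the masked full-width dot product.

```python
import numpy as np

def build_static_mask(importance, keep_ratio, align=8):
    """Pick the highest-scoring channels, rounding the count up to a multiple of `align`.

    `importance` is a hypothetical per-channel score vector; in LeanK such
    scores would be learned, here they are just given as input.
    """
    head_dim = importance.shape[0]
    n_keep = int(np.ceil(keep_ratio * head_dim / align) * align)
    n_keep = min(max(n_keep, align), head_dim)
    # Indices of the n_keep largest scores, sorted for contiguous-friendly access.
    return np.sort(np.argsort(importance)[-n_keep:])

def pruned_scores(q, k_pruned, keep_idx, head_dim):
    """Attention logits computed from retained channels only."""
    return (q[keep_idx] @ k_pruned.T) / np.sqrt(head_dim)

rng = np.random.default_rng(0)
head_dim, seq_len = 64, 128

importance = rng.random(head_dim)            # placeholder for learned scores
keep_idx = build_static_mask(importance, keep_ratio=0.3, align=8)

k_full = rng.standard_normal((seq_len, head_dim))
k_pruned = k_full[:, keep_idx]               # what a pruned K cache would store
q = rng.standard_normal(head_dim)

scores = pruned_scores(q, k_pruned, keep_idx, head_dim)
k_memory_saved = 1.0 - k_pruned.shape[1] / head_dim
```

Because the mask is static, `keep_idx` can be computed once offline and reused for every decoding step; the pruned channels are never written to the cache. Rounding the kept-channel count to a multiple of `align` is one simple way to model a hardware alignment constraint, though the actual constraint used by the authors' kernel may differ.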
