BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
This work addresses GPU memory constraints in deploying long-context LLMs. It offers a domain-specific optimization that builds incrementally on prior KV-cache eviction and compression methods.
The paper tackles GPU memory inefficiency in Large Language Model (LLM) inference, which arises because Key-Value (KV) caches grow linearly with context length. It introduces BaKlaVa, which allocates an optimal memory budget to each individual KV-cache. The method achieves up to a 70% compression ratio while maintaining baseline performance, and up to an order-of-magnitude accuracy improvement at higher compression levels.
In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they cause GPU memory usage to grow linearly with context length. While recent work explores KV-cache eviction and compression policies to reduce memory usage, it often treats KV-caches uniformly across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
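To make the idea of non-uniform budgets concrete, here is a minimal Python sketch of how a fixed total cache budget could be divided among KV-caches once an importance score per cache is available. The abstract does not say how BaKlaVa estimates importance or turns scores into budgets, so the proportional rule, the per-cache floor, and the names `allocate_budgets`, `importance`, `total_budget`, and `min_budget` are all assumptions made for illustration, not the paper's method.

```python
def allocate_budgets(importance, total_budget, min_budget=1):
    """Split `total_budget` cache slots across KV-caches by importance.

    Hypothetical helper: proportional allocation with a per-cache floor.
    This rule is an assumption; the source does not specify one.
    """
    n = len(importance)
    if n == 0:
        raise ValueError("importance must not be empty")
    if total_budget < n * min_budget:
        raise ValueError("total_budget is too small for the per-cache minimum")
    total_importance = sum(importance)
    if total_importance <= 0:
        raise ValueError("importance scores must sum to a positive value")

    # Every cache gets the floor; the remainder is shared by importance.
    spare = total_budget - n * min_budget
    shares = [spare * w / total_importance for w in importance]
    budgets = [min_budget + int(s) for s in shares]

    # Rounding down may leave a few slots unassigned; hand them out to the
    # caches with the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Made-up importance scores for four KV-caches sharing 80 token slots.
    scores = [4, 2, 1, 1]
    print(allocate_budgets(scores, total_budget=80, min_budget=4))
    # A uniform policy would instead give each cache 80 // 4 = 20 slots.
```

The floor keeps low-importance caches from being starved entirely, while the total stays fixed, so the comparison against a uniform policy is made at equal memory cost.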