LG · Dec 12, 2024

Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

arXiv:2412.08890v1 · 16 citations · h-index: 40 · ICML
Originality: Highly original
AI Analysis

This paper addresses memory efficiency in LLM deployment, offering a novel KV cache compression method that outperforms existing techniques such as quantization and token eviction.

The paper compresses the key-value (KV) cache of large language models to reduce memory usage. On GSM8K it retains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, and it achieves up to 1.7x better compression than 2-bit quantization in low-memory regimes.

We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated using sparse linear combinations of atoms from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
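The sparse-approximation step described in the abstract can be illustrated with a minimal orthogonal matching pursuit (OMP) sketch in NumPy. This is not the authors' implementation: the function names `omp` and `reconstruct` are hypothetical, the dictionary is random rather than learned, and the vector dimension of 128 is an assumed head dimension. Only the dictionary size of roughly 4k atoms comes from the abstract. The sparsity `s` is the knob that trades reconstruction error against storage, since each vector is kept as `s` indices plus `s` coefficients.

```python
import numpy as np


def omp(D, x, s, tol=1e-10):
    """Greedy OMP: approximate x with at most s columns (atoms) of D.

    D: (d, N) dictionary, columns assumed to have unit norm.
    x: (d,) vector to compress (e.g. one key or value vector).
    Returns (indices, coeffs) such that D[:, indices] @ coeffs ~ x.
    """
    residual = x.astype(float).copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(s):
        # Pick the atom most correlated with the current residual.
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never reselect an atom
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
        if np.linalg.norm(residual) < tol:
            break
    return np.array(support), coeffs


def reconstruct(D, indices, coeffs):
    """Rebuild the dense approximation from the sparse code."""
    return D[:, indices] @ coeffs


# Illustrative setup (assumptions): random unit-norm dictionary, d=128, N=4096.
rng = np.random.default_rng(0)
d, N = 128, 4096
D = rng.standard_normal((d, N))
D /= np.linalg.norm(D, axis=0, keepdims=True)

x = rng.standard_normal(d)
for s in (4, 16):
    idx, c = omp(D, x, s)
    rel_err = np.linalg.norm(x - reconstruct(D, idx, c)) / np.linalg.norm(x)
    print(f"s={s}: stored {len(idx)} (index, coeff) pairs, relative error {rel_err:.3f}")
```

Because the support only grows and the coefficients are refit at every step, the reconstruction error never increases as `s` grows. A random vector against a random dictionary compresses poorly; the abstract's claim is that real KV vectors are far more compressible under a learned universal dictionary.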

Code Implementations: 1 repo
