CL · Mar 17

VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

arXiv:2603.16435 · 90.3 · h-index: 8
AI Analysis

VQKV addresses the KV cache memory bottleneck that limits LLM deployment in resource-limited settings, and represents a strong incremental improvement over prior compression methods.

The paper tackles the growth of the KV cache in Large Language Models (LLMs), which limits deployment in resource-limited environments. It proposes VQKV, a training-free vector quantization method that achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of baseline performance on LongBench and enabling 4.3x longer generation length.

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches to KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to achieve high compression ratios and high reconstruction fidelity at the same time. We propose VQKV, a novel, training-free method that introduces vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity; it can represent thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length within the same memory footprint.
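To make the idea of replacing floating-point values with integer indices concrete, here is a minimal sketch of generic vector quantization using a plain k-means codebook. It is not the VQKV algorithm, whose details are not given in this abstract. The data is random, and the function names, the sub-vector width of 8, the codebook size of 256, and the fp16 storage accounting are all assumptions chosen for illustration.

```python
import numpy as np

def encode(x, codebook):
    """Return the index of the nearest codebook row (squared L2) for each row of x."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).astype(np.uint8)  # uint8 is enough for <= 256 entries

def decode(codes, codebook):
    """Reconstruct approximate vectors by looking up each index in the codebook."""
    return codebook[codes]

def fit_codebook(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (illustrative only): returns a (k, d) codebook."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = encode(x, codebook)
        for j in range(k):
            members = x[codes == j]
            if len(members):  # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

# Hypothetical stand-in for cached key/value sub-vectors: 2048 vectors of width 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((2048, 8)).astype(np.float32)

codebook = fit_codebook(x, k=256)
codes = encode(x, codebook)      # one small integer per 8 floats
recon = decode(codes, codebook)  # lossy reconstruction
mse = float(np.mean((x - recon) ** 2))

# Storage accounting, assuming the original values would be stored as fp16 (2 bytes each).
orig_bytes = x.size * 2
compressed_bytes = codes.nbytes + codebook.size * 2  # indices plus fp16 codebook
ratio = 1 - compressed_bytes / orig_bytes

print(f"reconstruction MSE: {mse:.4f}")
print(f"compression ratio: {ratio:.2%}")
```

In this toy setup each index replaces only 8 values, and the codebook itself takes up a large share of the compressed size. As a rough reading of the headline number, and assuming "compression ratio" means the fraction of cache memory removed, an 82.8% ratio leaves 17.2% of the original cache, or about a 5.8x reduction in cache size.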
