CL | Dec 16, 2024

CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation

Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen

arXiv:2412.11741v1 | 1.91 citations | h-index: 18 | AAAI

Originality: Incremental advance

AI Analysis

This work addresses memory constraints in deploying long-context LLMs, though it appears incremental because it builds on existing KV cache optimization techniques.

The paper tackles the memory scalability challenge in large language models (LLMs) by proposing Cache Sparse Representation (CSR), which compresses the Key-Value cache into sparse indexes and weights, achieving performance comparable to state-of-the-art quantization algorithms while reducing memory usage.

Long-context applications of large language models (LLMs) pose significant scalability challenges, particularly in memory footprint. The Key-Value (KV) cache, which stores attention keys and values to avoid redundant computation, grows linearly with context length and can consume enough memory that a model fails to serve under limited memory resources. To address this issue, we propose Cache Sparse Representation (CSR), which transforms the dense KV cache tensor into sparse indexes and weights, a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a neural network-based method for automatically generating the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms while maintaining robust functionality in memory-constrained environments.
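To make the idea of "sparse indexes and weights" concrete, the following is a minimal sketch of encoding one dense vector as a few dictionary entries. It is not the paper's algorithm: it assumes plain greedy matching pursuit for the encoding, and a random unit-norm dictionary stands in for NeuralDict, whose details the abstract does not give. The function names (`sparse_encode`, `sparse_decode`) and the sizes (`dim`, `n_atoms`, `k`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def sparse_encode(vector, dictionary, k):
    """Greedy matching pursuit (an assumed stand-in, not the paper's method).

    Picks k dictionary atoms one at a time, each time choosing the atom most
    correlated with the current residual. Returns (indexes, weights).
    """
    residual = list(vector)
    indexes, weights = [], []
    for _ in range(k):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        w = scores[best]  # valid projection weight because atoms are unit-norm
        indexes.append(best)
        weights.append(w)
        residual = [r - w * a for r, a in zip(residual, dictionary[best])]
    return indexes, weights


def sparse_decode(indexes, weights, dictionary):
    """Rebuild an approximate dense vector from indexes and weights."""
    out = [0.0] * len(dictionary[0])
    for i, w in zip(indexes, weights):
        out = [o + w * a for o, a in zip(out, dictionary[i])]
    return out


# Hypothetical sizes; the dictionary is random, NOT learned as NeuralDict would be.
random.seed(0)
dim, n_atoms, k = 16, 64, 4
dictionary = [normalize([random.gauss(0, 1) for _ in range(dim)])
              for _ in range(n_atoms)]

key_vector = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for one cached key
indexes, weights = sparse_encode(key_vector, dictionary, k)
approx = sparse_decode(indexes, weights, dictionary)

error = math.sqrt(sum((a - b) ** 2 for a, b in zip(key_vector, approx)))
norm = math.sqrt(dot(key_vector, key_vector))
print(f"stored {k} (index, weight) pairs instead of {dim} dense values")
print(f"relative reconstruction error: {error / norm:.3f}")
```

In this sketch only the `k` indexes and `k` weights would be stored per vector, and the dense vector is rebuilt on demand. A random dictionary reconstructs poorly; a dictionary fitted to real key and value vectors, which is the role the abstract assigns to NeuralDict, would be expected to do better.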
