LG · DC · Aug 7, 2024

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference

arXiv:2408.04107v3 · 5 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in large language model inference for AI practitioners and represents an incremental improvement over existing KV compression techniques.

The paper tackles memory constraints in the Key-Value Cache (KVC) during LLM inference by proposing FDC, a fast KV dimensionality compression system. Compared to Palu, the existing KV dimensionality compression system, FDC reduces Job Completion Time (JCT) by up to 64% and delivers up to 1.97X throughput under the same latency, while maintaining 99% of the accuracy obtained without compression.

In large language models (LLMs), memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose FDC, a fast KV dimensionality compression system that eliminates the decompression overhead incurred in the existing KV dimensionality compression system, Palu, and reduces attention time. Moreover, FDC employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference to maximize overall compression while satisfying an accuracy loss constraint. Additionally, FDC enhances the attention kernel to balance the uneven workloads caused by the adaptive compression approach, further reducing attention computation latency. Comprehensive experiments demonstrate that, compared to Palu, FDC reduces Job Completion Time (JCT) by up to 64% and delivers up to 1.97X throughput under the same latency, while maintaining 99% of the accuracy obtained without compression. When state-of-the-art eviction and quantization methods are combined with FDC, they exhibit similar improvements compared to those combined with Palu. We open-sourced the code.
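
The adaptive compression idea in the abstract (different compression rates per head and layer, chosen to maximize compression under an accuracy loss constraint) can be illustrated with a small greedy sketch. This is not the procedure from the paper, whose details are not given here. The function name `choose_kept_dims`, the per-head loss tables, and the assumption that estimated losses add up across heads are all hypothetical and serve only to make the constraint concrete. Losses are expressed in integer basis points (100 = 1% accuracy loss) to avoid floating-point rounding.

```python
def choose_kept_dims(loss_tables, loss_budget):
    """Greedy sketch: pick a kept KV dimension per head under a loss budget.

    loss_tables[h] maps a candidate kept dimension to a (hypothetical)
    estimated accuracy loss for head h, in basis points.
    Assumes losses are additive across heads, which is a simplification.
    """
    # Candidate dimensions per head, from least to most compressed.
    levels = {h: sorted(t, reverse=True) for h, t in loss_tables.items()}
    idx = {h: 0 for h in loss_tables}  # current level index per head
    total = sum(loss_tables[h][levels[h][0]] for h in loss_tables)

    while True:
        best = None  # (extra_loss, head) of the cheapest feasible step
        for h in loss_tables:
            if idx[h] + 1 < len(levels[h]):
                cur = loss_tables[h][levels[h][idx[h]]]
                nxt = loss_tables[h][levels[h][idx[h] + 1]]
                delta = nxt - cur
                if total + delta <= loss_budget and (best is None or delta < best[0]):
                    best = (delta, h)
        if best is None:
            break  # no head can be compressed further within the budget
        delta, h = best
        idx[h] += 1
        total += delta

    return {h: levels[h][idx[h]] for h in loss_tables}, total


# Made-up loss estimates: head0 tolerates compression better than head1.
loss_tables = {
    "layer0.head0": {128: 0, 64: 10, 32: 40},
    "layer0.head1": {128: 0, 64: 60, 32: 200},
}
kept, total = choose_kept_dims(loss_tables, loss_budget=100)
print(kept, total)  # {'layer0.head0': 32, 'layer0.head1': 64} 100
```

In this toy setting the less sensitive head is compressed to 32 dimensions and the more sensitive one only to 64. That gives heads unequal amounts of attention work, which is the kind of workload imbalance the abstract says the enhanced attention kernel is meant to balance.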
