CL, Dec 7, 2024

Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression

arXiv:2412.05693v3
h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users deploying large language models in memory-constrained settings, representing an incremental improvement over existing KV cache compression methods.

The paper tackles the problem of limited GPU memory in LLM inference by compressing the KV cache during input processing, enabling larger batch sizes and achieving significantly higher throughput while maintaining original model accuracy.

Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
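The link between KV cache size and batch size can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the model dimensions, the 8 GiB memory budget, the prompt and generation lengths, and the 1024-token cache budget are all hypothetical values chosen for illustration, and the helper names are made up. It also ignores weights, activations, and any temporary memory needed while a prompt chunk is being processed.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored for one token of one sequence.

    The leading factor 2 accounts for one key and one value vector
    per layer and per KV head. bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_batch_size(free_bytes, tokens_kept_per_sequence, per_token_bytes):
    """Largest number of sequences whose KV caches fit in free_bytes."""
    return free_bytes // (tokens_kept_per_sequence * per_token_bytes)


# Hypothetical model shape (assumption, not from the paper).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

free_bytes = 8 * 1024**3   # assumed memory left for the KV cache: 8 GiB
prompt_len = 4096          # assumed input context length
gen_len = 256              # assumed generation length (shorter than the prompt)
cache_budget = 1024        # assumed number of prompt KV pairs kept after eviction

# Without compression during input processing, the full prompt is cached.
uncompressed_batch = max_batch_size(free_bytes, prompt_len + gen_len, per_token)

# With compression during input processing, at most cache_budget prompt
# KV pairs are held per sequence, plus the generated tokens.
compressed_batch = max_batch_size(free_bytes, cache_budget + gen_len, per_token)

print(per_token)           # 131072 bytes (128 KiB) per token
print(uncompressed_batch)  # 15
print(compressed_batch)    # 51
```

Under these assumed numbers, capping the prompt's KV pairs at 1024 per sequence lets roughly three times as many sequences share the same memory, which is the mechanism the abstract credits for the higher throughput. The actual gain depends on the eviction policy and on the memory overheads this sketch leaves out.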

