DC AI Dec 24, 2024

KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving

Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen

arXiv:2412.18169v57.36 citations h-index: 15

Originality: Highly original

AI Analysis

This paper addresses memory-management inefficiencies in LLM serving systems for latency-sensitive applications, offering a novel solution to a known bottleneck.

The paper tackles GPU memory throttling in LLM serving: under workload spikes, KVCache state exhausts GPU memory and requests queue, which causes high latency. It proposes a parameter-centric approach that selectively drops replicated parameters to free memory instantly, reducing tail TTFT by up to 72.2 times compared with state-of-the-art systems.

Serving LLMs with a cluster of GPUs is common nowadays, where the serving system must meet strict latency SLOs required by applications. However, the stateful nature of LLM serving requires maintaining huge states (i.e., KVCache) in limited GPU memory. Under spikes in real-world workloads, GPU memory can be easily throttled, leading to orders of magnitude higher response latency due to queuing introduced by waiting for KVCache to be reclaimed. Prior KVCache-centric approaches handle load throttling by dropping, migrating, or swapping KVCache. These methods fail to release sufficient memory quickly with requests still queued. This paper proposes the first parameter-centric approach to handling throttling by selectively dropping replicated parameters to instantly free memory for requests, based on an unnoticed observation that model parameters are commonly replicated across GPUs for serving LLMs. With additional memory, all requests can be served with a larger batch without queuing. To make the parameter-centric approach correct and efficient, we cooperatively execute requests on GPUs with a complete copy of parameters using pipeline parallelism, and derive an appropriate drop plan without unnecessary cooperation. We also design techniques to minimize the performance overhead due to pipeline parallelism with the execution patterns of requests under drop. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 72.2 times compared to the state-of-the-art systems including Llumnix, vLLM and InferCept.
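
The memory arithmetic behind dropping replicated parameters can be illustrated with a small sketch. This is a toy model, not KunServe's implementation: the names `Gpu`, `freed_per_gpu`, and `smallest_group_size` are hypothetical, and the sketch assumes that every GPU starts with a full parameter replica, that layers split evenly across a pipeline group, and that freed memory can be pooled cluster-wide. It picks the smallest pipeline group size that frees enough memory, loosely mirroring the idea of a drop plan "without unnecessary cooperation".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gpu:
    capacity_gb: float  # total GPU memory
    params_gb: float    # memory held by a full parameter replica
    kvcache_gb: float   # memory currently used by KVCache

    def free_gb(self) -> float:
        return self.capacity_gb - self.params_gb - self.kvcache_gb


def freed_per_gpu(params_gb: float, group_size: int) -> float:
    """Memory freed on each GPU when `group_size` replicas merge into one
    pipeline and each GPU keeps only 1/group_size of the layers
    (assumption: layers split evenly)."""
    return params_gb * (1 - 1 / group_size)


def smallest_group_size(gpus: list[Gpu], extra_kv_gb: float) -> Optional[int]:
    """Return the smallest group size (a divisor of len(gpus)) whose drop
    makes at least `extra_kv_gb` available cluster-wide, or None."""
    n = len(gpus)
    already_free = sum(g.free_gb() for g in gpus)
    for k in range(1, n + 1):
        if n % k:
            continue  # groups must partition the cluster evenly
        freed = sum(freed_per_gpu(g.params_gb, k) for g in gpus)
        if already_free + freed >= extra_kv_gb:
            return k  # k == 1 means no drop is needed
    return None


if __name__ == "__main__":
    # Hypothetical cluster: four 80 GB GPUs, 28 GB of parameters each,
    # 50 GB of KVCache already in use on each.
    cluster = [Gpu(80.0, 28.0, 50.0) for _ in range(4)]
    print(smallest_group_size(cluster, extra_kv_gb=60.0))
```

In this hypothetical cluster, pairing the GPUs into two-stage pipelines frees 14 GB per GPU, which together with the 2 GB already free on each covers a 60 GB spike. Larger groups free more memory but require more cooperation between GPUs, which is why the sketch returns the smallest group size that suffices.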
