LG, DC · Sep 25, 2024

No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Georgia Tech
arXiv:2409.17264v5 · 9 citations · h-index: 47
Originality: Highly original
AI Analysis

Medha addresses a critical performance bottleneck in deploying million-token LLMs for production workloads, improving system responsiveness and efficiency for users.

The paper tackles the problem of heterogeneous workloads in long-context LLM inference, where long requests stall short ones, and presents Medha, a serving system that improves throughput by 5.7x and reduces median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.

Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects in which long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallel, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, that reduces decode latency and preserves interactivity even with very long contexts. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline- and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.
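
The convoy effect and the idea of slack-based preemption can be illustrated with a toy simulation. This is a minimal sketch, not Medha's implementation: the abstract does not give the LARS formula, so the `relative_slack` definition, the length-proportional deadlines, the fixed chunk size, and all names below are assumptions made for illustration.

```python
from dataclasses import dataclass, replace

@dataclass
class Request:
    name: str
    arrival: float          # time the request enters the queue
    work: float             # total service time needed (arbitrary units)
    deadline: float         # absolute completion target (assumed: arrival + 2 * work)
    remaining: float = 0.0  # service time still owed

def relative_slack(req, now):
    # Assumed definition: spare time left, normalised by the request's own
    # time budget so that long and short requests are compared fairly.
    budget = req.deadline - req.arrival
    return (req.deadline - now - req.remaining) / budget

def simulate(requests, chunk=None):
    """Return {name: completion_time}.

    chunk=None  -> non-preemptive FCFS (each request runs to completion).
    chunk=float -> preemptive: every `chunk` time units, run the ready
                   request with the smallest relative slack.
    """
    pending = [replace(r, remaining=r.work) for r in requests]
    now, finished = 0.0, {}
    while pending:
        ready = [r for r in pending if r.arrival <= now]
        if not ready:
            now = min(r.arrival for r in pending)  # idle until next arrival
            continue
        if chunk is None:
            req = min(ready, key=lambda r: r.arrival)
            step = req.remaining
        else:
            req = min(ready, key=lambda r: relative_slack(r, now))
            step = min(chunk, req.remaining)
        now += step
        req.remaining -= step
        if req.remaining <= 1e-9:
            finished[req.name] = now
            pending.remove(req)
    return finished

workload = [
    Request("long", arrival=0.0, work=100.0, deadline=200.0),
    Request("short", arrival=0.5, work=1.0, deadline=2.5),
]
fcfs = simulate(workload)                # short waits behind long
chunked = simulate(workload, chunk=1.0)  # short preempts at a chunk boundary
print(fcfs)     # {'long': 100.0, 'short': 101.0}
print(chunked)  # {'short': 2.0, 'long': 101.0}
```

In this toy workload the short request finishes at time 101 under non-preemptive FCFS but at time 2 once the long request is split into chunks, while the long request is delayed by only one time unit. Normalising slack by each request's own budget is one way to keep long requests from starving, since their slack also shrinks as they wait.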

