LG, DC · Sep 25, 2024

No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

Georgia Tech

arXiv:2409.17264v5 · 15.79 citations · h-index: 39

Originality: Highly original

AI Analysis

Medha addresses a critical performance bottleneck in deploying million-token LLMs for production workloads, improving system responsiveness and efficiency for users.

The paper tackles the problem of heterogeneous workloads in long-context LLM inference, where long requests stall short ones, and presents Medha, a serving system that improves throughput by 5.7x and reduces median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.

Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects in which long-running requests stall short, interactive ones and degrade system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallel, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, that reduces decode latency and preserves interactivity even with very long contexts. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline- and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.
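
To make the convoy effect concrete, the toy simulation below compares non-preemptive first-come-first-served execution with preemption at chunk boundaries. It is only an illustrative sketch, not Medha's implementation: the function names, the job sizes, the fixed chunk size, and the plain round-robin order are all assumptions made for this example, and round-robin stands in for the LARS policy, whose details the abstract does not give.

```python
from collections import deque

def fcfs_finish_times(jobs):
    """Non-preemptive FCFS: each job runs to completion in arrival order.

    jobs is a list of (name, work_units); all jobs arrive at time 0.
    """
    clock, finish = 0, {}
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish

def chunked_finish_times(jobs, chunk):
    """Preemptive execution: run at most `chunk` units, then yield.

    Hypothetical round-robin stand-in for a real scheduling policy.
    """
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        step = min(chunk, remaining)
        clock += step
        remaining -= step
        if remaining:
            # Job is preempted at the chunk boundary and requeued.
            queue.append((name, remaining))
        else:
            finish[name] = clock
    return finish

# Assumed workload: one long request ahead of two short ones.
jobs = [("long", 1000), ("short_a", 10), ("short_b", 10)]
fcfs = fcfs_finish_times(jobs)
chunked = chunked_finish_times(jobs, chunk=100)

print("FCFS:   ", fcfs)
print("Chunked:", chunked)
```

In this toy setting the short requests finish at times 110 and 120 with chunked preemption instead of 1010 and 1020 under FCFS, while the long request finishes at 1020 instead of 1000. A real system must also bound the overhead of chunking and avoid starving long requests, which is the role the abstract assigns to Adaptive Chunking and the LARS scheduler.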
