DC, May 16

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

Boxiao Du, Boning Huangfu, Yizhou Luo, Chen Chen, Zijun Li, Minchen Yu, Xiaoyi Fan, Minyi Guo

arXiv:2605.16867

AI Analysis

For operators of LLM serving systems with heterogeneous GPU resources, GoodServe provides a practical solution to improve the proportion of requests that meet their latency SLOs.

GoodServe is a serving system for agentic LLM inferences on heterogeneous GPUs that routes requests so they meet their end-to-end latency requirements, improving goodput by up to 27.4% over existing routing methods.

Large Language Models (LLMs) play a central role in emerging agentic applications, where the timely completion of each entire inference is critical. Meanwhile, agentic LLM inferences are increasingly served on heterogeneous GPUs in operators' resource pools. Therefore, it is crucial to route incoming inference requests to appropriate GPUs so that their end-to-end latency requirements are satisfied whenever possible, thereby achieving high goodput. In this paper, we propose GoodServe, a goodput-optimized serving system for agentic inferences over heterogeneous resources. GoodServe performs inference routing in a predict-and-rectify manner. It estimates request output lengths and GPU serving status in a manner that is both accurate and practical. Based on information from both the demand and resource sides, it then makes high-quality routing decisions using a just-enough instance selection heuristic. It also periodically monitors the SLO-violation risks of active requests and triggers runtime request migrations to address unexpected dynamics. Our evaluations show that GoodServe improves goodput by up to 27.4% over existing routing methods.
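The abstract does not spell out the routing heuristic, so the following Python sketch is only one plausible reading of "predict-and-rectify" with "just-enough" instance selection, not the authors' implementation. It assumes a simple latency model (queueing delay plus predicted output tokens times per-token decode time) and interprets "just-enough" as picking the slowest instance that is still predicted to meet the SLO, which leaves faster GPUs free for tighter requests. All names here (`Instance`, `Request`, `route_just_enough`, `needs_migration`) and the fallback policy are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical view of one serving instance (resource side).
    name: str
    ms_per_output_token: float  # assumed decode speed of this GPU type
    queue_delay_ms: float       # assumed estimate of the current waiting time


@dataclass
class Request:
    # Hypothetical view of one inference request (demand side).
    predicted_output_tokens: int  # output-length prediction
    slo_ms: float                 # end-to-end latency requirement


def estimated_latency_ms(req: Request, inst: Instance) -> float:
    """Assumed latency model: queueing delay plus decode time."""
    return inst.queue_delay_ms + req.predicted_output_tokens * inst.ms_per_output_token


def route_just_enough(req: Request, instances: List[Instance]) -> Optional[Instance]:
    """Predict step: pick the slowest instance still predicted to meet the SLO."""
    feasible = [i for i in instances if estimated_latency_ms(req, i) <= req.slo_ms]
    if feasible:
        return max(feasible, key=lambda i: estimated_latency_ms(req, i))
    # Assumed fallback: no instance is predicted to meet the SLO, so take the fastest.
    return min(instances, key=lambda i: estimated_latency_ms(req, i), default=None)


def needs_migration(req: Request, inst: Instance, elapsed_ms: float,
                    tokens_so_far: int, revised_total_tokens: int) -> bool:
    """Rectify step: flag a running request whose projected finish time exceeds its SLO."""
    remaining = max(revised_total_tokens - tokens_so_far, 0)
    projected = elapsed_ms + remaining * inst.ms_per_output_token
    return projected > req.slo_ms


fast = Instance("fast-gpu", ms_per_output_token=10.0, queue_delay_ms=0.0)
slow = Instance("slow-gpu", ms_per_output_token=30.0, queue_delay_ms=100.0)

loose = Request(predicted_output_tokens=100, slo_ms=4000.0)
tight = Request(predicted_output_tokens=100, slo_ms=2000.0)

loose_choice = route_just_enough(loose, [fast, slow])
tight_choice = route_just_enough(tight, [fast, slow])
```

In this toy setup, the loosely constrained request is placed on the slower GPU because it is still predicted to finish in time, while the tighter request needs the faster one. A periodic monitor could call `needs_migration` with a revised output-length estimate and move the request when it returns `True`.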
