DC · LG · Apr 5, 2025

SLOs-Serve: Optimized Serving of Multi-SLO LLMs

arXiv:2504.08784v1 · 14 citations · h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of efficiently serving LLM applications with diverse SLO requirements, such as summarization and coding. The advance is incremental, since it builds on existing serving systems.

The paper tackles the problem of serving multi-stage large language model requests with specific service level objectives by introducing SLOs-Serve, which customizes token allocation to meet these SLOs, resulting in a 2.2x average improvement in per-GPU serving capacity compared to prior state-of-the-art systems.

This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements. SLOs-Serve uses a multi-SLO dynamic programming-based algorithm to continuously optimize token allocations under SLO constraints by exploring the full design space of chunked prefill and (optional) speculative decoding. Leveraging this resource planning algorithm, SLOs-Serve effectively supports multi-SLOs and multi-replica serving with dynamic request routing while being resilient to bursty arrivals. Our evaluation across 6 LLM application scenarios (including summarization, coding, chatbot, tool calling, and reasoning) demonstrates that SLOs-Serve improves per-GPU serving capacity by 2.2x on average compared to prior state-of-the-art systems.
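To make the idea of planning token allocations under SLO constraints concrete, here is a minimal sketch of a much simpler, related problem. It is not the SLOs-Serve algorithm. It assumes a fixed per-step token budget, a fixed share of that budget reserved for ongoing decodes, and time-to-first-token deadlines measured in batch steps. The `Request` type, the `max_admitted` function, and all parameter names are hypothetical. Under those assumptions, a classic deadline-ordered dynamic program finds how many requests can have their chunked prefill finished on time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    prefill_tokens: int  # prompt tokens to process before the first output token
    ttft_deadline: int   # assumed unit: number of batch steps allowed for prefill


def max_admitted(requests: List[Request], tokens_per_step: int,
                 decode_reserved: int) -> int:
    """Largest number of requests whose prefill can finish by its deadline.

    Toy model: every step has `tokens_per_step` tokens, of which
    `decode_reserved` are set aside for decoding; the remainder can be split
    freely across prefills (chunked prefill).
    """
    prefill_budget = tokens_per_step - decode_reserved
    if prefill_budget <= 0:
        return 0

    # Serving admitted requests in deadline order is optimal in this model.
    ordered = sorted(requests, key=lambda r: r.ttft_deadline)

    inf = float("inf")
    # best[j] = fewest prefill tokens needed to admit j of the requests seen so far
    best = [0] + [inf] * len(ordered)
    for req in ordered:
        capacity = prefill_budget * req.ttft_deadline  # tokens available by the deadline
        # Iterate j downward so each request is counted at most once.
        for j in range(len(ordered), 0, -1):
            candidate = best[j - 1] + req.prefill_tokens
            if candidate <= capacity and candidate < best[j]:
                best[j] = candidate

    return max(j for j, used in enumerate(best) if used != inf)


if __name__ == "__main__":
    demo = [Request(400, 1), Request(300, 2), Request(600, 2), Request(200, 3)]
    print(max_admitted(demo, tokens_per_step=512, decode_reserved=112))
```

The sketch ignores per-stage SLOs, speculative decoding, multi-replica routing, and bursty arrivals, all of which the abstract says the real system handles.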

