DC AIAug 21, 2025

HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang

arXiv:2508.15919v22 citationsh-index: 4

Originality Incremental advance

AI Analysis

This addresses the problem of efficient and cost-effective LLM serving for users requiring diverse SLOs, representing a strong domain-specific improvement.

The paper tackles the challenge of serving large language models with variable requests and multiple service-level objectives by presenting HyperFlexis, a unified system that jointly optimizes scheduling and scaling, achieving up to 4.44× higher SLO attainment and 65.82% lower request latency.

Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight loading overhead by up to 19.39$\times$. These optimizations allow the system to achieve up to 4.44$\times$ higher SLO attainment, 65.82% lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.

View on arXiv PDF

Similar