DC, AI · Jan 14, 2025

Hierarchical Autoscaling for Large Language Model Serving with Chiron

Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

arXiv:2501.08090v1 · 8.0 · 14 citations · h-index: 9

Originality: Incremental advance

AI Analysis

The work addresses the challenge of efficient resource management for cloud providers handling LLM inference workloads with varying SLOs, and represents an incremental improvement over prior autoscalers.

The paper tackles resource autoscaling for large language model serving by introducing Chiron, an autoscaler that uses hierarchical backpressure estimated from queue size, utilization, and service-level objectives (SLOs). The authors report up to 90% higher SLO attainment and up to 70% better GPU efficiency than existing solutions.

Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs on the order of seconds, and (b) batch requests that have relaxed SLOs on the order of minutes to hours. These SLOs can degrade based on the arrival rates, multiplexing, and configuration parameters, thus necessitating the use of resource autoscaling on serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
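To make the idea of SLO-aware backpressure concrete, here is a minimal Python sketch. It is not Chiron's algorithm, which the abstract does not spell out. It assumes a simple two-level scheme: a global step that sizes the instance pool so the queue can drain within the SLO, and a local step that adjusts batch size from observed latency. All names (`QueueState`, `backpressure`, `instances_needed`, `next_batch_size`) and the halve-on-miss policy are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one request class (interactive or batch)."""
    queued_requests: int             # requests currently waiting
    slo_seconds: float               # deadline for this request class
    per_instance_throughput: float   # requests/second per instance (assumed measured)


def backpressure(state: QueueState, instances: int) -> float:
    """Estimated queue drain time divided by the SLO; above 1 the SLO is at risk."""
    if instances <= 0:
        return math.inf if state.queued_requests > 0 else 0.0
    drain_time = state.queued_requests / (instances * state.per_instance_throughput)
    return drain_time / state.slo_seconds


def instances_needed(state: QueueState) -> int:
    """Global step: smallest instance count that keeps backpressure at or below 1."""
    required = state.queued_requests / (
        state.per_instance_throughput * state.slo_seconds
    )
    return max(1, math.ceil(required))


def next_batch_size(current: int, observed_latency: float,
                    slo_seconds: float, max_batch: int = 64) -> int:
    """Local step (assumed policy): grow by one with headroom, halve on an SLO miss."""
    if observed_latency > slo_seconds:
        return max(1, current // 2)
    return min(max_batch, current + 1)


if __name__ == "__main__":
    interactive = QueueState(queued_requests=100, slo_seconds=5.0,
                             per_instance_throughput=4.0)
    batch = QueueState(queued_requests=6000, slo_seconds=600.0,
                       per_instance_throughput=4.0)
    print(instances_needed(interactive), instances_needed(batch))
    print(backpressure(interactive, instances=2))
```

In this toy model, a relaxed SLO lets a much longer batch queue be served by fewer instances than a short interactive queue, which is the intuition behind taking SLOs into account instead of scaling on queue length or utilization alone.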
