DC LG, Oct 12, 2022

KAIROS: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari

arXiv:2210.05889v35.115 citations, h-index: 34

Originality Highly original

AI Analysis

This paper addresses the challenge businesses face in balancing cost and performance when deploying ML inference services. It represents an incremental improvement that applies novel techniques to a known bottleneck.

The paper tackles the problem of optimizing cost and performance for online machine learning inference in cloud systems. It introduces KAIROS, a runtime framework that maximizes query throughput under QoS and budget constraints, achieving up to 2X the throughput of an optimal homogeneous solution and outperforming state-of-the-art schemes by up to 70%.

Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demands. Despite their revenue-generation capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70%, even when the competing schemes are given the advantage of having their exploration overhead ignored.
