KAIROS: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
This work addresses a challenge facing businesses that deploy ML inference services: balancing cost against performance. It is an incremental improvement that applies novel techniques to a known bottleneck.
The paper tackles the problem of optimizing cost and performance for online machine learning inference in cloud systems. It introduces KAIROS, a runtime framework that maximizes query throughput under QoS and budget constraints. KAIROS achieves up to 2X the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%.
Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demands. Despite their revenue-generation capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70%, even though the competing schemes are implemented advantageously so that their exploration overhead is ignored.
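To make the budget-constrained objective concrete, the following toy sketch shows how a mixed pool can beat the best single-type pool under the same budget. It is not KAIROS's algorithm: the instance names, prices, and throughput figures are invented for illustration, throughput is assumed to add up linearly across instances, and each figure is assumed to already be the rate an instance sustains while meeting the QoS target. The brute-force search also presumes those figures are known in advance, which sidesteps the exploration problem the abstract describes.

```python
from itertools import product

# Hypothetical catalog: name -> (hourly cost in $, queries/s sustained within QoS).
# All numbers are invented for illustration.
CATALOG = {
    "gpu": (4.0, 100.0),
    "cpu": (1.0, 20.0),
}


def best_pool(catalog, budget):
    """Enumerate instance counts and return (throughput, pool) for the
    highest-throughput pool whose total cost stays within the budget."""
    names = list(catalog)
    # Upper bound per type: as many instances as the budget alone could buy.
    ranges = [range(int(budget // catalog[n][0]) + 1) for n in names]
    best = (0.0, dict.fromkeys(names, 0))
    for counts in product(*ranges):
        cost = sum(c * catalog[n][0] for c, n in zip(counts, names))
        if cost > budget:
            continue  # over budget, skip this mix
        qps = sum(c * catalog[n][1] for c, n in zip(counts, names))
        if qps > best[0]:
            best = (qps, dict(zip(names, counts)))
    return best


def best_homogeneous(catalog, budget):
    """Best (throughput, type name) when the pool uses a single instance type."""
    return max(
        (int(budget // cost) * qps, name) for name, (cost, qps) in catalog.items()
    )


if __name__ == "__main__":
    print(best_pool(CATALOG, budget=10.0))
    print(best_homogeneous(CATALOG, budget=10.0))
```

With these made-up numbers, the mixed pool spends the budget left over after buying the larger instances on smaller ones, which no single-type pool can do. A real system would also have to account for how queries are routed across unlike hardware, which this additive model ignores.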