DC, IR, LG · Feb 23, 2023

Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations

arXiv:2302.11750v1 · 1 citation · h-index: 26
Originality: Incremental advance
AI Analysis

This work addresses cost-effective deployment of personalized recommendation services in datacenters by improving resource utility while maintaining low latency, representing an incremental optimization for multi-tenant inference systems.

The paper tackles interference from co-locating multiple recommendation model workers on an inference server, which can cause queries to violate their latency SLAs. It proposes Hera, a heterogeneity-aware multi-tenant server that selects which models to co-locate and allocates resources among them, yielding a 37.3% average improvement in effective machine utilization and a 26% reduction in required servers.

While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utility is also crucial in cost-effectively maintaining the datacenter. Co-locating multiple workers of a model is an effective way to maximize query-level parallelism and server throughput, but the interference caused by concurrent workers at shared resources can prevent server queries from meeting their SLAs. Hera utilizes the heterogeneous memory requirements of multi-tenant recommendation models to intelligently determine a productive set of co-located models and their resource allocation, providing fast response times while achieving high throughput. We show that Hera achieves an average 37.3% improvement in effective machine utilization, enabling a 26% reduction in required servers, significantly improving upon the baseline recommendation inference server.

