Acela: Predictable Datacenter-level Maintenance Job Scheduling
This work addresses a specific problem for datacenter operators: making maintenance scheduling more efficient. It is incremental in that it builds on existing machine learning techniques, but it contributes a novel cost-aware adaptation.
The paper tackles the challenge of predicting maintenance job durations in datacenters, where asymmetric costs make the lowest-error predictions suboptimal for scheduling. Acela uses quantile regression to bias predictions toward overprediction, reducing the number of servers taken offline by 1.87-4.28X and server offline time by 1.40-2.80X compared to machine-learning-based predictors from prior work.
Datacenter operators ensure fair and regular server maintenance by using automated processes to schedule maintenance jobs to complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies based on both job type and hardware. While it is tempting to use prior machine learning techniques for predicting job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes due to asymmetric costs. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it, so the system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale, production datacenters. Compared to machine-learning-based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X, and reduces the server offline time by 1.40-2.80X.
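To make the asymmetric-cost argument concrete, the following is a minimal sketch of the quantile (pinball) loss that quantile regression minimizes. It is not Acela's implementation: the abstract does not state which quantile level or model Acela uses, so the level 0.9, the job durations, and the two constant guesses below are all illustrative assumptions. The sketch shows that a central guess scores better under a symmetric loss (level 0.5), while a deliberately high guess scores better once underprediction is weighted more heavily (level 0.9).

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss at quantile level q, with 0 < q < 1.

    Underprediction (y > prediction) is weighted by q and
    overprediction (y < prediction) by 1 - q, so q > 0.5
    penalizes underprediction more heavily.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Hypothetical maintenance job durations in minutes (not from the paper).
durations = [10.0, 12.0, 15.0, 20.0, 45.0]

# Two hypothetical constant predictors: a central guess and a high guess.
low_guess = [15.0] * len(durations)
high_guess = [35.0] * len(durations)

for q in (0.5, 0.9):
    print(
        f"q={q}: low guess loss={pinball_loss(durations, low_guess, q):.2f}, "
        f"high guess loss={pinball_loss(durations, high_guess, q):.2f}"
    )
```

In a real predictor the constant guesses would be replaced by a model trained on job type and hardware features with this loss as its objective. The quantile level then becomes a tuning knob for how strongly to favor overprediction.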