DC · LG · Sep 29, 2023

Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer

arXiv:2309.17340v2 · 5 citations · h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses revenue loss and reliability for cloud service providers by enabling early outage prediction. The advance appears incremental: it builds on existing prediction methods and adds a novel regularizer.

The paper tackles the problem of predicting critical cloud service outages in advance. It defines outages as deteriorations in Quality of Service metrics and models those metrics with a mixture of Gaussians, adding an extreme event regularizer to improve learning in the tail of the distribution. The results show that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98, detects all outages exhibiting metric changes, and reduces Mean Time To Detection by up to 88% on a real-world SaaS dataset.

Cloud services are omnipresent, and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help reduce both the severity and the time to recovery. Critical failures are difficult to forecast because they are rare. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing the threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method.
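
The prediction step described in the abstract can be illustrated with a minimal sketch: given mixture-of-Gaussians parameters for one QoS metric, compute the probability that the metric exceeds a threshold, then flag an outage when that probability rises sharply. This is not the authors' implementation. The function names, the `min_jump` parameter, and the simple "rise by at least `min_jump`" rule are assumptions made for illustration, since the abstract does not say how a "significant" change is measured. The extreme event regularizer belongs to the training loss and is not shown here.

```python
import math


def gaussian_tail(threshold, mean, std):
    """Return P(X > threshold) for X ~ N(mean, std^2)."""
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def mixture_exceedance(threshold, weights, means, stds):
    """Return P(X > threshold) for a Gaussian mixture.

    The mixture tail probability is the weighted sum of the
    component tail probabilities.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_tail(threshold, m, s)
        for w, m, s in zip(weights, means, stds)
    )


def outage_flag(prev_prob, curr_prob, min_jump=0.1):
    """Hypothetical decision rule (an assumption, not from the paper):
    flag an outage when the exceedance probability rises by at least
    min_jump between two consecutive predictions."""
    return (curr_prob - prev_prob) >= min_jump
```

In use, a model would emit `weights`, `means`, and `stds` for each QoS metric at each time step. The exceedance probability would be tracked per metric, and an outage predicted as soon as any one metric triggers the flag.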

