LG · Apr 24, 2017

Automatic Anomaly Detection in the Cloud Via Statistical Learning

arXiv:1704.07706v1 · 186 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for high availability and performance in web services such as social networks. Its contribution is incremental: it adapts existing statistical methods to a specific domain.

The paper tackles automatic anomaly detection in cloud infrastructure data, which is challenging because the time series contain seasonal and trend components. It develops two novel statistical techniques that first apply seasonal decomposition and then use robust statistics, namely the median and the median absolute deviation (MAD), to flag anomalies. The techniques are evaluated on production data, with Precision, Recall, and F-measure reported.

Performance and high availability have become increasingly important drivers, among others, of user retention in web services such as social networks and web search. Exogenic and/or endogenic factors often give rise to anomalies, making it very challenging to maintain high availability while also delivering high performance. Given that service-oriented architectures (SOA) typically have a large number of services, with each service having a large set of metrics, automatic detection of anomalies is non-trivial. Although there exists a large body of prior research in anomaly detection, existing techniques are not applicable in the context of social network data, owing to the inherent seasonal and trend components in the time series data. To this end, we developed two novel statistical techniques for automatically detecting anomalies in cloud infrastructure data. Specifically, the techniques employ statistical learning to detect anomalies in both application and system metrics. Seasonal decomposition is employed to filter the trend and seasonal components of the time series, followed by the use of robust statistical metrics -- median and median absolute deviation (MAD) -- to accurately detect anomalies, even in the presence of seasonal spikes. We demonstrate the efficacy of the proposed techniques from three different perspectives, viz., capacity planning, user behavior, and supervised learning. In particular, we used production data for evaluation, and we report Precision, Recall, and F-measure in each case.
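The pipeline described in the abstract (remove trend and seasonality, then score what is left with the median and MAD) can be sketched as follows. This is a minimal illustration and not the paper's actual procedure: it assumes a constant trend approximated by the series median, estimates the seasonal component as the per-phase median, and uses hypothetical names (`detect_anomalies`, `period`, `threshold`) and a hypothetical default cutoff of 3 robust standard deviations.

```python
from statistics import median


def median_abs_deviation(values):
    """Median of the absolute deviations from the median (MAD)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def detect_anomalies(series, period, threshold=3.0):
    """Return indices of points whose residual is a robust outlier.

    Assumptions (illustrative, not taken from the paper):
    - the trend is flat and approximated by the median of the series;
    - the seasonal component is the median of each phase of the period;
    - `threshold` is a cutoff in MAD-based standard deviations.
    """
    if period < 1 or len(series) < period:
        raise ValueError("series must cover at least one full period")

    # Step 1: remove the (assumed constant) trend.
    level = median(series)
    detrended = [x - level for x in series]

    # Step 2: estimate and remove the seasonal component, phase by phase.
    seasonal = [median(detrended[phase::period]) for phase in range(period)]
    residuals = [d - seasonal[i % period] for i, d in enumerate(detrended)]

    # Step 3: score residuals with the median and MAD. The factor 1.4826
    # makes the MAD comparable to a standard deviation for normal data.
    center = median(residuals)
    scale = 1.4826 * median_abs_deviation(residuals)
    if scale == 0:
        return []  # no spread in the residuals, nothing to score against

    return [
        i for i, r in enumerate(residuals)
        if abs(r - center) / scale > threshold
    ]
```

Because both the seasonal estimate and the scoring step rely on medians rather than means, a single large spike barely shifts the baseline it is compared against, which is the property the abstract attributes to the robust metrics.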

Code Implementations: 3 repos