LG · AI · Apr 24, 2024

Anomaly Detection for Incident Response at Scale

arXiv:2404.16887v1 · 2 citations · h-index: 2
Originality: Synthesis-oriented
AI Analysis

This is an incremental improvement for large-scale business and system monitoring at Walmart, offering faster detection with fewer false positives.

The paper tackles real-time anomaly detection for incident response at Walmart, presenting AIDR, which covered 63% of major incidents and reduced mean-time-to-detect by over 7 minutes during a 3-month validation.

We present a machine learning-based anomaly detection product, AI Detect and Respond (AIDR), that monitors Walmart's business and system health in real time. During a 3-month validation, the product served predictions from over 3000 models to more than 25 application, platform, and operations teams, covering 63% of major incidents and reducing the mean-time-to-detect (MTTD) by more than 7 minutes. Unlike previous anomaly detection methods, our solution leverages statistical, ML, and deep learning models while retaining rule-based static thresholds to incorporate domain-specific knowledge. Both univariate and multivariate ML models are deployed and maintained through distributed services for scalability and high availability. AIDR has a feedback loop that assesses model quality with a combination of drift detection algorithms and customer feedback. It also offers self-onboarding capabilities and customizability. AIDR has achieved success with various internal teams, with lower time to detection and fewer false positives than previous methods. As we move forward, we aim to expand incident coverage and prevention, reduce noise, and integrate further with root cause recommendation (RCR) to enable an end-to-end AIDR experience.
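The abstract's idea of pairing a learned or statistical detector with rule-based static thresholds can be illustrated with a small sketch. This is not AIDR's implementation, which the abstract does not describe in detail. It assumes a rolling z-score as a stand-in for the statistical model and a fixed ceiling as the domain rule. The function name `detect_anomalies` and the parameters `window`, `z_limit`, and `static_max` are hypothetical.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_limit=3.0, static_max=None):
    """Return indices flagged by a rolling z-score OR a static threshold.

    Illustrative only: the z-score stands in for a statistical/ML model,
    and static_max stands in for a domain-specific rule.
    """
    flagged = []
    for i, x in enumerate(values):
        # Rule-based check: a fixed ceiling that encodes domain knowledge.
        rule_hit = static_max is not None and x > static_max

        # Statistical check: compare x with the preceding `window` points.
        stat_hit = False
        if i >= window:
            history = values[i - window:i]
            mu, sigma = mean(history), stdev(history)
            # A flat history (sigma == 0) gives no basis for a z-score.
            stat_hit = sigma > 0 and abs(x - mu) / sigma > z_limit

        if rule_hit or stat_hit:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    series = [10, 11, 10, 12, 11, 10, 50, 11, 10, 95, 11]
    print(detect_anomalies(series, static_max=90))  # [6, 9]
```

In this sketch the static rule can fire even before enough history exists for the statistical check, which is one plausible reason to keep both kinds of detector side by side.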

