SELG, Jan 25, 2024

McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions

arXiv:2401.14093v1, 1 citation
Originality: Synthesis-oriented
AI Analysis

The paper addresses the costly need for expert-labeled data when maintaining AIOps models, offering a domain-specific, incremental improvement.

The paper tackles the performance degradation that AIOps failure prediction models suffer as operational data changes. It introduces McUDI, an unsupervised degradation indicator that detects when retraining is needed. Using it reduces the number of labeled samples required by 30k for job failure prediction and by 260k for disk failure prediction, while maintaining performance similar to periodic retraining.

Due to the continuous change in operational data, AIOps solutions suffer from performance degradation over time. Although periodic retraining is the state-of-the-art technique for preserving the performance of failure prediction AIOps models over time, it requires a considerable amount of labeled data. In AIOps, obtaining labeled data is expensive because it requires domain experts to annotate it intensively. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment the AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotation by 30k for job failure prediction and by 260k for disk failure prediction, while achieving performance similar to periodic retraining.
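
The abstract does not spell out how the indicator is computed, so the following is only a minimal sketch of the general idea of retraining on demand rather than on a fixed schedule. It assumes, purely for illustration, that degradation is flagged by a two-sample Kolmogorov-Smirnov test on a few monitored features; this is a generic drift check, not necessarily McUDI's procedure. The function names, the `monitored_features` parameter, and the `load` feature are all hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic rejection threshold for the two-sample KS test."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def needs_retraining(reference, current, monitored_features, alpha=0.05):
    """Assumed rule: flag degradation if any monitored feature has drifted.

    No labels are used, so no annotation is needed to make this decision.
    """
    for feature in monitored_features:
        ref_col = [row[feature] for row in reference]
        cur_col = [row[feature] for row in current]
        threshold = ks_critical_value(len(ref_col), len(cur_col), alpha)
        if ks_statistic(ref_col, cur_col) > threshold:
            return True
    return False


# Hypothetical data: "load" stands in for any operational feature.
reference = [{"load": i / 100} for i in range(200)]
stable = [{"load": i / 100} for i in range(200)]
shifted = [{"load": i / 100 + 5.0} for i in range(200)]

print(needs_retraining(reference, stable, ["load"]))   # unchanged data
print(needs_retraining(reference, shifted, ["load"]))  # drifted data
```

In a maintenance pipeline built this way, experts would be asked to label a new batch only when `needs_retraining` returns `True`. That is the kind of mechanism by which an unsupervised indicator can cut annotation cost relative to periodic retraining.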

