LG, AR | Apr 30, 2021

DRAM Failure Prediction in AIOps: Empirical Evaluation, Challenges and Opportunities

Zhiyue Wu, Hongzuo Xu, Guansong Pang, Fengyuan Yu, Yijie Wang, Songlei Jian, Yongjun Wang

arXiv:2104.15052v2 | 4.49 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the reliability of large-scale data centers, a practical concern for operators. Its contribution is incremental: it evaluates existing methods rather than introducing new ones.

The paper tackles DRAM failure prediction in AIOps through an empirical evaluation of machine learning techniques on a large-scale dataset from Alibaba Cloud containing more than three million records. It assesses both multi-class classification and anomaly detection methods, though specific performance numbers are not provided.

DRAM failure prediction is a vital task in AIOps, crucial to maintaining the reliability and sustainable service of large-scale data centers. However, limited work has been done on DRAM failure prediction, mainly due to the lack of publicly available datasets. This paper presents a comprehensive empirical evaluation of diverse machine learning techniques for DRAM failure prediction using a large-scale multi-source dataset, including more than three million records of kernel, address, and mcelog data, provided by Alibaba Cloud through the PAKDD 2021 competition. In particular, we first formulate the problem as a multi-class classification task and exhaustively evaluate seven popular/state-of-the-art classifiers on both individual and multiple data sources. We then formulate the problem as an unsupervised anomaly detection task and evaluate three state-of-the-art anomaly detectors. Further, based on the empirical results and our experience participating in this competition, we discuss major challenges and present future research opportunities in this task.
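To make the two problem formulations in the abstract concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy feature vectors, the class labels, and the two methods (a nearest-centroid classifier and a z-score detector) are simple stand-ins, not the classifiers or anomaly detectors evaluated in the paper, and the data is not from the PAKDD 2021 dataset.

```python
from statistics import mean, pstdev

def nearest_centroid_fit(X, y):
    """Supervised view: compute one centroid per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def zscore_anomaly_scores(X):
    """Unsupervised view: no labels; score = largest absolute z-score over features."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return [max(abs(v - m) / s for v, m, s in zip(x, mus, sigmas)) for x in X]

# Made-up features per server window: [kernel log count, mcelog count]
X = [[1, 0], [2, 1], [1, 1], [8, 9], [9, 8], [20, 25]]
# Hypothetical labels: 0 = healthy, 1 and 2 = two illustrative failure classes
y = [0, 0, 0, 1, 1, 2]

# Formulation 1: multi-class classification (uses the labels)
centroids = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(centroids, [9, 9])

# Formulation 2: unsupervised anomaly detection (ignores the labels)
scores = zscore_anomaly_scores(X)
most_anomalous = max(range(len(X)), key=scores.__getitem__)
```

The point of the contrast is that the first formulation needs a failure label for every training record, while the second only ranks records by how unusual they look and leaves the choice of alert threshold to the operator.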
