DB · LG · PF · SE · Apr 24, 2021

Highly Efficient Memory Failure Prediction using Mcelog-based Data Mining and Machine Learning

arXiv:2105.04547v2 · 1 citation · Has Code
AI Analysis

This paper addresses memory failure prediction for data center stability. Its contribution is incremental: it builds on existing methods and adds optimizations.

The paper tackles memory failure prediction in data centers to prevent downtime. Its single-model solution placed 14th in the 2nd Alibaba Cloud AIOps Competition and completes the online test in 30 minutes, whereas most other contestants' solutions need more than 3 hours.

In a data center, unexpected downtime caused by memory failures can reduce the stability of individual servers and even of the entire information technology infrastructure, which harms the business. Whether memory failures can be accurately predicted in advance has therefore become one of the most important questions for data center operations. Memory failure prediction in a production system, however, must cope with very noisy data and an extreme imbalance between positive and negative samples, while also keeping the algorithm stable over the long term. This paper compares and summarizes some commonly used techniques and the improvement each can bring. The single model we propose ranked 14th in the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th PAKDD conference. It takes only 30 minutes to pass the online test, while most other contestants' solutions need more than 3 hours. The code is open source at https://www.github.com/ycd2016/acaioc2.
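The abstract names extreme class imbalance as one of the core difficulties but does not say here how it is handled. As a purely illustrative sketch, and not the authors' method, one common way to reduce imbalance is to keep every positive (failure) sample and randomly downsample the negatives to a fixed ratio. The function name `downsample_negatives`, the `ratio` parameter, and the toy data are all assumptions made for this example.

```python
import random


def downsample_negatives(samples, labels, ratio=10, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    Hypothetical helper for illustration; it is not taken from the paper
    or its repository.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), ratio * len(pos))
    chosen = rng.sample(neg, n_keep)  # sample negatives without replacement
    keep = sorted(pos + chosen)  # preserve the original (time) order
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data: 5 failures among 1000 records (made-up numbers).
labels = [1] * 5 + [0] * 995
samples = list(range(1000))
X, y = downsample_negatives(samples, labels, ratio=10)
```

With these toy inputs the result keeps all 5 positives plus 50 negatives. Downsampling also shrinks the training set, which tends to shorten training time, though whether it helps accuracy depends on the data.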
