LG Sep 11, 2021

Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks

Austin Coursey, Gopal Nath, Srikanth Prabhu, Saptarshi Sengupta

arXiv:2109.05351v14.413 citations

Originality: Incremental advance

AI Analysis

This work addresses the reliability issue of hard disk drives in data centers, enabling proactive maintenance to reduce downtime and operational losses, though it appears incremental as it builds on existing LSTM and data-driven methods.

The paper tackles the problem of predicting hard disk failures in data centers with a Bidirectional LSTM model trained on drive health statistics. Using a 15-day look-back period, it reports 96.4% accuracy on test data 60 days before failure and a mean absolute error of 0.12 for predictions up to 60 days in advance.

Physical and cloud storage services are well-served by functioning and reliable high-volume storage systems. Recent observations point to hard disk reliability as one of the most pressing reliability issues in data centers containing massive volumes of storage devices such as HDDs. In this regard, early detection of impending failure at the disk level helps reduce system downtime and operational loss, making proactive health monitoring a priority for AIOps in such settings. In this work, we introduce methods of extracting meaningful attributes associated with operational failure and of pre-processing the highly imbalanced health statistics data for subsequent prediction tasks using data-driven approaches. We use a Bidirectional LSTM with a multi-day look-back period to learn the temporal progression of health indicators and baseline it against vanilla LSTM and Random Forest models, reporting several key metrics that establish the usefulness and superiority of our model under some tightly defined operational constraints. For example, using a 15-day look-back period, our approach can predict the occurrence of disk failure with an accuracy of 96.4% on test data 60 days before failure. This helps to alert operations maintenance well in advance about potential mitigation needs. In addition, our model reports a mean absolute error of 0.12 for predicting failure up to 60 days in advance, placing it among the state-of-the-art in recent literature.
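The multi-day look-back setup described in the abstract can be illustrated with a minimal sketch of how a single failed drive's daily health readings might be sliced into fixed-length windows with remaining-useful-life targets. This is not the authors' pipeline: the function names `make_windows` and `mean_absolute_error`, the choice to drop windows ending more than 60 days before failure, and the scaling of the target to the range [0, 1] are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[Sequence[float]],
    look_back: int = 15,
    horizon: int = 60,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Slice one failed drive's daily readings into look-back windows.

    `series` is ordered oldest to newest and its last row is the failure day.
    Each window is labeled with the days remaining at its final day, divided
    by `horizon` (an assumed normalization, not taken from the paper).
    Windows ending more than `horizon` days before failure are dropped.
    """
    n = len(series)
    windows: List[List[List[float]]] = []
    targets: List[float] = []
    for end in range(look_back - 1, n):
        days_left = (n - 1) - end
        if days_left > horizon:
            continue  # outside the assumed prediction horizon
        start = end - look_back + 1
        windows.append([list(row) for row in series[start : end + 1]])
        targets.append(days_left / horizon)
    return windows, targets


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical drive with 100 daily readings of 2 health attributes.
drive = [[float(day), 0.0] for day in range(100)]
windows, targets = make_windows(drive, look_back=15, horizon=60)
```

Each element of `windows` has shape (look_back, number of attributes), which is the usual input layout for a recurrent model such as a bidirectional LSTM; the model itself is omitted here.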
