LG DB ME

Oct 13, 2023

A Survey of Methods for Handling Disk Data Imbalance

Shuangshuang Yuan, Peng Wu, Yuehui Chen, Qiang Li

arXiv:2310.08867v1

h-index: 7

Originality: Synthesis-oriented

AI Analysis

It provides a comprehensive overview for researchers working on imbalanced data classification, but it is incremental as it surveys existing methods without introducing new ones.

This paper surveys methods for handling class imbalance in classification problems, using the Backblaze hard disk dataset as an example of severe imbalance, and organizes the discussion into data-level, algorithmic-level, and hybrid approaches to help researchers select appropriate techniques.

Class imbalance exists in many classification problems. Because classifiers are typically designed to maximize overall accuracy, imbalance among the classes makes classification difficult, and the minority classes tend to carry the higher misclassification costs. The Backblaze dataset, a widely used hard disk dataset, contains a small amount of failure data and a large amount of healthy data, and therefore exhibits severe class imbalance. This paper provides a comprehensive overview of research in the field of imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithmic-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. We also discuss the challenges of imbalanced data classification, along with strategies to address them. The overview is intended to help researchers choose an appropriate method according to their needs.
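To make the first two categories concrete, here is a minimal Python sketch that is not taken from the paper. It assumes a toy label set of 98 healthy and 2 failed disks (hypothetical numbers, not Backblaze data). The function `random_oversample` illustrates one data-level idea, duplicating minority samples at random until the classes are the same size. The function `balanced_class_weights` illustrates one algorithm-level idea, weighting each class inversely to its frequency with the common heuristic n_samples / (n_classes * class_count). Both function names are illustrative.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Data-level sketch: duplicate minority samples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra copies (with replacement) only for classes below the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def balanced_class_weights(labels):
    """Algorithm-level sketch: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# Hypothetical toy data: label 0 = healthy disk, label 1 = failed disk.
labels = [0] * 98 + [1] * 2
samples = [[float(i)] for i in range(100)]

bal_x, bal_y = random_oversample(samples, labels)
weights = balanced_class_weights(labels)
```

A hybrid method, in the survey's sense, would combine the two levels, for example by resampling the training set and then training a cost-sensitive or ensemble model on the result. Resampling should be applied only to the training split so that the evaluation data keeps its original class distribution.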
