LG | Feb 13, 2025

A Comprehensive Survey on Imbalanced Data Learning

Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Conghui He, Hongzhi Yin, Wentao Zhang

arXiv:2502.08960v311.429 citations | h-index: 11 | Has Code

Originality: Synthesis-oriented

AI Analysis

The survey addresses the pervasive problem of data imbalance, which affects ML applications across many domains, but its contribution is incremental: it synthesizes existing knowledge rather than introducing new methods.

This survey tackles the problem of imbalanced data distributions hindering machine learning performance by systematically categorizing existing research into four approaches and analyzing real-world data formats, providing a structured overview to guide future work.

With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in various types of raw data and severely hinder the performance of ML by biasing the decision-making processes. To deepen the understanding of imbalanced data and facilitate related research and applications, this survey systematically analyzes various real-world data formats and groups existing research on these formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers comprehensively understand the pervasive nature of imbalance across diverse data formats, thereby paving a clearer path toward achieving specific research goals. We provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advancements in this critical area of study.
