A Comprehensive Survey on Imbalanced Data Learning
The survey addresses the pervasive issue of data imbalance affecting ML applications across various domains, but its contribution is incremental: it synthesizes existing knowledge rather than introducing new methods.
This survey tackles the problem of imbalanced data distributions hindering machine learning performance by systematically categorizing existing research into four approaches and analyzing real-world data formats, providing a structured overview to guide future work.
With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in many types of raw data and severely hinder ML performance by biasing decision-making. To deepen the understanding of imbalanced data and facilitate related research and applications, this survey systematically analyzes various real-world data formats and groups existing research across those formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers understand how pervasive imbalance is across diverse data formats, thereby paving a clearer path toward specific research goals. We also provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advances in this critical area of study.
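To make the first of the four categories concrete, the following is a minimal sketch of data re-balancing by random oversampling. It is an illustration only, not a method taken from the survey: the function name `random_oversample`, the toy data, and the fixed seed are all assumptions introduced here.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [x for x, y in zip(features, labels) if y == cls]
        # Draw with replacement to fill the gap to the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


# Toy data (assumed): 8 majority samples (label 0), 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Oversampling by duplication leaves the majority class untouched and adds no new information, so in practice it is often compared against undersampling, synthetic sampling, or the other three categories named in the abstract.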