Learning Classifiers for Imbalanced and Overlapping Data
This work addresses the challenge that imbalanced data poses for machine learning practitioners, but it appears incremental: it builds on existing resampling methods and adds a new optimization technique.
The study tackled the problem of learning classifiers from imbalanced and overlapping data by creating artificial data sets and training decision trees and rule-based classifiers on them. It then sought to improve performance using resampling methods and a new technique called Sparsity, and it reports comparisons of the different pre-processing approaches.
This study concerns inducing classifiers from imbalanced data, in which a minority class is under-represented relative to the majority classes. The first part of the research examines the main characteristics of the data that give rise to this problem. Following a review of relevant previous research, a variety of artificial, imbalanced data sets were created, each shaped by the factors identified as important. These data sets were used to induce decision trees and rule-based classifiers. The second part of the research investigates how to improve the classifiers by pre-processing the data with resampling approaches. The subsequent experiments compare the performance of several pre-processing resampling methods: two variants of random over-sampling and focused under-sampling with NCR (Neighbourhood Cleaning Rule). The paper further addresses class imbalance with a new method called Sparsity, which makes the data sparser relative to its class centers and hence more homogeneous.
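As an illustration of the simplest of the resampling approaches mentioned above, the following is a minimal sketch of plain random over-sampling: minority examples are duplicated at random until the minority class is as large as the largest other class. It is not necessarily either of the two variants evaluated in the study, and it does not implement NCR or Sparsity. The function name `random_oversample`, its parameters, and the two-Gaussian toy data are assumptions made for illustration only.

```python
import numpy as np


def random_oversample(X, y, minority_label, rng=None):
    """Duplicate randomly chosen minority examples (with replacement)
    until the minority class matches the largest other class in size.

    Hypothetical helper for illustration; not the study's implementation.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    y = np.asarray(y)

    minority_idx = np.flatnonzero(y == minority_label)
    labels, counts = np.unique(y, return_counts=True)
    target = counts[labels != minority_label].max()

    n_extra = target - minority_idx.size
    if n_extra <= 0:
        # Already balanced (or the "minority" is not actually smaller).
        return X, y

    # Sample indices of existing minority examples to duplicate.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    X_res = np.concatenate([X, X[extra]])
    y_res = np.concatenate([y, y[extra]])
    return X_res, y_res


# Toy imbalanced, overlapping data: two 2-D Gaussians with nearby means
# (an assumed setup, not the artificial data sets used in the study).
data_rng = np.random.default_rng(0)
X_major = data_rng.normal(loc=0.0, scale=1.0, size=(190, 2))
X_minor = data_rng.normal(loc=1.0, scale=1.0, size=(10, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 190 + [1] * 10)

X_res, y_res = random_oversample(X, y, minority_label=1, rng=0)
print(np.bincount(y), np.bincount(y_res))
```

Because the added rows are exact copies of existing minority examples, this kind of over-sampling changes the class proportions but adds no new information about the overlapping region, which is one reason focused methods such as NCR are studied alongside it.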