An empirical evaluation of imbalanced data strategies from a practitioner's point of view
It addresses the problem of imbalanced data for practitioners by providing an empirical comparison, but it is incremental as it builds on existing strategies without introducing new methods.
This paper evaluated six strategies for handling imbalanced data on 58 real-life binary datasets with imbalance rates from 3 to 120, finding that the effectiveness of each strategy varies significantly depending on the performance metric used.
This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach referred to as the baseline. These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120. We conducted a comparative analysis of 10 under-sampling algorithms, 5 over-sampling algorithms, 2 ensemble methods, and 3 specialized algorithms across eight different performance metrics: accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure, G-mean, Matthew's correlation coefficient, precision, and recall. Additionally, we assessed the six strategies on altered datasets, derived from real-life data, with both low (3) and high (100 or 300) imbalance ratios (IR). The principal finding indicates that the effectiveness of each strategy significantly varies depending on the metric used. The paper also examines a selection of newer algorithms within the categories of specialized algorithms, oversampling, and ensemble methods. The findings suggest that the current hierarchy of best-performing strategies for each metric is unlikely to change with the introduction of newer algorithms.