ML | LG | Feb 6, 2024

Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

arXiv:2402.03819v5 | 13 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the problem of handling imbalanced data for machine learning practitioners, providing theoretical insights and empirical evidence that challenge common practices, though it is incremental in refining existing methods.

The paper tackles the effectiveness of rebalancing strategies like SMOTE for imbalanced tabular data, finding that applying no rebalancing is often competitive in predictive performance, but a modified SMOTE variant shows promise under high imbalance.

Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on the SMOTE density. From these results, we prove that SMOTE (with its default parameter) tends to copy the original minority samples asymptotically. We empirically confirm and illustrate this first theoretical behavior on a real-world data set. Furthermore, we prove that the SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on 13 tabular data sets with 10 state-of-the-art rebalancing procedures, including deep generative and diffusion models. One of our key findings is that, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performance, whether with LightGBM, tuned random forests, or logistic regression. However, when the imbalance ratio is artificially increased, one of our two modifications of SMOTE leads to promising predictive performance compared to SMOTE and other state-of-the-art strategies.
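For readers unfamiliar with the sampling step the abstract analyzes, the following is a minimal sketch of the standard SMOTE generation procedure, not the authors' code and not either of their two new variants. It assumes that the "default parameter" refers to K = 5 nearest neighbors, as in common implementations. The function name `smote_sample` and its arguments are hypothetical, chosen for illustration only.

```python
import math
import random


def smote_sample(minority, k=5, n_new=100, seed=0):
    """Draw n_new synthetic points from minority samples (tuples of floats).

    Illustrative sketch of plain SMOTE; k=5 is assumed to be the default.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than exist
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a central minority point uniformly at random.
        i = rng.randrange(len(minority))
        center = minority[i]
        # 2. Find its k nearest minority neighbors (brute force, Euclidean)
        #    and pick one of them uniformly at random.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(center, p))
        neighbor = rng.choice(others[:k])
        # 3. Interpolate at a uniform random position on the segment
        #    between the central point and the chosen neighbor.
        w = rng.random()
        synthetic.append(
            tuple(c + w * (n - c) for c, n in zip(center, neighbor))
        )
    return synthetic


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    for p in smote_sample(points, k=5, n_new=3, seed=1):
        print(p)
```

Every synthetic point lies on a segment between two existing minority samples, so it stays inside their convex hull. Informally, as the minority sample grows with K fixed, the nearest neighbors get closer to the central point and the segments shrink, which is one way to read the abstract's statement that SMOTE tends to copy the original minority samples.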

Code Implementations: 1 repo
