LG · ML · Apr 7, 2020

CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification

arXiv:2004.03409v2 · 43 citations
AI Analysis

This work addresses class imbalance in classification, a common problem in machine learning. The contribution is incremental, since it builds on the existing SMOTE method.

The paper tackles imbalanced data classification by proposing a novel undersampling technique, SMUTE, and combining it with SMOTE oversampling into CSMOUTE. The reported results show improved performance, especially when the methods are paired with more complex classifiers such as MLP and SVM, and when they are applied to datasets containing many outliers.

In this paper we propose a novel data-level algorithm for handling data imbalance in the classification task, Synthetic Majority Undersampling Technique (SMUTE). SMUTE leverages the concept of interpolation of nearby instances, previously introduced in the oversampling setting in SMOTE. Furthermore, we combine both in the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling. The results of the conducted experimental study demonstrate the usefulness of both the SMUTE and the CSMOUTE algorithms, especially when combined with more complex classifiers, namely MLP and SVM, and when applied to datasets consisting of a large number of outliers. This leads us to the conclusion that the proposed approach shows promise for further extensions accommodating local data characteristics, a direction discussed in more detail in the paper.
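The abstract describes the idea only at a high level, so the following is a minimal sketch under stated assumptions rather than the paper's reference implementation. It assumes that SMUTE repeatedly picks a random majority instance and one of its k nearest majority neighbours and replaces the pair with a single interpolated point, and that CSMOUTE splits the class-size gap between SMOTE and SMUTE through a hypothetical `ratio` parameter. The function names and the parameters `k`, `ratio`, and `rng` are illustrative choices, not taken from the source.

```python
import numpy as np


def _nearest(X, i, k):
    """Indices of the k rows of X closest to X[i], excluding i itself."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return np.argsort(d)[:k]


def smote(X_min, n_new, k=5, rng=None):
    """Oversample: add n_new points interpolated between minority neighbours.

    Assumes X_min holds at least two instances.
    """
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(_nearest(X_min, i, k))
        new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack([X_min, *new]) if new else X_min.copy()


def smute(X_maj, n_remove, k=5, rng=None):
    """Undersample (assumed form): merge n_remove pairs of nearby majority points.

    Each step removes two instances and adds one interpolated instance,
    so the majority class shrinks by one per step.
    """
    rng = np.random.default_rng(rng)
    X = X_maj.copy()
    for _ in range(n_remove):
        if len(X) < 2:
            break
        i = rng.integers(len(X))
        j = rng.choice(_nearest(X, i, min(k, len(X) - 1)))
        merged = X[i] + rng.random() * (X[j] - X[i])
        X = np.vstack([np.delete(X, [i, j], axis=0), merged])
    return X


def csmoute(X_min, X_maj, ratio=0.5, k=5, rng=None):
    """Combine both: `ratio` of the class gap is closed by SMOTE, the rest by SMUTE.

    `ratio` is a hypothetical knob for this sketch: 1.0 is pure
    oversampling, 0.0 is pure undersampling.
    """
    rng = np.random.default_rng(rng)
    gap = len(X_maj) - len(X_min)
    n_over = int(round(ratio * gap))
    n_under = gap - n_over
    return smote(X_min, n_over, k, rng), smute(X_maj, n_under, k, rng)
```

With 10 minority and 40 majority instances and `ratio=0.5`, this sketch adds 15 synthetic minority points and merges away 15 majority points, so both classes end up with 25 instances. Because every new point is a convex combination of existing points from the same class, resampled instances stay inside that class's original bounding box.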

