LG ML Sep 29, 2020

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

arXiv:2009.14096v22.349 citations

Originality: Incremental advance

AI Analysis

This work addresses imbalanced classification, a common issue in industrial applications, by improving oversampling so that the labels of synthetic samples are more credible and classification performance improves. The contribution appears incremental, since it builds on existing techniques such as SMOTE.

The paper tackles the inaccurate labeling of synthetic samples in oversampling for imbalanced classification by introducing weakly supervised learning, specifically graph semi-supervised SMOTE, together with cost-sensitive neighborhood components analysis and a bootstrap-based ensemble framework. The authors report good classification performance on 8 synthetic and 3 real-world datasets, with better average performance and robustness than the benchmark methods, particularly on highly imbalanced and high-dimensional data.

With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective method for addressing imbalanced classification. One of the main challenges of existing oversampling methods is labeling the new synthetic samples accurately. Inaccurate labels on the synthetic samples distort the distribution of the dataset and can worsen classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples caused by traditional oversampling methods. Graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose cost-sensitive neighborhood components analysis for high-dimensional datasets and a bootstrap-based ensemble framework for highly imbalanced datasets. The proposed method achieves good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high-imbalance and high-dimensionality problems. Its average performance and robustness are better than those of the benchmark methods.
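
To make the core idea concrete, the following is a minimal sketch of how synthetic samples might be generated and then relabeled through a graph, rather than being assumed to belong to the minority class. It is not the authors' algorithm: it combines plain SMOTE-style interpolation with a generic k-nearest-neighbour label propagation step, and it omits the cost-sensitive neighborhood components analysis and the bootstrap ensemble entirely. The function names (`smote_candidates`, `propagate_labels`), the Gaussian edge weights, the toy data, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def smote_candidates(minority, k, n_new, rng):
    """Create n_new points by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def propagate_labels(points, labels, k=5, sigma=1.0, n_iter=50):
    """Spread labels over a kNN graph (hypothetical helper).

    labels holds +1 (minority), -1 (majority) or None (unlabeled).
    Labeled scores stay clamped; unlabeled scores become the weighted
    average of their neighbours' scores.
    """
    n = len(points)
    weights = []
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        # Gaussian edge weights: closer neighbours have more influence
        weights.append({j: math.exp(-d * d / (2 * sigma ** 2)) for d, j in nearest})

    scores = [0.0 if lab is None else float(lab) for lab in labels]
    for _ in range(n_iter):
        updated = scores[:]
        for i in range(n):
            if labels[i] is None:
                total = sum(weights[i].values())
                if total > 0:
                    updated[i] = sum(w * scores[j] for j, w in weights[i].items()) / total
        scores = updated
    return scores


# Toy data (assumed): a large majority cluster and a small minority cluster.
rng = random.Random(0)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(8)]
synthetic = smote_candidates(minority, k=3, n_new=12, rng=rng)

points = majority + minority + synthetic
labels = [-1] * len(majority) + [1] * len(minority) + [None] * len(synthetic)
scores = propagate_labels(points, labels)

# Keep only synthetic points whose propagated score leans towards the minority.
n_labeled = len(majority) + len(minority)
kept = [p for p, s in zip(synthetic, scores[n_labeled:]) if s > 0]
print(f"kept {len(kept)} of {len(synthetic)} synthetic samples")
```

In this sketch a synthetic point that lands close to majority samples receives a low or negative score and is dropped, which is one simple way to act on label credibility. A real pipeline would more likely use the score as a sample weight or as a soft label.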
