LG AI · Nov 1, 2024

A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data

Ismail Hakki Karaman, Gulser Koksal, Levent Eriskin, Salih Salihoglu

arXiv:2411.01013v3 · 2 citations · h-index: 2 · Int J Data Sci Anal

Originality: Incremental advance

AI Analysis

This paper addresses data imbalance in multi-label text classification, but the advance is incremental because it builds on existing oversampling techniques.

The paper tackles data imbalance in multi-label text classification with a similarity-based oversampling method. The method searches unlabeled data for instances similar to those in underrepresented classes and adds the ones that improve classifier performance to the labeled set. The reported experiments indicate that classifier performance improves after oversampling.

In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
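
The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words representation, the cosine similarity, the fixed `threshold`, the single pass over the unlabeled pool, and the choice to give each accepted candidate only the minority label are all assumptions made for illustration. The names `bow`, `cosine`, `oversample`, and `score_fn` are hypothetical; `score_fn` stands in for whatever classifier training and evaluation step is used.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def oversample(labeled, unlabeled, minority_label, score_fn, threshold=0.5):
    """Add unlabeled texts that resemble minority-class texts and improve the score.

    labeled:   list of (text, set_of_labels) pairs
    unlabeled: list of texts
    score_fn:  hypothetical callable mapping a labeled list to a performance score
    """
    labeled = list(labeled)  # work on a copy so the caller's list is unchanged
    # Seed vectors: the instances that already carry the underrepresented label.
    seeds = [bow(text) for text, labels in labeled if minority_label in labels]
    best = score_fn(labeled)

    for text in unlabeled:
        vec = bow(text)
        similarity = max((cosine(vec, seed) for seed in seeds), default=0.0)
        if similarity < threshold:
            continue  # not similar enough to any minority-class instance
        # Assumption: the candidate receives only the minority label.
        trial = labeled + [(text, {minority_label})]
        score = score_fn(trial)
        if score > best:  # keep the candidate only if performance improves
            labeled, best = trial, score
    return labeled
```

In practice `score_fn` would train a multi-label classifier on the trial set and return a validation metric, which makes each acceptance check expensive; the similarity filter keeps the number of such checks small.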
