LG · Apr 12, 2023

Towards Understanding How Data Augmentation Works with Imbalanced Data

arXiv:2304.05895
Originality: Synthesis-oriented
AI Analysis

This work examines how data augmentation works, with researchers and practitioners who handle imbalanced datasets as its audience. It is incremental: it builds on existing techniques without introducing new methods.

The study investigated how data augmentation affects classifiers like CNNs, SVMs, and logistic regression on imbalanced data, finding that it induces significant changes in model parameters despite modest improvements in global metrics such as balanced accuracy or F1 scores.

Data augmentation forms the cornerstone of many modern machine learning training pipelines, yet the mechanisms by which it works are not clearly understood. Much of the research on data augmentation (DA) has focused on improving existing techniques, examining its regularization effects in the context of neural network over-fitting, or investigating its impact on features. Here, we undertake a holistic examination of the effect of DA on three different classifiers (convolutional neural networks, support vector machines, and logistic regression models) that are commonly used in supervised classification of imbalanced data. We support our examination with testing on three image and five tabular datasets. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection, even though it may yield only relatively modest changes to global metrics such as balanced accuracy or F1 measure. We hypothesize that DA works by facilitating variance in the data, so that machine learning models can associate changes in the data with labels. By diversifying the range of feature amplitudes that a model must recognize to predict a label, DA improves a model's capacity to generalize when learning with imbalanced data.
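To make the abstract's terms concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not say which DA techniques were used, so the Gaussian-jitter oversampling below is an assumed stand-in, and the names `augment_minority`, `balanced_accuracy`, `f1_measure`, and the `sigma` parameter are hypothetical. The two metric functions follow the standard definitions of balanced accuracy (mean per-class recall) and F1.

```python
import random
from collections import Counter


def augment_minority(X, y, sigma=0.05, seed=0):
    """Oversample each non-majority class with jittered copies of its rows
    until every class matches the majority count (illustrative DA only)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_aug, y_aug = [list(row) for row in X], list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(pool)
            # Gaussian noise widens the range of feature values seen per label.
            X_aug.append([v + rng.gauss(0.0, sigma) for v in base])
            y_aug.append(label)
    return X_aug, y_aug


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def f1_measure(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy imbalanced data: 8 majority rows (label 0), 2 minority rows (label 1).
X = [[0.1, 0.2] for _ in range(8)] + [[0.9, 0.8], [0.8, 0.9]]
y = [0] * 8 + [1] * 2
X_aug, y_aug = augment_minority(X, y)
```

On this toy data, a classifier that always predicts the majority class scores 0.8 plain accuracy but only 0.5 balanced accuracy, which is why the abstract reports class-sensitive metrics rather than raw accuracy.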

Code Implementations: 1 repo