Data Augmentation with Variational Autoencoder for Imbalanced Dataset
This work addresses imbalanced regression, a significant problem in predictive modeling that has been explored far less than imbalanced classification.
The paper tackles imbalanced regression on tabular data by proposing a novel method that combines variational autoencoders with a smoothed bootstrap for synthetic data generation. It reports improved performance in numerical comparisons against competing methods on simulations and on datasets known for imbalanced regression.
Learning from an imbalanced distribution presents a major challenge in predictive modeling, as it generally reduces the performance of standard algorithms. Various approaches exist to address this issue, but most of them concern classification problems, with limited attention to regression. In this paper, we introduce a novel method aimed at enhancing learning on tabular data in the Imbalanced Regression (IR) framework, which remains a significant problem. We propose to use variational autoencoders (VAEs), which are known as a powerful tool for synthetic data generation and offer an interesting approach to modeling and capturing latent representations of complex distributions. However, VAEs can be inefficient when dealing with IR. Therefore, we develop a novel approach for generating data that combines a VAE with a smoothed bootstrap and is specifically designed to address the challenges of IR. We numerically investigate the scope of this method by comparing it against its competitors on simulations and on datasets known for IR.
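The abstract does not spell out how the smoothed bootstrap is combined with the VAE, so the following is only a generic sketch of a weighted smoothed bootstrap, not the authors' algorithm. It resamples rows with replacement and perturbs each draw with Gaussian kernel noise. The function names (`inverse_frequency_weights`, `smoothed_bootstrap`), the histogram-based weighting of rare target values, the Silverman rule-of-thumb bandwidth, and the idea of applying the procedure to latent codes `Z` are all assumptions made for illustration.

```python
import numpy as np


def inverse_frequency_weights(y, bins=10):
    """Hypothetical helper: weight each sample by 1 / (size of its target bin)."""
    y = np.asarray(y, dtype=float)
    counts, edges = np.histogram(y, bins=bins)
    # Use inner edges only, so the maximum of y falls into the last bin.
    bin_idx = np.digitize(y, edges[1:-1])
    # Guard against a zero count caused by floating-point edge effects.
    return 1.0 / np.maximum(counts[bin_idx], 1)


def smoothed_bootstrap(X, n_samples, weights=None, bandwidth=None, seed=0):
    """Draw rows of X with replacement, then add Gaussian kernel noise."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if weights is None:
        p = np.full(n, 1.0 / n)
    else:
        p = np.asarray(weights, dtype=float)
        p = p / p.sum()
    if bandwidth is None:
        # Assumption: Silverman's rule of thumb, one bandwidth per feature.
        scale = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidth = X.std(axis=0, ddof=1) * scale
    idx = rng.choice(n, size=n_samples, replace=True, p=p)
    noise = rng.normal(size=(n_samples, d)) * bandwidth
    return X[idx] + noise


# Toy usage: Z stands in for latent codes, y for a skewed regression target.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 3))
y = rng.exponential(size=200)
w = inverse_frequency_weights(y)
Z_new = smoothed_bootstrap(Z, n_samples=50, weights=w)
```

In a VAE-based pipeline, one plausible use would be to pass the resampled codes `Z_new` through the decoder to obtain synthetic observations that over-represent rare target values. Whether the paper follows this route is not stated in the abstract.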