Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification
This work addresses class imbalance in sequence classification, particularly for medical device applications, but the contribution is incremental because it combines existing GAN and autoencoder components.
The authors tackle imbalanced sequence classification by introducing a GAN-AE architecture that generates synthetic data, and show that it improves classification accuracy more than standard oversampling and other GAN-based models on a medical device dataset and other benchmarks.
Generative Adversarial Networks (GANs) have been used in many different applications to generate realistic synthetic data. We introduce a novel GAN with Autoencoder (GAN-AE) architecture to generate synthetic samples for variable-length, multi-feature sequence datasets. In this model, we extend a GAN architecture with an additional autoencoder component, and each component of the model is a recurrent neural network (RNN). The goal is to generate synthetic data that improves classification accuracy on a highly imbalanced medical device dataset. In addition to the medical device dataset, we evaluate GAN-AE on two further datasets and demonstrate its application to a sequence-to-sequence task, where both synthetic sequence inputs and synthetic sequence outputs must be generated. To evaluate the quality of the synthetic data, we train encoder-decoder models both with and without the synthetic data and compare classification performance. We show that a model trained with GAN-AE-generated synthetic data outperforms models trained with synthetic data from standard oversampling techniques, such as SMOTE and autoencoders, and from state-of-the-art GAN-based models.
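As a rough illustration of the augmentation step behind this evaluation, the sketch below balances a toy training set of variable-length, multi-feature sequences by adding generated minority-class samples. It is not the authors' implementation: the names `balance_with_synthetic` and `random_oversample` are hypothetical, and the generator is a plain random-oversampling placeholder (it duplicates existing minority sequences) standing in where a trained GAN-AE generator, SMOTE, or another method would be plugged in.

```python
import random
from collections import Counter

# A sequence is a list of time steps; each time step is a list of feature
# values. Sequences may have different lengths.

def random_oversample(minority, n_new, rng):
    """Placeholder generator (assumption): copies existing minority sequences
    chosen at random. A trained generative model would be swapped in here."""
    return [[list(step) for step in rng.choice(minority)] for _ in range(n_new)]

def balance_with_synthetic(sequences, labels, generate, seed=0):
    """Return an augmented copy of the training set in which every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(sequences), list(labels)
    for cls, n in counts.items():
        if n < target:
            # Real samples of the under-represented class seed the generator.
            pool = [s for s, y in zip(sequences, labels) if y == cls]
            synthetic = generate(pool, target - n, rng)
            out_x.extend(synthetic)
            out_y.extend([cls] * len(synthetic))
    return out_x, out_y

# Toy data: four majority sequences and one minority sequence,
# two features per time step, lengths from 2 to 6.
X = [[[0.1, 0.2]] * 3, [[0.0, 0.3]] * 5, [[0.2, 0.1]] * 2, [[0.1, 0.1]] * 4,
     [[0.9, 0.8]] * 6]
y = [0, 0, 0, 0, 1]

X_aug, y_aug = balance_with_synthetic(X, y, random_oversample)
print(Counter(y), "->", Counter(y_aug))
```

Under this setup, one classifier would be trained on `(X, y)` and another on `(X_aug, y_aug)`, and the two would be compared on the same held-out real data. Passing the generator as a callable keeps the comparison fair across oversampling methods, since only the source of synthetic samples changes.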