Subpopulation-Specific Synthetic EHR for Better Mortality Prediction
The work addresses a fairness and generalization problem in healthcare AI for underrepresented patient groups, but the contribution is incremental because it builds on existing generative methods.
The paper tackles the poor model generalization caused by underrepresented subpopulations in EHR data. It proposes an ensemble framework that trains a GAN-based synthetic data generator for each subpopulation, and it reports improved model performance on underrepresented groups.
Electronic health records (EHR) often underrepresent certain subpopulations (SPs). Factors such as patient demographics, clinical condition prevalence, and medical center type contribute to this underrepresentation. Consequently, machine learning models trained on such datasets generalize poorly and underperform on underrepresented SPs. To address this issue, we propose a novel ensemble framework that utilizes generative models. Specifically, we train a GAN-based synthetic data generator for each SP and add its synthetic samples to that SP's training set. We then train SP-specific prediction models on the augmented sets. To evaluate this method, we design an evaluation pipeline with two real-world use-case datasets queried from the MIMIC database. Our approach shows increased model performance on underrepresented SPs. Our code and models are provided as supplementary material and will be made available in a public repository.
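The control flow described in the abstract (one generator per SP, augmentation of each SP training set, one predictor per SP) can be illustrated with a minimal sketch. This is not the authors' implementation. To keep it self-contained, the GAN is replaced by a per-label Gaussian sampler and the prediction model by a nearest-centroid classifier. All function names, the `target_size` parameter, and the toy records are hypothetical and chosen only for illustration.

```python
import random
import statistics
from collections import defaultdict


def group_by_label(rows):
    """Group feature vectors by their binary outcome label."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    return by_label


def fit_generator(rows):
    """Stand-in for the per-SP GAN (assumption): fit a Gaussian per label and feature."""
    return {
        label: [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*feats)]
        for label, feats in group_by_label(rows).items()
    }


def sample_synthetic(params, n, rng):
    """Draw n synthetic (features, label) rows, cycling through the labels."""
    labels = sorted(params)
    rows = []
    for i in range(n):
        label = labels[i % len(labels)]
        rows.append(([rng.gauss(mu, sigma) for mu, sigma in params[label]], label))
    return rows


def fit_predictor(rows):
    """Stand-in for the SP-specific prediction model (assumption): nearest centroid."""
    return {
        label: [statistics.fmean(col) for col in zip(*feats)]
        for label, feats in group_by_label(rows).items()
    }


def train_ensemble(records, target_size, seed=0):
    """Train one generator and one predictor per SP; pad small SPs with synthetic rows."""
    rng = random.Random(seed)
    by_sp = defaultdict(list)
    for sp, features, label in records:
        by_sp[sp].append((features, label))

    models, sizes = {}, {}
    for sp, rows in by_sp.items():
        generator = fit_generator(rows)               # generator sees only this SP
        n_synthetic = max(0, target_size - len(rows))
        augmented = rows + sample_synthetic(generator, n_synthetic, rng)
        models[sp] = fit_predictor(augmented)         # SP-specific model
        sizes[sp] = len(augmented)
    return models, sizes


def predict_ensemble(models, sp, features):
    """Route a sample to the predictor of its own SP."""
    centroids = models[sp]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(features, centroids[label])),
    )


# Hypothetical toy records: (subpopulation, features, label)
toy = [
    ("majority", [0.0, 0.1], 0), ("majority", [0.2, 0.0], 0), ("majority", [0.1, 0.2], 0),
    ("majority", [5.0, 5.1], 1), ("majority", [5.2, 4.9], 1), ("majority", [4.9, 5.0], 1),
    ("minority", [1.0, 0.9], 0), ("minority", [1.2, 1.1], 0),
    ("minority", [4.0, 4.1], 1), ("minority", [4.2, 3.9], 1),
]
models, sizes = train_ensemble(toy, target_size=6)
print(sizes)
print(predict_ensemble(models, "minority", [4.1, 4.0]))
```

In this sketch, each generator is fitted only on its own SP, so the synthetic rows follow that SP's distribution rather than the majority's. Padding every SP to a common `target_size` is one possible augmentation policy. The abstract does not state how many synthetic samples are added per SP, so that choice is an assumption here.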