LG · AI · Dec 18, 2020

EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Jimeng Sun

arXiv:2012.10020v1 13.659 citations

Originality: Incremental advance

AI Analysis

This work addresses the critical need for realistic, privacy-preserving longitudinal EHR data for machine learning research in healthcare, benefiting researchers and health systems.

The paper introduces EVA, a conditional variational autoencoder, to synthesize realistic longitudinal electronic health records (EHRs) for research while preserving patient privacy. EVA can generate EHR sequences, account for individual patient differences, and be conditioned on specific disease conditions. The generated synthetic EHRs were found to be realistic by clinicians and improved predictive model performance by up to 8% in top-20 recall when used for data augmentation.

Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and their volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance, improving the best baseline by as much as 8% in top-20 recall.
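
The abstract reports its headline result in top-20 recall but does not define the metric here. As a minimal sketch, assuming the common definition (the fraction of a patient's true codes that appear among the k highest-scored predictions, averaged over examples), it could be computed as follows. The function names, the dictionary-of-scores input, and the sample codes are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def top_k_recall(scores: Dict[str, float], true_codes: Set[str], k: int = 20) -> float:
    """Fraction of true codes found among the k highest-scored codes.

    Assumed definition; the paper may compute the metric differently.
    """
    if not true_codes:
        raise ValueError("true_codes must be non-empty")
    # Rank candidate codes by predicted score, highest first, and keep the top k.
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(true_codes.intersection(ranked)) / len(true_codes)


def mean_top_k_recall(examples: List[Tuple[Dict[str, float], Set[str]]], k: int = 20) -> float:
    """Average top-k recall over (scores, true_codes) pairs."""
    if not examples:
        raise ValueError("examples must be non-empty")
    recalls = [top_k_recall(scores, truth, k) for scores, truth in examples]
    return sum(recalls) / len(recalls)


# Hypothetical example: three candidate codes, two of which are true.
example_scores = {"code_a": 0.9, "code_b": 0.8, "code_c": 0.1}
example_truth = {"code_a", "code_c"}
recall_at_2 = top_k_recall(example_scores, example_truth, k=2)
recall_at_3 = top_k_recall(example_scores, example_truth, k=3)
```

In this toy case only one of the two true codes ranks in the top two, so recall at k=2 is one half, while k=3 covers every candidate and recovers both. Whether the reported 8% is a relative or absolute gain is not stated in the abstract.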
