LG · CY · ML · Mar 17, 2020

Generating Electronic Health Records with Multiple Data Types and Constraints

arXiv:2003.07904v2 · 44 citations
AI Analysis

This paper addresses privacy concerns in healthcare data sharing by providing a more comprehensive simulation method, though the contribution is incremental because it builds on existing GAN frameworks.

The paper tackles the problem of generating realistic electronic health records (EHRs) with multiple data types and feature constraints to mitigate privacy risks. Evaluated on over 770,000 EHRs from Vanderbilt University Medical Center, the method better retains the statistics, correlations, and patterns of the real data without sacrificing privacy.

Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures or vital signs) and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints and associated patterns from real data, without sacrificing privacy.
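
The abstract names several utility criteria (basic statistics, cross-feature correlations, feature constraints) without defining them here. As a rough illustration only, the sketch below shows one way such checks could be computed on binary record matrices. The function names, the toy data, and the example constraint (a female-only code appearing on a male record) are all assumptions made for illustration; they are not taken from the paper and are not its actual measures.

```python
import numpy as np

def dimension_wise_gap(real, synth):
    """Mean absolute difference between per-feature means (a basic-statistics check)."""
    return float(np.mean(np.abs(real.mean(axis=0) - synth.mean(axis=0))))

def correlation_gap(real, synth):
    """Mean absolute difference between the Pearson correlation matrices of the two datasets."""
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synth, rowvar=False)
    return float(np.mean(np.abs(real_corr - synth_corr)))

def constraint_violation_rate(records, is_male_col, female_only_col):
    """Share of records breaking a hypothetical rule: a female-only code on a male record."""
    violations = (records[:, is_male_col] == 1) & (records[:, female_only_col] == 1)
    return float(violations.mean())

# Toy data (hypothetical). Columns: is_male, female_only_code, other_code.
real = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [0, 1, 1]])
synth = np.array([[1, 1, 1],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])

print("dimension-wise gap:", dimension_wise_gap(real, synth))
print("correlation gap:", correlation_gap(real, synth))
print("violations (real):", constraint_violation_rate(real, 0, 1))
print("violations (synthetic):", constraint_violation_rate(synth, 0, 1))
```

In this toy case the per-feature means match exactly, yet the synthetic records lose the correlation between the first two columns and contain a record that violates the constraint. This is why checks beyond basic statistics are worth reporting.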

