LG · Oct 23, 2023

Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models

arXiv:2310.15290v6 · 43 citations · h-index: 4
Originality Synthesis-oriented
AI Analysis

This work addresses privacy issues and data scarcity in medical research by providing a method to generate synthetic EHR time series, though it is incremental as it applies an existing diffusion model framework to a specific domain.

The study tackled the problem of generating realistic and privacy-preserving synthetic electronic health record (EHR) time series to address privacy concerns and limited data access in medical research. It introduced a method using Denoising Diffusion Probabilistic Models (DDPM) that significantly outperformed eight existing methods in data fidelity and required less training effort, with lower discriminative accuracy indicating reduced privacy risk.

Electronic health records (EHRs) are rich sources of patient-level data and a valuable resource for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR de-identification methods are flawed and can leak private information. In addition, publicly available EHR databases are limited, which holds back medical research that relies on EHRs. This study aims to overcome these challenges by efficiently generating realistic, privacy-preserving synthetic EHR time series. We introduce a new method for generating diverse and realistic synthetic EHR time series using Denoising Diffusion Probabilistic Models (DDPM). We conducted experiments on six databases: Medical Information Mart for Intensive Care III and IV (MIMIC-III/IV), the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with eight existing methods. Our results show that our approach significantly outperforms all of them in data fidelity while requiring less training effort. Additionally, data generated by our method yields a lower discriminative accuracy than the baseline methods, indicating that it carries less privacy risk. The proposed diffusion-model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.
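
For readers unfamiliar with DDPM, the forward (noising) half of the framework can be sketched in a few lines. This is a minimal illustration of the standard closed-form DDPM forward process applied to a toy batch of time series, not the paper's implementation. The function names, the linear noise schedule, the number of steps, and the array shape (patients, time steps, features) are all assumptions chosen for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the paper may use a different one."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)
    Returns both x_t and the noise eps (the usual regression target
    for the denoising network during training).
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

# Toy stand-in for EHR time series: 4 patients, 24 time steps, 3 features.
x0 = rng.standard_normal((4, 24, 3))
t = 500
xt, eps = forward_noise(x0, t, alpha_bar, rng)
```

A denoising network would be trained to predict `eps` from `xt` and `t`; sampling then runs the learned reverse process starting from pure Gaussian noise. Both of those steps are omitted here.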

Code Implementations: 1 repo
