h-index30
2papers

2 Papers

LGFeb 11
Evaluation metrics for temporal preservation in synthetic longitudinal patient data

Katariina Perkonoja, Parisa Movahedi, Antti Airola et al.

This study introduces a set of metrics for evaluating temporal preservation in synthetic longitudinal patient data, defined as artificially generated data that mimic real patients' repeated measurements over time. The proposed metrics assess how synthetic data reproduces key temporal characteristics, categorized into marginal, covariance, individual-level and measurement structures. We show that strong marginal-level resemblance may conceal distortions in covariance and disruptions in individual-level trajectories. Temporal preservation is influenced by factors such as original data quality, measurement frequency, and preprocessing strategies, including binning, variable encoding and precision. Variables with sparse or highly irregular measurement times provide limited information for learning temporal dependencies, resulting in reduced resemblance between the synthetic and original data. No single metric adequately captures temporal preservation; instead, a multidimensional evaluation across all characteristics provides a more comprehensive assessment of synthetic data quality. Overall, the proposed metrics clarify how and why temporal structures are preserved or degraded, enabling more reliable evaluation and improvement of generative models and supporting the creation of temporally realistic synthetic longitudinal patient data.

MESep 21, 2023
Methods for generating and evaluating synthetic longitudinal patient data: a systematic review

Katariina Perkonoja, Kari Auranen, Joni Virta

The rapid growth in data availability has facilitated research and development, yet not all industries have benefited equally due to legal and privacy constraints. The healthcare sector faces significant challenges in utilizing patient data because of concerns about data security and confidentiality. To address this, various privacy-preserving methods, including synthetic data generation, have been proposed. Synthetic data replicate existing data as closely as possible, acting as a proxy for sensitive information. While patient data are often longitudinal, this aspect remains underrepresented in existing reviews of synthetic data generation in healthcare. This paper maps and describes methods for generating and evaluating synthetic longitudinal patient data in real-life settings through a systematic literature review, conducted following the PRISMA guidelines and incorporating data from five databases up to May 2024. Thirty-nine methods were identified, with four addressing all challenges of longitudinal data generation, though none included privacy-preserving mechanisms. Resemblance was evaluated in most studies, utility in the majority, and privacy in just over half. Only a small fraction of studies assessed all three aspects. Our findings highlight the need for further research in this area.