12.8 LG Apr 17
Evaluating quality in synthetic data generation for large tabular health datasets
Jean-Baptiste Escudié, Benjamin Barnes, Stefan Meisegeier et al.
There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries' epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational framework for guiding the selection of synthesizers and remain accessible to all stakeholders involved in releasing synthetic datasets.
CR Mar 6, 2025
A Consensus Privacy Metrics Framework for Synthetic Data
Lisa Pilgram, Fida K. Dankar, Jorg Drechsler et al.
Synthetic data generation is one approach for sharing individual-level data. However, to meet legislative requirements, it is necessary to demonstrate that the individuals' privacy is adequately protected. There is no consolidated standard for measuring privacy in synthetic data. Through an expert panel and consensus process, we developed a framework for evaluating privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure, and their use is discouraged. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information about an individual without necessarily revealing their identity. The resultant framework provides precise recommendations for metrics that address these types of disclosures effectively. Our findings further present specific opportunities for future research that can help with widespread adoption of synthetic data.
ME May 12, 2023
Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results
Lisa Kühnel, Julian Schneider, Ines Perrar et al.
Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data, i.e. data generated through a randomised process that have statistical properties similar to those of the original data, but do not have a one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods, but also by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.