Differentially Private Synthetic Mixed-Type Data Generation For Unsupervised Learning
This work addresses the need for privacy-preserving data sharing in unsupervised learning, particularly for mixed-type data, though the contribution is incremental in that it combines existing autoencoder and GAN methods.
The authors tackle the problem of generating synthetic data that preserves the statistical properties of sensitive datasets while ensuring differential privacy, and report performance competitive with existing private algorithms on a binary dataset (MIMIC-III) and a mixed-type dataset (ADULT).
We introduce the DP-auto-GAN framework for synthetic data generation, which combines the low-dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). The framework takes raw sensitive data as input and privately trains a model that generates synthetic data with statistical properties similar to those of the original data. The learned model can generate an arbitrary amount of synthetic data, which can then be freely shared due to the post-processing guarantee of differential privacy. Our framework is applicable to unlabeled mixed-type data, which may include binary, categorical, and real-valued features. We implement this framework on both binary data (MIMIC-III) and mixed-type data (ADULT), and compare its performance with existing private algorithms on metrics in unsupervised settings. We also introduce a new quantitative metric that detects diversity, or the lack thereof, in synthetic data.
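To make the generation step concrete, the following is a minimal sketch of how a trained model of this kind might emit mixed-type records. It is not the authors' implementation. It assumes, as one plausible reading of the abstract, that a generator maps noise to a low-dimensional latent code and a decoder maps that code to a row with binary, categorical, and real-valued parts. The layer sizes, the tanh/sigmoid/softmax choices, and all names (`sample_row`, `gen_w`, `dec_w`) are hypothetical, and the weights are random stand-ins for parameters that would be learned under differential privacy.

```python
import math
import random

# Hypothetical layer sizes, chosen only for illustration.
NOISE_DIM, LATENT_DIM = 4, 3
N_BINARY, N_CATEGORIES, N_REAL = 2, 3, 1
OUT_DIM = N_BINARY + N_CATEGORIES + N_REAL


def random_matrix(rows, cols, rng):
    """Stand-in for trained weights: a rows x cols matrix of small random values."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_row(gen_w, dec_w, rng):
    """Generate one synthetic record: noise -> latent code -> mixed-type row."""
    z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    # Assumed generator: one linear layer with tanh, output lives in latent space.
    latent = [math.tanh(v) for v in matvec(gen_w, z)]
    # Assumed decoder: one linear layer producing pre-activations per feature.
    out = matvec(dec_w, latent)
    # Binary features: Bernoulli draw from a sigmoid probability.
    binary = [1 if rng.random() < sigmoid(v) else 0 for v in out[:N_BINARY]]
    # One categorical feature: draw a class index from softmax probabilities.
    probs = softmax(out[N_BINARY:N_BINARY + N_CATEGORIES])
    category = rng.choices(range(N_CATEGORIES), weights=probs)[0]
    # Real-valued features: passed through unchanged.
    real = out[N_BINARY + N_CATEGORIES:]
    return binary + [category] + real


rng = random.Random(0)
gen_w = random_matrix(LATENT_DIM, NOISE_DIM, rng)
dec_w = random_matrix(OUT_DIM, LATENT_DIM, rng)
synthetic = [sample_row(gen_w, dec_w, rng) for _ in range(5)]
```

Because sampling only reads the model parameters and fresh noise, never the sensitive data, any number of rows can be drawn this way without further privacy cost, which is the post-processing property the abstract relies on.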