CR, Dec 30, 2020

PrivSyn: Differentially Private Data Synthesis

Zhikun Zhang, Tianhao Wang, Ninghui Li, Jean Honorio, Michael Backes, Shibo He, Jiming Chen, Yang Zhang

arXiv:2012.15128v128.338 citations, h-index: 78

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating privacy-preserving synthetic data for general tabular datasets, which is crucial for data sharing and analysis without compromising individual privacy.

This paper introduces PrivSyn, a method for generating differentially private synthetic datasets from general tabular data, even with high dimensionality (100 attributes) and large domain sizes ($>2^{500}$). The method automatically identifies correlations in the data and generates samples from a dense graphical model.

In differential privacy (DP), a challenging problem is to generate synthetic datasets that efficiently capture the useful information in the private data. A synthetic dataset allows any task to be performed without privacy concerns and without modifying existing algorithms. In this paper, we present PrivSyn, the first automatic synthetic data generation method that can handle general tabular datasets (with 100 attributes and domain size $>2^{500}$). PrivSyn is composed of a new method to automatically and privately identify correlations in the data, and a novel method to generate sample data from a dense graphical model. We extensively evaluate different methods on multiple datasets to demonstrate the performance of our method.
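
The abstract does not spell out how correlations are measured privately. As a generic illustration only, and not PrivSyn's actual algorithm, the sketch below publishes a noisy two-way contingency table for a pair of attributes using the Gaussian mechanism under zero-concentrated DP, where noise with standard deviation $\sigma = \Delta / \sqrt{2\rho}$ gives $\rho$-zCDP for L2 sensitivity $\Delta$. The function name `noisy_two_way_marginal`, the toy records, and the add/remove-one-record neighboring definition (which gives $\Delta = 1$) are all assumptions made for this example.

```python
import math
import random
from collections import Counter
from itertools import product


def noisy_two_way_marginal(records, attr_a, attr_b, domain_a, domain_b, rho, rng):
    """Hypothetical helper: noisy counts for every (attr_a, attr_b) cell.

    Assumes neighboring datasets differ by adding or removing one record,
    so exactly one cell changes by 1 and the L2 sensitivity is 1.
    """
    # Exact count of each value pair that appears in the data.
    counts = Counter((r[attr_a], r[attr_b]) for r in records)
    # Gaussian mechanism calibrated for rho-zCDP with sensitivity 1.
    sigma = 1.0 / math.sqrt(2.0 * rho)
    # Noise is added to every cell of the full domain, including empty ones,
    # so the set of published cells does not depend on the private data.
    return {
        cell: counts.get(cell, 0) + rng.gauss(0.0, sigma)
        for cell in product(domain_a, domain_b)
    }


# Toy data, invented for illustration.
records = [
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
]
table = noisy_two_way_marginal(
    records, "age", "income",
    domain_a=["young", "old"], domain_b=["low", "high"],
    rho=0.5, rng=random.Random(0),
)
for cell, value in sorted(table.items()):
    print(cell, round(value, 2))
```

A smaller `rho` means stronger privacy and noisier counts. Any quantity derived only from the noisy table, such as a score of how strongly two attributes are correlated, inherits the same privacy guarantee by post-processing.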
