LG CR May 28, 2022

MC-GEN: Multi-level Clustering for Private Synthetic Data Generation

arXiv:2205.14298v2, 8 citations, h-index: 15
Originality: Incremental advance
AI Analysis

MC-GEN addresses privacy concerns for companies and research institutes that share data, but the contribution appears incremental because it builds on existing differential privacy and generative modeling techniques.

The paper tackles privacy leakage in data sharing by proposing MC-GEN, a method for generating private synthetic datasets under differential privacy. The authors report that it outperforms three existing methods in utility on multiple classification tasks.

With the development of machine learning and data science, companies and research institutes commonly share data to mitigate data scarcity. However, sharing original datasets that contain private information can cause privacy leakage. A reliable solution is to use private synthetic datasets that preserve the statistical information of the original datasets. In this paper, we propose MC-GEN, a privacy-preserving synthetic data generation method with a differential privacy guarantee for machine learning classification tasks. MC-GEN applies multi-level clustering and a differentially private generative model to improve the utility of synthetic data. In the experimental evaluation, we assess the effects of the parameters and the effectiveness of MC-GEN. The results show that MC-GEN is effective under certain privacy guarantees on multiple classification tasks. Moreover, we compare MC-GEN with three existing methods, and the results show that MC-GEN outperforms them in terms of utility.
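The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea (cluster the data, fit a noise-perturbed generative model per cluster, then sample), not the authors' implementation. Everything in it is an assumption: a two-level grouping (by class label, then a small k-means inside each class) stands in for the multi-level clustering, a Gaussian with Laplace-perturbed mean and covariance stands in for the differentially private generative model, and the names `kmeans`, `noisy_gaussian_sample`, and `generate_synthetic` are hypothetical. The noise scale is a placeholder and is not a verified sensitivity bound, so the code carries no formal privacy guarantee.

```python
import numpy as np


def kmeans(X, k, rng, iters=20):
    """Tiny k-means; returns a cluster index for every row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every row to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def noisy_gaussian_sample(X, epsilon, rng):
    """Fit a Gaussian to X, perturb its parameters, draw len(X) rows.

    Assumes features are scaled to [0, 1]. The Laplace scale below is an
    illustrative placeholder, NOT a proven sensitivity calibration.
    """
    n, d = X.shape
    scale = d / (n * epsilon)
    mean = X.mean(axis=0) + rng.laplace(0.0, scale, size=d)
    cov = np.atleast_2d(np.cov(X, rowvar=False)) if n > 1 else np.zeros((d, d))
    noise = rng.laplace(0.0, scale, size=(d, d))
    cov = cov + (noise + noise.T) / 2  # keep the matrix symmetric
    # Clip eigenvalues so the perturbed covariance is positive semidefinite.
    eigvals, eigvecs = np.linalg.eigh(cov)
    cov = (eigvecs * np.clip(eigvals, 1e-6, None)) @ eigvecs.T
    return rng.multivariate_normal(mean, cov, size=n)


def generate_synthetic(X, y, k=2, epsilon=1.0, seed=0):
    """Group rows by class, cluster within each class, sample per cluster."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    X_syn, y_syn = [], []
    for label in np.unique(y):
        X_class = X[y == label]
        clusters = kmeans(X_class, min(k, len(X_class)), rng)
        for c in np.unique(clusters):
            X_cluster = X_class[clusters == c]
            X_syn.append(noisy_gaussian_sample(X_cluster, epsilon, rng))
            y_syn.append(np.full(len(X_cluster), label))
    return np.vstack(X_syn), np.concatenate(y_syn)


# Usage on toy data: the synthetic set mirrors the original shape and
# per-class counts, and could be used to train a classifier in its place.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.random((40, 3))
y_demo = np.repeat([0, 1], 20)
X_fake, y_fake = generate_synthetic(X_demo, y_demo, k=2, epsilon=1.0)
```

In this sketch, a smaller `epsilon` adds more noise to each cluster's parameters, which is the usual privacy versus utility trade-off; clustering first keeps each Gaussian local, which is one plausible reason clustering could help utility.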

Code Implementations: 1 repo
