LG · CR · Dec 1, 2022

Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control

Tânia Carvalho, Nuno Moniz, Luís Antunes, Nitesh Chawla

arXiv:2212.00484v3 · 6.96 citations · h-index: 75 · Has Code

Originality: Incremental advance

AI Analysis

The work addresses privacy protection for user data, particularly in high-risk scenarios, but it is incremental because it builds on existing synthetic data and differential privacy methods.

The paper tackles the problem of protecting user data against re-identification and linkage attacks. It proposes ε-PrivateSMOTE, a technique that combines synthetic data generation with differential privacy to obfuscate high-risk cases. The method achieves competitive privacy risk and better predictive performance than existing methods while improving time requirements by at least a factor of 9.

Protecting user data privacy can be achieved via many methods, from statistical transformations to generative models. However, all of them have critical drawbacks. For example, creating a transformed data set using traditional techniques is highly time-consuming. Also, recent deep learning-based solutions require significant computational resources in addition to long training phases, and differentially private-based solutions may undermine data utility. In this paper, we propose $ε$-PrivateSMOTE, a technique designed for safeguarding against re-identification and linkage attacks, particularly addressing cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate how $ε$-PrivateSMOTE is capable of achieving competitive results in privacy risk and better predictive performance when compared to multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show how our method improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.
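The abstract's core idea, replacing only high-risk records with noise-perturbed interpolations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the k-anonymity-style rule for flagging high-risk rows, the SMOTE-like interpolation step, and the Laplace scale of 1/ε (i.e. an assumed sensitivity of 1) are all assumptions made for illustration, and the sketch claims no formal differential privacy guarantee.

```python
import math
import random
from collections import Counter


def high_risk_indices(rows, quasi_identifiers, k=3):
    """Flag rows whose quasi-identifier combination is shared by fewer than k rows.

    Assumption: a k-anonymity-style rule stands in for 'high re-identification risk'.
    """
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def nearest_neighbours(rows, i, features, n_neighbours):
    """Indices of the rows closest to row i (Euclidean distance on `features`)."""
    def dist(j):
        return math.sqrt(sum((rows[i][f] - rows[j][f]) ** 2 for f in features))

    others = [j for j in range(len(rows)) if j != i]
    return sorted(others, key=dist)[:n_neighbours]


def synthesise(rows, quasi_identifiers, epsilon, k=3, n_neighbours=2, seed=0):
    """Return a copy of `rows` where each high-risk row is replaced by a
    noise-perturbed interpolation towards one of its nearest neighbours."""
    rng = random.Random(seed)
    risky = set(high_risk_indices(rows, quasi_identifiers, k))
    out = []
    for i, row in enumerate(rows):
        new = dict(row)  # low-risk rows are copied unchanged
        if i in risky:
            j = rng.choice(nearest_neighbours(rows, i, quasi_identifiers, n_neighbours))
            for f in quasi_identifiers:
                # SMOTE-style step in [0, 1) plus Laplace noise; the scale 1/epsilon
                # assumes sensitivity 1, so a smaller epsilon means more noise.
                step = rng.random() + laplace_noise(rng, 1.0 / epsilon)
                new[f] = row[f] + step * (rows[j][f] - row[f])
        out.append(new)
    return out
```

Calling `synthesise(rows, ["age", "zip"], epsilon=1.0, k=2)` on a list of dicts with numeric `age` and `zip` columns leaves rows in groups of at least two untouched and rewrites only the quasi-identifiers of the remaining rows. Note that a flagged row whose chosen neighbour has identical quasi-identifier values is left unchanged by this sketch, because the interpolation direction is zero.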
