LG AI CR CY · Sep 23, 2021

Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data

Georgi Ganev, Bristena Oprisanu, Emiliano De Cristofaro

arXiv:2109.11429v3 · 22.080 citations · h-index: 48

Originality: Incremental advance

AI Analysis

This highlights a fairness issue in privacy-preserving data generation, affecting users relying on synthetic data for unbiased model training.

The paper analyzes how Differential Privacy (DP) in generative models affects underrepresented classes in synthetic data, showing that DP can either reduce or increase the gap between majority and minority classes, leading to disparate impacts on classification accuracy.

Generative models trained with Differential Privacy (DP) can be used to generate synthetic data while minimizing privacy risks. We analyze the impact of DP on these models vis-a-vis underrepresented classes/subgroups of data, specifically, studying: 1) the size of classes/subgroups in the synthetic data, and 2) the accuracy of classification tasks run on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our analysis uses three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN) and shows that DP yields opposite size distributions in the generated synthetic data. It affects the gap between the majority and minority classes/subgroups; in some cases by reducing it (a "Robin Hood" effect) and, in others, by increasing it (a "Matthew" effect). Either way, this leads to (similar) disparate impacts on the accuracy of classification tasks on the synthetic data, affecting disproportionately more the underrepresented subparts of the data. Consequently, when training models on synthetic data, one might incur the risk of treating different subpopulations unevenly, leading to unreliable or unfair conclusions.
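The size effect described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the abstract does not state which metric the paper uses, so the gap measure here (largest class share minus smallest class share), the tolerance `tol`, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def majority_minority_gap(labels, classes):
    """Gap between the largest and smallest class share (assumed metric).

    `classes` is passed explicitly so that a class missing from the
    synthetic data counts as a share of 0 instead of being ignored.
    """
    counts = Counter(labels)
    total = len(labels)
    shares = [counts.get(c, 0) / total for c in classes]
    return max(shares) - min(shares)


def size_effect(real_labels, synthetic_labels, tol=0.01):
    """Label the change in class imbalance from real to synthetic data.

    Returns "Robin Hood" if the gap shrinks, "Matthew" if it grows,
    and "unchanged" if it stays within the (hypothetical) tolerance.
    """
    classes = sorted(set(real_labels))
    real_gap = majority_minority_gap(real_labels, classes)
    synth_gap = majority_minority_gap(synthetic_labels, classes)
    if synth_gap < real_gap - tol:
        return "Robin Hood"
    if synth_gap > real_gap + tol:
        return "Matthew"
    return "unchanged"


# Toy labels, not data from the paper: 90% majority, 10% minority.
real = ["majority"] * 90 + ["minority"] * 10
flattened = ["majority"] * 70 + ["minority"] * 30   # gap shrinks
sharpened = ["majority"] * 99 + ["minority"] * 1    # gap grows

print(size_effect(real, flattened))
print(size_effect(real, sharpened))
```

The same comparison could be run per subgroup instead of per class label. The accuracy side of the analysis would additionally require training a classifier on the synthetic data and scoring it separately on each class, which this sketch does not cover.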
