ML CY LG · Jun 15, 2021

An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises

Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres

arXiv:2106.10241v16.310 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of ensuring fairness and accurate evaluation in machine learning models trained on private synthetic data, which is crucial for data privacy and ethical AI deployment, though it is incremental in nature.

The study systematically analyzes the effects of differentially private synthetic data generation on classification, finding that while increased privacy reduces model accuracy, it does not necessarily increase bias, and synthetic data can misestimate real-world performance.

Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results shows that although there seems to be a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when the model is deployed on real data. We hence advocate for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
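
The misestimation described in the abstract can be illustrated with a minimal sketch that scores the same classifier on a synthetic test set and on a real one, then reports the difference. The abstract does not say which fairness metrics the authors use, so the demographic parity gap below is an assumption, and the function names (`accuracy`, `demographic_parity_gap`, `evaluation_gap`) and all toy labels are hypothetical illustrations, not the paper's code or data.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def demographic_parity_gap(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Largest difference in positive-prediction rate between groups.

    Assumed fairness metric; the source only says "algorithmic fairness metrics".
    """
    if len(y_pred) != len(group) or len(y_pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rates = []
    for g in set(group):
        preds = [p for p, member in zip(y_pred, group) if member == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


def evaluate(y_true, y_pred, group) -> Dict[str, float]:
    """Utility and fairness summary for one test set."""
    return {
        "accuracy": accuracy(y_true, y_pred),
        "dp_gap": demographic_parity_gap(y_pred, group),
    }


def evaluation_gap(synthetic: Dict[str, float], real: Dict[str, float]) -> Dict[str, float]:
    """Synthetic-minus-real difference per metric; nonzero means misestimation."""
    return {name: synthetic[name] - real[name] for name in synthetic}


# Made-up toy predictions from one hypothetical model on two test sets.
group = [0, 0, 1, 1]
synthetic_scores = evaluate([1, 0, 1, 0], [1, 0, 1, 0], group)
real_scores = evaluate([1, 0, 1, 0], [1, 0, 0, 0], group)
gap = evaluation_gap(synthetic_scores, real_scores)
print(gap)
```

In this toy case the synthetic test set overstates accuracy and understates the parity gap, which is the kind of discrepancy a testing protocol based on held-out real data would expose.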
