Private Synthetic Data Meets Ensemble Learning
This paper addresses distribution shift between synthetic and real data, a practical concern for machine learning practitioners, but the contribution is incremental because it builds on existing ensemble and DP methods.
The paper tackles the performance drop that occurs when models trained on synthetic data are deployed on real data. It introduces an ensemble strategy that uses multiple differentially private synthetic datasets and finds that ensembling improves accuracy and calibration for GAN-based mechanisms but not for marginal-based or workload-based ones.
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop due to the distribution shift between synthetic and real data. In this paper, we introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data. We generate multiple synthetic datasets by applying a differential privacy (DP) mechanism several times in parallel and then ensemble the downstream models trained on these datasets. While each synthetic dataset might deviate more from the real data distribution, they collectively increase sample diversity. This may enhance the robustness of downstream models against distribution shifts. Our extensive experiments reveal that while ensembling does not enhance downstream performance (compared with training a single model) for models trained on synthetic data generated by marginal-based or workload-based DP mechanisms, our proposed ensemble strategy does improve the performance for models trained using GAN-based DP mechanisms in terms of both accuracy and calibration of downstream models.
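The pipeline described above (generate several synthetic datasets, train one downstream model per dataset, ensemble the models) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. `placeholder_synthesizer` is a hypothetical stand-in that resamples rows and adds Gaussian noise; it provides no privacy guarantee and would need to be replaced by a real DP mechanism. The nearest-centroid downstream model, the choice to ensemble by averaging predicted class probabilities, and the toy data are likewise assumptions made for illustration.

```python
import math
import random
from statistics import fmean


def placeholder_synthesizer(real_rows, seed, noise=0.5):
    """Hypothetical stand-in for a DP mechanism (NOT differentially private).

    Resamples rows with replacement and perturbs each feature with
    Gaussian noise, so that different seeds yield different datasets.
    """
    rng = random.Random(seed)
    sampled = rng.choices(real_rows, k=len(real_rows))
    return [
        ([v + rng.gauss(0.0, noise) for v in features], label)
        for features, label in sampled
    ]


def train_centroid_model(rows):
    """Fit a nearest-centroid classifier; return a function x -> {label: prob}."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: [fmean(column) for column in zip(*feats)]
        for label, feats in by_label.items()
    }

    def predict_proba(x):
        # Softmax over negative squared distances to each class centroid.
        scores = {
            label: -sum((a - b) ** 2 for a, b in zip(x, centroid))
            for label, centroid in centroids.items()
        }
        top = max(scores.values())
        exps = {label: math.exp(s - top) for label, s in scores.items()}
        total = sum(exps.values())
        return {label: e / total for label, e in exps.items()}

    return predict_proba


def ensemble_predict_proba(models, x):
    """Average class probabilities over all ensemble members."""
    member_probs = [model(x) for model in models]
    labels = set().union(*(p.keys() for p in member_probs))
    return {
        label: fmean(p.get(label, 0.0) for p in member_probs)
        for label in labels
    }


# Toy "real" data: two well-separated classes in two dimensions.
real = [([0.1 * i, 0.0], 0) for i in range(20)]
real += [([5.0 + 0.1 * i, 5.0], 1) for i in range(20)]

# One synthetic dataset and one downstream model per seed.
models = [
    train_centroid_model(placeholder_synthesizer(real, seed=s))
    for s in range(5)
]
probs = ensemble_predict_proba(models, [5.5, 5.0])
```

Averaging probabilities rather than taking a majority vote keeps the ensemble output usable for calibration analysis, which is one of the two metrics the abstract reports. Other combination rules are possible.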