LG · May 16, 2023

Synthetic data, real errors: how (not) to publish and use synthetic data

arXiv:2305.09235v2 · 45 citations
Originality: Incremental advance
AI Analysis

This paper addresses the unreliability of synthetic data for ML practitioners and offers an incremental improvement over the naive approach of treating synthetic data as if it were real.

The paper shows that naive use of synthetic data yields downstream models and analyses that generalize poorly to real data. It introduces Deep Generative Ensemble (DGE) to improve downstream training, evaluation, and uncertainty quantification, with the largest gains for minority classes and low-density regions.

Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach -- using synthetic data as if it is real -- leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE) -- a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the generative process model parameters. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.
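
The ensemble idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the "generator" here is a toy per-class Gaussian fitted on a bootstrap resample (an assumption standing in for independently trained deep generative models), the downstream model is a simple nearest-centroid classifier, and all function names (`fit_generator`, `sample_synthetic`, `dge_predict`, and so on) are hypothetical. The sketch only shows the overall pattern: fit several generators, train one downstream model per synthetic dataset, then average predictions and use their spread as a rough uncertainty signal.

```python
import numpy as np

def fit_generator(X, y, rng):
    """Toy stand-in for a generative model (assumption, not the paper's):
    a per-class diagonal Gaussian fitted on a bootstrap resample."""
    idx = rng.integers(0, len(X), size=len(X))
    Xb, yb = X[idx], y[idx]
    params = {}
    for c in np.unique(y):
        Xc = Xb[yb == c]
        if len(Xc) < 2:          # resample missed the class: fall back to real rows
            Xc = X[y == c]
        prior = (y == c).mean()  # keep the real class balance
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6, prior)
    return params

def sample_synthetic(params, n, rng):
    """Draw roughly n synthetic rows from the fitted toy generator."""
    Xs, ys = [], []
    for c, (mu, sigma, prior) in params.items():
        n_c = max(1, int(round(prior * n)))
        Xs.append(rng.normal(mu, sigma, size=(n_c, len(mu))))
        ys.append(np.full(n_c, c))
    return np.vstack(Xs), np.concatenate(ys)

def fit_downstream(X, y):
    """Nearest-centroid classifier: stores one centroid per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_proba(model, X):
    """Softmax over negative squared distances to the class centroids."""
    _, centroids = model
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def dge_predict(X_real, y_real, X_test, k=5, n_synth=200, seed=0):
    """Train k generators, one downstream model per synthetic dataset,
    and return the mean and standard deviation of predicted probabilities."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(k):
        gen = fit_generator(X_real, y_real, rng)
        Xs, ys = sample_synthetic(gen, n_synth, rng)
        probs.append(predict_proba(fit_downstream(Xs, ys), X_test))
    probs = np.stack(probs)  # shape: (k, n_test, n_classes)
    return probs.mean(axis=0), probs.std(axis=0)
```

In this sketch, the standard deviation across ensemble members serves as a crude proxy for the generative uncertainty the abstract refers to. The naive approach corresponds to `k=1`, where a single synthetic dataset is used as if it were real and no such spread is available.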

Code Implementations: 1 repo
