CR | LG | Feb 18, 2025

Does Training with Synthetic Data Truly Protect Privacy?

arXiv:2502.12976v1 | 13 citations | h-index: 1 | ICLR
Originality: Synthesis-oriented
AI Analysis

This work highlights a critical gap in privacy protection for machine learning practitioners using synthetic data, cautioning against overreliance on unverified claims.

The paper investigates whether training with synthetic data effectively protects privacy by examining four paradigms, finding that they yield inconsistent privacy outcomes and warning that empirical methods without rigorous evaluation can create false security.

As synthetic data becomes increasingly popular in machine learning tasks, numerous methods that lack formal differential privacy guarantees use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

Code Implementations: 1 repo
