LG CR ML Jun 20, 2025

Private Training & Data Generation by Clustering Embeddings

Felix Zhou, Samson Zhou, Vahab Mirrokni, Alessandro Epasto, Vincent Cohen-Addad

arXiv:2506.16661v1 h-index: 61

Originality: Incremental advance

AI Analysis

The paper addresses privacy concerns in machine learning applications that handle sensitive data. The advance is incremental, since it builds on existing DP synthetic data approaches.

The paper tackles the problem of training deep neural networks on sensitive data while preserving privacy, by introducing a method for differentially private synthetic image embedding generation using Gaussian Mixture Models, achieving state-of-the-art classification accuracy on benchmark datasets.

Deep neural networks often use large, high-quality datasets to achieve high performance on many machine learning tasks. When training involves potentially sensitive data, this process can raise privacy concerns, as large models have been shown to unintentionally memorize and reveal sensitive information, including reconstructing entire training samples. Differential privacy (DP) provides a robust framework for protecting individual data. In particular, a new approach to privately training deep neural networks is to approximate the input dataset with a privately generated synthetic dataset before running any subsequent training algorithm. We introduce a novel principled method for DP synthetic image embedding generation, based on fitting a Gaussian Mixture Model (GMM) in an appropriate embedding space using DP clustering. Our method provably learns a GMM under separation conditions. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy on standard benchmark datasets. Additionally, we demonstrate that our method can generate realistic synthetic images that achieve downstream classification accuracy comparable to SOTA methods. Our method is quite general, as the encoder and decoder modules can be freely substituted to suit different tasks. It is also highly scalable, consisting only of subroutines that scale linearly with the number of samples and/or can be implemented efficiently in distributed systems.
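To make the idea of fitting a mixture in embedding space and sampling synthetic embeddings concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes that cluster labels have already been produced by some DP clustering routine, that each component is a spherical Gaussian with a fixed standard deviation, and that `sigma` is a noise multiplier chosen elsewhere to meet a privacy budget. All function and parameter names (`clip_rows`, `noisy_cluster_stats`, `sample_mixture`, `max_norm`, `sigma`, `std`) are hypothetical, and the code makes no formal privacy claim.

```python
import numpy as np

def clip_rows(x, max_norm):
    """Scale each row so that its L2 norm is at most max_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

def noisy_cluster_stats(emb, labels, k, max_norm, sigma, rng):
    """Return noisy per-cluster counts and means.

    Assumption: `labels` come from a DP clustering step that is not shown.
    Gaussian noise is added to clipped sums (std sigma * max_norm) and to
    counts (std sigma); calibrating sigma to (epsilon, delta) is out of scope.
    """
    emb = clip_rows(np.asarray(emb, dtype=float), max_norm)
    labels = np.asarray(labels)
    d = emb.shape[1]
    counts = np.zeros(k)
    sums = np.zeros((k, d))
    for j in range(k):
        members = emb[labels == j]
        counts[j] = len(members)
        sums[j] = members.sum(axis=0)
    # Floor the noisy counts at 1 to avoid dividing by zero or a negative.
    noisy_counts = np.maximum(counts + rng.normal(0.0, sigma, k), 1.0)
    noisy_sums = sums + rng.normal(0.0, sigma * max_norm, (k, d))
    return noisy_counts, noisy_sums / noisy_counts[:, None]

def sample_mixture(weights, means, std, n, rng):
    """Draw n synthetic embeddings from a spherical Gaussian mixture."""
    weights = np.asarray(weights, dtype=float)
    comp = rng.choice(len(weights), size=n, p=weights / weights.sum())
    samples = means[comp] + rng.normal(0.0, std, (n, means.shape[1]))
    return samples, comp
```

In a pipeline like the one the abstract describes, the sampled embeddings and their component indices could then be passed to a small classifier or to a decoder. Those stages are omitted here because the abstract does not specify them.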
