CRLGJun 1, 2022

Privacy for Free: How does Dataset Condensation Help Privacy?

arXiv:2206.00240v1156 citationsh-index: 27
Originality Incremental advance
AI Analysis

This work addresses the challenge of data leakage for machine learning practitioners by offering a method that combines efficiency and privacy, though it appears incremental as it applies an existing technique to a new problem.

The paper tackles the problem of achieving both training efficiency and data privacy in machine learning by identifying that dataset condensation, originally designed for efficiency, also provides privacy benefits. They theoretically prove limited impact of individual samples on model parameters and empirically validate privacy against membership inference attacks.

To prevent unintentional data leakage, research community has resorted to data generators that can produce differentially private data for model training. However, for the sake of the data privacy, existing solutions suffer from either expensive training cost or poor generalization performance. Therefore, we raise the question whether training efficiency and privacy can be achieved simultaneously. In this work, we for the first time identify that dataset condensation (DC) which is originally designed for improving training efficiency is also a better solution to replace the traditional data generators for private data generation, thus providing privacy for free. To demonstrate the privacy benefit of DC, we build a connection between DC and differential privacy, and theoretically prove on linear feature extractors (and then extended to non-linear feature extractors) that the existence of one sample has limited impact ($O(m/n)$) on the parameter distribution of networks trained on $m$ samples synthesized from $n (n \gg m)$ raw samples by DC. We also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both the loss-based and the state-of-the-art likelihood-based membership inference attacks. We envision this work as a milestone for data-efficient and privacy-preserving machine learning.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes