Generating private data with user customization
This work gives individuals a way to control the privacy of the data they contribute to machine learning models; it is an incremental improvement in the field of privacy-preserving machine learning.
This paper addresses the challenge of privatizing user data from personal devices for machine learning while allowing user customization. The authors propose a two-stage approach: first, a VAE creates a fixed latent representation, and then a small generative filter, trained via GAN-type robust optimization on distributed devices, perturbs this representation based on user preferences. This method aims to reduce correlation with private information while retaining utility.
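The second stage described above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the latent codes are already available (random Gaussian vectors stand in for VAE outputs), that the filter is the linear map `z_tilde = z + z @ W`, that the adversary is a logistic regression on the private label, and that utility is protected only by a quadratic distortion penalty weighted by `lam`. The names `train_linear_filter`, `apply_filter`, `lam`, `lr`, and `steps` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_linear_filter(z, y_private, lam=0.5, lr=0.05, steps=300):
    """Alternating min-max updates on fixed latent codes z of shape (n, d).

    Assumed adversary: logistic regression (w, b) predicting y_private.
    Assumed filter:    z_tilde = z + z @ W, with penalty lam * mean ||z @ W||^2.
    """
    n, d = z.shape
    W = np.zeros((d, d))   # filter weights (zero means "no perturbation")
    w = np.zeros(d)        # adversary weights
    b = 0.0                # adversary bias
    for _ in range(steps):
        z_tilde = z + z @ W

        # Adversary step: gradient descent on its cross-entropy loss.
        err = sigmoid(z_tilde @ w + b) - y_private
        w -= lr * (z_tilde.T @ err) / n
        b -= lr * err.mean()

        # Filter step: gradient ascent on the adversary's loss,
        # gradient descent on the distortion penalty.
        err = sigmoid(z_tilde @ w + b) - y_private
        grad_bce_W = z.T @ np.outer(err, w) / n
        grad_dist_W = 2.0 * z.T @ (z @ W) / n
        W -= lr * (-grad_bce_W + lam * grad_dist_W)
    return W, w, b

def apply_filter(z, W):
    """Perturb latent codes with the learned linear filter."""
    return z + z @ W

# Toy usage: synthetic stand-ins for latent codes and a private label.
rng = np.random.default_rng(0)
z = rng.normal(size=(512, 4))
y_private = (z[:, 0] > 0).astype(float)
W, w, b = train_linear_filter(z, y_private)
z_tilde = apply_filter(z, W)
```

Because only the small matrix `W` is trained while the encoder stays fixed, the per-user cost in this sketch is a few matrix products per step, which matches the motivation of running the filter on a phone or tablet. Alternating updates of this kind can oscillate, so the sketch makes no claim about convergence.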
Personal devices such as mobile phones can produce and store large amounts of data that can enhance machine learning models; however, this data may contain private information specific to the data owner that prevents its release. We want to reduce the correlation between user-specific private information and the data while retaining the useful information. Rather than training a large model to achieve privatization end to end, we decouple the process into two steps: we first create a latent representation and then privatize it, which allows user-specific privatization to occur in a setting with limited computation and with minimal disturbance to the utility of the data. We leverage a Variational Autoencoder (VAE) to create a compact latent representation of the data that remains fixed for all devices and all possible private labels. We then train a small generative filter to perturb the latent representation based on user-specified preferences regarding the private and utility information. The small filter is trained via a GAN-type robust optimization that can take place on a distributed device such as a phone or tablet. Under special conditions of our linear filter, we show the connections between our generative approach and Rényi differential privacy. We conduct experiments on multiple datasets including MNIST, UCI-Adult, and CelebA, and give a thorough evaluation, including visualizing the geometry of the latent embeddings and estimating the empirical mutual information, to show the effectiveness of our approach.
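As background for the Rényi differential privacy claim, the standard definitions are recalled below. The abstract does not state the special conditions on the linear filter, so the additive Gaussian noise model here is only an assumption about the kind of setting in which such a connection usually arises. The symbols $\alpha$, $\varepsilon$, $\sigma$, $\mu_0$, $\mu_1$, and $M$ are introduced for illustration.

```latex
% Renyi divergence of order \alpha > 1 between distributions P and Q
D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1}
  \log \mathbb{E}_{x \sim Q}\!\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]

% A randomized mechanism M satisfies (\alpha, \varepsilon)-Renyi differential
% privacy if, for all adjacent inputs x and x',
D_\alpha\big(M(x) \,\|\, M(x')\big) \le \varepsilon

% Closed form for two Gaussians with a shared isotropic covariance
D_\alpha\big(\mathcal{N}(\mu_0, \sigma^2 I) \,\|\, \mathcal{N}(\mu_1, \sigma^2 I)\big)
  = \frac{\alpha \,\lVert \mu_0 - \mu_1 \rVert_2^2}{2\sigma^2}
```

Under that assumed noise model, a filter that adds isotropic Gaussian noise of scale $\sigma$ to a latent code has a Rényi divergence between its outputs on two inputs that depends only on the distance between the noiseless outputs, so bounding that distance bounds $\varepsilon$.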