Bayesian Pseudo Posterior Mechanism for Differentially Private Machine Learning
This work addresses privacy protection in machine learning for applications such as official statistics and reports improved performance on imbalanced data, though the contribution is incremental because it builds on existing DP methods.
The paper tackles the difficulty that differential privacy (DP) mechanisms have with real-world distributions such as imbalanced datasets. It proposes SWAG-PPM, a scalable DP mechanism for deep learning that uses a pseudo posterior distribution to downweight each record's likelihood contribution in proportion to its disclosure risk. On a workplace injury text classification task, SWAG-PPM shows only modest utility degradation relative to a non-private comparator and outperforms DP-SGD at a similar privacy budget.
Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real-world distributions, including highly imbalanced or small labeled training sets. In this work, we propose SWAG-PPM, a new scalable DP mechanism for deep learning models. Its randomized mechanism is a pseudo posterior distribution that downweights by-record likelihood contributions in proportion to their disclosure risks. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.
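To make the downweighting idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a logistic regression likelihood, a Gaussian prior, and a hypothetical weighting rule that maps each record's absolute log-likelihood (used here as a stand-in for disclosure risk) linearly onto [0, 1]. All function names and the weighting rule are illustrative assumptions. The sketch does not cover how SWAG-PPM scales this to deep models or how the privacy guarantee is calculated.

```python
import numpy as np


def per_record_log_lik(theta, X, y):
    """Bernoulli log-likelihood of each record under a logistic model."""
    logits = X @ theta
    # log p(y | x, theta) = y * logit - log(1 + exp(logit)), computed stably.
    return y * logits - np.logaddexp(0.0, logits)


def risk_weights(log_lik):
    """Hypothetical rule (an assumption, not taken from the paper):
    treat |log-likelihood| as the disclosure risk and map it linearly
    onto [0, 1], so the riskiest record gets weight 0 and the least
    risky record gets weight 1."""
    risk = np.abs(log_lik)
    span = risk.max() - risk.min()
    if span == 0.0:
        return np.ones_like(risk)
    return 1.0 - (risk - risk.min()) / span


def pseudo_log_posterior(theta, X, y, alpha, prior_scale=1.0):
    """Unnormalized log pseudo posterior:
    log prior + sum_i alpha_i * log p(y_i | x_i, theta)."""
    log_prior = -0.5 * np.sum(theta ** 2) / prior_scale ** 2
    return log_prior + np.sum(alpha * per_record_log_lik(theta, X, y))


# Toy imbalanced data: few positive labels, as in the motivating setting.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < 0.05).astype(float)
theta = np.zeros(3)

alpha = risk_weights(per_record_log_lik(theta, X, y))
print(pseudo_log_posterior(theta, X, y, alpha))
```

Setting every weight to 1 recovers the ordinary log posterior, so the weights are the only place where this sketch departs from a standard Bayesian model. Because each log-likelihood term is non-positive and each weight is at most 1, the weighted sum never falls below the unweighted one.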