Joshua E. Allen

h-index 41

6 papers

179 citations

Novelty 51%

AI Score 29

Ranked #144,662 of 194,257 authors (top 74%); #3,778 in CR (top 56%)

6 Papers

4.6 LG Apr 27, 2022

Spending Privacy Budget Fairly and Wisely

Lucas Rosenblatt, Joshua Allen, Julia Stoyanovich

Differentially private (DP) synthetic data generation is a practical method for improving access to data as a means to encourage productive partnerships. One issue inherent to DP is that the "privacy budget" is generally "spent" evenly across features in the data set. This leads to good statistical parity with the real data, but can undervalue the conditional probabilities and marginals that are critical for predictive quality of synthetic data. Further, loss of predictive quality may be non-uniform across the data set, with subsets that correspond to minority groups potentially suffering a higher loss. In this paper, we develop ensemble methods that distribute the privacy budget "wisely" to maximize predictive accuracy of models trained on DP data, and "fairly" to bound potential disparities in accuracy across groups and reduce inequality. Our methods are based on the insights that feature importance can inform how privacy budget is allocated, and, further, that per-group feature importance and fairness-related performance objectives can be incorporated in the allocation. These insights make our methods tunable to social contexts, allowing data owners to produce balanced synthetic data for predictive analysis.
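The idea that feature importance can guide how the privacy budget is split can be illustrated with a minimal sketch. This is not the authors' ensemble method: the function name `allocate_budget`, the proportional rule, and the example importances are assumptions made for illustration. The only fact it relies on is sequential composition, under which per-feature budgets that sum to the total keep the overall guarantee at the total epsilon.

```python
def allocate_budget(total_epsilon, importances):
    """Split a total privacy budget across features in proportion to importance.

    `importances` maps a feature name to a non-negative importance score
    (hypothetical values, e.g. taken from a model trained on public data).
    The returned per-feature epsilons sum to `total_epsilon`.
    """
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if not importances:
        raise ValueError("importances must not be empty")
    if any(w < 0 for w in importances.values()):
        raise ValueError("importance scores must be non-negative")

    total_weight = sum(importances.values())
    if total_weight == 0:
        # No importance signal: fall back to the usual even split.
        share = total_epsilon / len(importances)
        return {name: share for name in importances}

    return {
        name: total_epsilon * weight / total_weight
        for name, weight in importances.items()
    }


# Hypothetical example: "age" is three times as important as "zip".
budget = allocate_budget(1.0, {"age": 3.0, "zip": 1.0})
```

A larger epsilon means less noise for that feature, so the more important feature is preserved more faithfully. A per-group variant could, under the same assumptions, run the allocation once per group with group-specific importance scores.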

8.7 LG May 9, 2022

Evaluating the Fairness Impact of Differentially Private Synthetic Data

Blake Bullwinkel, Kristen Grabarz, Lily Ke et al.

Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. Due to the suppression of underrepresented classes that is often required to achieve privacy, however, it may be in conflict with fairness. We evaluate four DP synthesizers and present empirical results indicating that three of these models frequently degrade fairness outcomes on downstream binary classification tasks. We draw a connection between fairness and the proportion of minority groups present in the generated synthetic data, and find that training synthesizers on data that are pre-processed via a multi-label undersampling method can promote more fair outcomes without degrading accuracy.

8.8 CR Mar 24, 2021 Code

U.S. Broadband Coverage Data Set: A Differentially Private Data Release

Mayana Pereira, Allen Kim, Joshua Allen et al.

Broadband connectivity is a key metric in today's economy. In an era of rapid expansion of the digital economy, it directly impacts GDP. Furthermore, with the COVID-19 guidelines of social distancing, internet connectivity became necessary to everyday activities such as work, learning, and staying in touch with family and friends. This paper introduces a publicly available U.S. Broadband Coverage data set that reports broadband coverage percentages at the zip code level. We also explain how we used differential privacy to guarantee that the privacy of individual households is preserved. Our data set also contains error range estimates, providing information on the expected error introduced by differential privacy per zip code. We describe our error range calculation method and show that this additional data metric does not induce any privacy losses.
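One way an error range can be published without extra privacy loss is sketched below. It assumes the Laplace mechanism on a per-zip-code household count, which the abstract does not confirm; the function names and parameters are hypothetical. The point is that the range depends only on the public noise scale (sensitivity divided by epsilon), never on the data.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-DP Laplace noise (assumed mechanism)."""
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def error_range(epsilon, sensitivity=1.0, confidence=0.95):
    """Half-width t such that |noise| <= t with the given probability.

    For Laplace noise with scale b, P(|noise| > t) = exp(-t / b), so
    t = b * ln(1 / (1 - confidence)). Only public parameters are used.
    """
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))


rng = random.Random(0)
noisy = private_count(1200, epsilon=0.5, rng=rng)  # hypothetical zip code
half_width = error_range(epsilon=0.5)               # same for every zip code
```

Because `error_range` never reads `true_count`, publishing it alongside the noisy value reveals nothing beyond the mechanism's parameters.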

17.9 LG Nov 11, 2020 Code

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar et al.

Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine learning scenarios with private internal data to researchers and practitioners alike. In addition, we propose QUAIL, an ensemble-based modeling approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint.

20.8 CR Jul 2, 2018

An Algorithmic Framework For Differentially Private Data Analysis on Trusted Processors

Joshua Allen, Bolin Ding, Janardhan Kulkarni et al.

Differential privacy has emerged as the main definition for private data analysis and machine learning. The global model of differential privacy, which assumes that users trust the data collector, provides strong privacy guarantees and introduces small errors in the output. In contrast, applications of differential privacy in commercial systems by Apple, Google, and Microsoft use the local model. Here, users do not trust the data collector, and hence randomize their data before sending it to the data collector. Unfortunately, the local model is too strong for several important applications and hence is limited in its applicability. In this work, we propose a framework based on trusted processors and a new definition of differential privacy called Oblivious Differential Privacy, which combines the best of both local and global models. The algorithms we design in this framework show an interesting interplay of ideas from streaming algorithms, oblivious algorithms, and differential privacy.
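The contrast between the global and local models can be made concrete with a small sketch for counting how many users hold a sensitive bit. It does not implement Oblivious Differential Privacy or trusted processors; it only assumes two textbook mechanisms (Laplace noise at the collector, randomized response at each user), and all function names are hypothetical.

```python
import math
import random


def global_count(bits, epsilon, rng):
    """Global model: a trusted collector sums raw bits, then adds noise once."""
    scale = 1.0 / epsilon  # a single user changes the sum by at most 1
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(bits) + noise


def randomized_response(bit, epsilon, rng):
    """Local model: each user reports the true bit with prob e^eps / (e^eps + 1)."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def local_count(bits, epsilon, rng):
    """Collector sees only randomized reports and debiases their sum."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    reports = [randomized_response(b, epsilon, rng) for b in bits]
    n = len(reports)
    # E[sum of reports] = p * k + (1 - p) * (n - k); solve for the true count k.
    return (sum(reports) - (1.0 - p) * n) / (2.0 * p - 1.0)


rng = random.Random(0)
bits = [1] * 3000 + [0] * 7000
global_estimate = global_count(bits, epsilon=1.0, rng=rng)
local_estimate = local_count(bits, epsilon=1.0, rng=rng)
```

In this setup the global estimate carries noise of constant size, while the local estimate's error grows roughly with the square root of the number of users, which is the accuracy cost of not trusting the collector.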

17.6 CR Mar 24, 2018

Comparing Population Means under Local Differential Privacy: with Significance and Power

Bolin Ding, Harsha Nori, Paul Li et al.

A statistical hypothesis test determines whether a hypothesis should be rejected based on samples from populations. In particular, randomized controlled experiments (or A/B testing) that compare population means using, e.g., t-tests, have been widely deployed in technology companies to aid in making data-driven decisions. Samples used in these tests are collected from users and may contain sensitive information. Both the data collection and the testing process may compromise individuals' privacy. In this paper, we study how to conduct hypothesis tests to compare population means while preserving privacy. We use the notion of local differential privacy (LDP), which has recently emerged as the main tool to ensure each individual's privacy without the need for a trusted data collector. We propose LDP tests that inject noise into every user's data in the samples before collecting them (so users do not need to trust the data collector), and draw conclusions with bounded type-I (significance level) and type-II errors (1 - power). Our approaches can be extended to the scenario where some users require LDP while some are willing to provide exact data. We report experimental results on real-world datasets to verify the effectiveness of our approaches.
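The general shape of such a test, with noise added on each user's side before collection, can be sketched as follows. This is a simplified illustration and not the paper's test: it assumes values bounded in [0, upper], Laplace noise with scale upper / epsilon per user, and a large-sample two-sided z-test on the noisy samples. All function names are hypothetical.

```python
import math
import random
import statistics


def privatize(values, epsilon, upper, rng):
    """Each user clips a value to [0, upper] and adds Laplace(upper / epsilon) noise."""
    scale = upper / epsilon
    noisy = []
    for v in values:
        clipped = min(max(v, 0.0), upper)
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(clipped + noise)
    return noisy


def two_sample_z_test(a, b):
    """Two-sided z-test for equal means using sample variances (large-sample approx).

    The sample variances are computed on the noisy data, so they already
    include the variance contributed by the privacy noise.
    """
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


rng = random.Random(0)
control = [rng.random() * 0.5 for _ in range(5000)]          # synthetic data
treatment = [0.5 + rng.random() * 0.5 for _ in range(5000)]  # synthetic data
z, p_value = two_sample_z_test(
    privatize(control, epsilon=1.0, upper=1.0, rng=rng),
    privatize(treatment, epsilon=1.0, upper=1.0, rng=rng),
)
```

The added noise inflates the variance of each sample, so for a fixed significance level the test needs more users to reach the same power than a test on exact data would.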