73.1 DS Apr 12
Tradeoffs in Privacy, Welfare, and Fairness for Facility Location
Sara Fish, Yannai A. Gonczarowski, Jason Z. Tang et al.
The differentially private (DP) facility location problem seeks to determine a socially optimal placement for a public facility while ensuring that each participating agent's location remains private. To privatize its input data, a DP mechanism must inject noise into its output distribution, producing a placement that will have lower expected social welfare than the optimal spot for the facility. The privacy-induced welfare loss can be viewed as the "cost of privacy," illustrating a tradeoff between social welfare and privacy that has been the focus of prior work. Yet, the imposition of privacy also induces a third consideration that has not been similarly studied: fairness in how the "cost of privacy" is distributed across individuals. For instance, a mechanism may satisfy DP with minimal social welfare loss, yet still be undesirable if that loss falls entirely on one individual. In this paper, we quantify this new notion of unfairness and design mechanisms for facility location that attempt to simultaneously optimize across privacy, social welfare, and fairness. We first derive an impossibility result, showing that privacy and fairness cannot be simultaneously guaranteed over all possible datasets that could represent the locations of individuals in a population. We then consider a relaxation that still requires worst-case DP, but only seeks fairness and social welfare over smaller, more "realistic-looking" families of datasets. For this relaxation, we construct a DP mechanism and demonstrate that it is simultaneously optimal (or, for a harder family of datasets, near-optimal up to small factors) on fairness and social welfare. This suggests that while there is a tradeoff between privacy and each of social welfare and fairness, there is no additional tradeoff when we consider all three objectives simultaneously, provided that the population data is sufficiently natural.
95.0 DS Mar 16
Concurrent Composition for Differentially Private Continual Mechanisms
Monika Henzinger, Roodabeh Safavi, Salil Vadhan
Many intended uses of differential privacy involve a $\textit{continual mechanism}$ that is set up to run continuously over a long period of time, making more statistical releases as either queries come in or the dataset is updated. In this paper, we give the first general treatment of privacy against $\textit{adaptive}$ adversaries for mechanisms that support dataset updates and a variety of queries, all arbitrarily interleaved. Our treatment also models a very general notion of neighboring that includes both event-level and user-level privacy. We prove several $\textit{concurrent}$ composition theorems for continual mechanisms, which ensure privacy even when an adversary can interleave queries and dataset updates to the different composed mechanisms. Previous concurrent composition theorems for differential privacy were only for the case when the dataset is static, with no adaptive updates. Moreover, we also give the first interactive and continual generalizations of the "parallel composition theorem" for noninteractive differential privacy. Specifically, we show that the analogue of the noninteractive parallel composition theorem holds if either there are no adaptive dataset updates or each of the composed mechanisms satisfies pure differential privacy, but it fails to hold for composing approximately differentially private mechanisms with dataset updates. We then formalize a set of general conditions on a continual mechanism $M$ that runs multiple continual sub-mechanisms such that the privacy guarantees of $M$ follow directly using the above concurrent composition theorems on the sub-mechanisms, without further privacy loss. This enables us to give a simpler and more modular privacy analysis of a recent continual histogram mechanism of Henzinger, Sricharan, and Steiner. In the case of approximate DP, ours is the first proof showing that its privacy holds against adaptive adversaries.
15.1 DS Mar 26
Bounded Independence Edge Sampling for Combinatorial Graph Properties
Aaron Putterman, Salil Vadhan, Vadim Zaripov
Random subsampling of edges is a commonly employed technique in graph algorithms, underlying a vast array of modern algorithmic breakthroughs. Unfortunately, using this technique often leads to randomized algorithms with no clear path to derandomization because the analyses rely on a union bound on exponentially many events. In this work, we revisit this goal of derandomizing randomized sampling in graphs. We give several results related to bounded-independence edge subsampling, and in the process of doing so, generalize several of the results of Alon and Nussboim (FOCS 2008), who studied bounded-independence analogues of random graphs (which can be viewed as edge subsamples of the complete graph). Most notably, we show: 1. $O(\log(m))$-wise independence suffices for preserving connectivity when sampling at rate $1/2$ in a graph with minimum cut $\geq κ\log(m)$ with probability $1 - \frac{1}{\mathrm{poly}(m)}$ (for a sufficiently large constant $κ$). 2. $O(\log(m))$-wise $\frac{1}{\mathrm{poly}(m)}$-almost independence suffices for ensuring cycle-freeness when sampling at rate $1/2$ in a graph with minimum cycle length $\geq κ\log(m)$ with probability $1 - \frac{1}{\mathrm{poly}(m)}$ (for a sufficiently large constant $κ$). To demonstrate the utility of our results, we revisit the classic problem of using parallel algorithms to find graphic matroid bases, first studied in the work of Karp, Upfal, and Wigderson (FOCS 1985). In this regime, we show that the optimal algorithms of Khanna, Putterman, and Song (arxiv 2025) can be explicitly derandomized while maintaining near-optimality.
LG Oct 18, 2023
Black-Box Training Data Identification in GANs via Detector Networks
Lukman Olagoke, Salil Vadhan, Seth Neel
Since their inception, Generative Adversarial Networks (GANs) have been popular generative models across images, audio, video, and tabular data. In this paper, we study whether an attacker with access to a trained GAN, as well as fresh samples from the underlying distribution, can efficiently identify if a given point is a member of the GAN's training data. This is of interest both for reasons related to copyright, where a user may want to determine if their copyrighted data has been used to train a GAN, and in the study of data privacy, where the ability to detect training set membership is known as a membership inference attack. Unlike the majority of prior work, this paper investigates the privacy implications of using GANs in black-box settings, where the attacker only has access to samples from the generator, rather than access to the discriminator as well. We introduce a suite of membership inference attacks against GANs in the black-box setting and evaluate our attacks on image GANs trained on the CIFAR10 dataset and tabular GANs trained on genomic data. Our most successful attack, called The Detector, involves training a second network to score samples based on their likelihood of being generated by the GAN, as opposed to being a fresh sample from the distribution. We prove under a simple model of the generator that the detector is an approximately optimal membership inference attack. Across a wide range of tabular and image datasets, attacks, and GAN architectures, we find that adversaries can orchestrate non-trivial privacy attacks when provided with access to samples from the generator. At the same time, the attack success achievable against GANs still appears to be lower than against other generative and discriminative models; this leaves the intriguing open question of whether GANs are in fact more private, or if it is a matter of developing stronger attacks.
CC Jul 8, 2025
Generalized and Unified Equivalences between Hardness and Pseudoentropy
Lunjia Hu, Salil Vadhan
Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be achieved by a single universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithmic fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.
CR Mar 7, 2024
Membership Inference Attacks and Privacy in Topic Modeling
Nico Manzonelli, Wanrong Zhang, Salil Vadhan
Recent research shows that large language models are susceptible to privacy attacks that infer aspects of the training data. However, it is unclear if simpler generative models, like topic models, share similar vulnerabilities. In this work, we propose an attack against topic models that can confidently identify members of the training data in Latent Dirichlet Allocation. Our results suggest that the privacy risks associated with generative modeling are not restricted to large neural models. Additionally, to mitigate these vulnerabilities, we explore differentially private (DP) topic modeling. We propose a framework for private topic modeling that incorporates DP vocabulary selection as a pre-processing step, and show that it improves privacy while having limited effects on practical utility.
CR Aug 9, 2021
Canonical Noise Distributions and Private Hypothesis Tests
Jordan Awan, Salil Vadhan
$f$-DP has recently been proposed as a generalization of differential privacy allowing a lossless analysis of composition, post-processing, and privacy amplification via subsampling. In the setting of $f$-DP, we propose the concept of a canonical noise distribution (CND), the first mechanism designed for an arbitrary $f$-DP guarantee. The notion of CND captures whether an additive privacy mechanism perfectly matches the privacy guarantee of a given $f$. We prove that a CND always exists, and give a construction that produces a CND for any $f$. We show that private hypothesis tests are intimately related to CNDs, allowing for the release of private $p$-values at no additional privacy cost as well as the construction of uniformly most powerful (UMP) tests for binary data, within the general $f$-DP framework. We apply our techniques to the problem of difference of proportions testing, and construct a UMP unbiased (UMPU) "semi-private" test which upper bounds the performance of any $f$-DP test. Using this as a benchmark we propose a private test, based on the inversion of characteristic functions, which allows for optimal inference for the two population parameters and is nearly as powerful as the semi-private UMPU. When specialized to the case of $(ε,0)$-DP, we show empirically that our proposed test is more powerful than any $(ε/\sqrt 2)$-DP test and has more accurate type I errors than the classic normal approximation test.
CR May 30, 2021
Concurrent Composition of Differential Privacy
Salil Vadhan, Tianhao Wang
We initiate a study of the composition properties of interactive differentially private mechanisms. An interactive differentially private mechanism is an algorithm that allows an analyst to adaptively ask queries about a sensitive dataset, with the property that an adversarial analyst's view of the interaction is approximately the same regardless of whether or not any individual's data is in the dataset. Previous studies of composition of differential privacy have focused on non-interactive algorithms, but interactive mechanisms are needed to capture many of the intended applications of differential privacy and a number of the important differentially private primitives. We focus on concurrent composition, where an adversary can arbitrarily interleave its queries to several differentially private mechanisms, which may be feasible when differentially private query systems are deployed in practice. We prove that when the interactive mechanisms being composed are pure differentially private, their concurrent composition achieves privacy parameters (with respect to pure or approximate differential privacy) that match the (optimal) composition theorem for noninteractive differential privacy. We also prove a composition theorem for interactive mechanisms that satisfy approximate differential privacy. That bound is weaker than even the basic (suboptimal) composition theorem for noninteractive differential privacy, and we leave closing the gap as a direction for future research, along with understanding concurrent composition for other variants of differential privacy.
CR May 4, 2021
Inaccessible Entropy II: IE Functions and Universal One-Way Hashing
Iftach Haitner, Thomas Holenstein, Omer Reingold et al.
This paper uses a variant of the notion of \emph{inaccessible entropy} (Haitner, Reingold, Vadhan and Wee, STOC 2009), to give an alternative construction and proof for the fundamental result, first proved by Rompel (STOC 1990), that \emph{Universal One-Way Hash Functions (UOWHFs)} can be based on any one-way function. We observe that a small tweak of any one-way function $f$ is already a weak form of a UOWHF: consider the function $F(x,i)$ that returns the $i$-bit-long prefix of $f(x)$. If $F$ were a UOWHF then given a random $x$ and $i$ it would be hard to come up with $x'\neq x$ such that $F(x,i)=F(x',i)$. While this may not be the case, we show (rather easily) that it is hard to sample $x'$ with almost full entropy among all the possible such values of $x'$. The rest of our construction simply amplifies and exploits this basic property. Combined with other recent work, the construction of three fundamental cryptographic primitives (Pseudorandom Generators, Statistically Hiding Commitments and UOWHFs) out of one-way functions is now to a large extent unified. In particular, all three constructions rely on and manipulate computational notions of entropy in similar ways. Pseudorandom Generators rely on the well-established notion of pseudoentropy, whereas Statistically Hiding Commitments and UOWHFs rely on the newer notion of inaccessible entropy.
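The prefix function $F(x,i)$ described in this abstract can be sketched as follows. This is only an illustration: SHA-256 stands in for the one-way function $f$ (an assumption made here, not a choice from the paper), and the bit-string output format is likewise hypothetical.

```python
import hashlib


def f(x: bytes) -> bytes:
    # Stand-in for a one-way function f. Assumption: SHA-256 is used purely
    # for illustration; the paper works with an arbitrary one-way function.
    return hashlib.sha256(x).digest()


def F(x: bytes, i: int) -> str:
    """Return the i-bit-long prefix of f(x) as a string of '0'/'1' characters."""
    bits = "".join(f"{byte:08b}" for byte in f(x))
    if not 0 <= i <= len(bits):
        raise ValueError("i must be between 0 and the output length of f in bits")
    return bits[:i]


def collides(x: bytes, x_prime: bytes, i: int) -> bool:
    """Check whether x' != x and F(x, i) == F(x', i)."""
    return x != x_prime and F(x, i) == F(x_prime, i)
```

For small `i` collisions are easy to find by brute force, which is consistent with the abstract's remark that $F$ need not itself be a UOWHF.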
CR Oct 12, 2020
Inaccessible Entropy I: Inaccessible Entropy Generators and Statistically Hiding Commitments from One-Way Functions
Iftach Haitner, Omer Reingold, Salil Vadhan et al.
We put forth a new computational notion of entropy, measuring the (in)feasibility of sampling high-entropy strings that are consistent with a given generator. Specifically, the i'th output block of a generator G has accessible entropy at most k if the following holds: when conditioning on its prior coin tosses, no polynomial-time strategy $\widetilde{G}$ can generate valid output for G's i'th output block with entropy greater than k. A generator has inaccessible entropy if the total accessible entropy (summed over the blocks) is noticeably smaller than the real entropy of G's output. As an application of the above notion, we improve upon the result of Haitner, Nguyen, Ong, Reingold, and Vadhan [Sicomp '09], presenting a much simpler and more efficient construction of statistically hiding commitment schemes from arbitrary one-way functions.
LG Jul 10, 2020
Differentially Private Simple Linear Regression
Daniel Alabi, Audra McMillan, Jayshree Sarathy et al.
Economics and social science research often requires analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
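For readers unfamiliar with the Theil-Sen estimator named in this abstract, a minimal sketch of the classical, non-private version is given below. It adds no noise and therefore provides no differential privacy; the paper's private variants are not reproduced here. The intercept rule (median of `y - slope * x`) is one common convention and is an assumption of this sketch.

```python
from itertools import combinations
from statistics import median
from typing import Sequence, Tuple


def theil_sen(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Non-private Theil-Sen fit: slope is the median of all pairwise slopes."""
    points = list(zip(xs, ys))
    # Slopes between every pair of points with distinct x-coordinates.
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    # Assumed convention: intercept is the median residual offset.
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept
```

Because the fit depends on medians, a single outlying point moves the estimate far less than it would move a least-squares fit, which is the robustness property the abstract alludes to.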
CR Feb 28, 2019
Unifying computational entropies via Kullback-Leibler divergence
Rohit Agrawal, Yi-Hsiu Chen, Thibaut Horel et al.
We introduce hardness in relative entropy, a new notion of hardness for search problems which on the one hand is satisfied by all one-way functions and on the other hand implies both next-block pseudoentropy and inaccessible entropy, two forms of computational entropy used in recent constructions of pseudorandom generators and statistically hiding commitment schemes, respectively. Thus, hardness in relative entropy unifies the latter two notions of computational entropy and sheds light on the apparent "duality" between them. Additionally, it yields a more modular and illuminating proof that one-way functions imply next-block inaccessible entropy, similar in structure to the proof that one-way functions imply next-block pseudoentropy (Vadhan and Zheng, STOC '12).
HC Sep 11, 2018
Usable Differential Privacy: A Case Study with PSI
Jack Murtagh, Kathryn Taylor, George Kellaris et al.
Differential privacy is a promising framework for addressing the privacy concerns in sharing sensitive datasets for others to analyze. However, differential privacy is a highly technical area and current deployments often require experts to write code, tune parameters, and optimize the trade-off between the privacy and accuracy of statistical releases. For differential privacy to achieve its potential for wide impact, it is important to design usable systems that enable differential privacy to be used by ordinary data owners and analysts. PSI is a tool that was designed for this purpose, allowing researchers to release useful differentially private statistical information about their datasets without being experts in computer science, statistics, or privacy. We conducted a thorough usability study of PSI to test whether it accomplishes its goal of usability by non-experts. The usability test illuminated which features of PSI are most user-friendly and prompted us to improve aspects of the tool that caused confusion. The test also highlighted some general principles and lessons for designing usable systems for differential privacy, which we discuss in depth.
CR Nov 10, 2017
Finite Sample Differentially Private Confidence Intervals
Vishesh Karwa, Salil Vadhan
We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set, showing that our parameters are optimal up to polylogarithmic factors.
CR Sep 14, 2016
PSI (Ψ): a Private data Sharing Interface
Marco Gaboardi, James Honaker, Gary King et al.
We provide an overview of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
CR May 26, 2016
Privacy Odometers and Filters: Pay-as-you-Go Composition
Ryan Rogers, Aaron Roth, Jonathan Ullman et al.
In this paper we initiate the study of adaptive composition in differential privacy when the length of the composition and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer use-cases.
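The two interfaces described in this abstract can be illustrated with a toy sketch. Everything here is an assumption for illustration: the class names are hypothetical, and the accounting simply sums pure-DP epsilons, which is far cruder than the bounds the paper proves.

```python
class AdditivePrivacyFilter:
    """Toy privacy filter: refuse any analysis that would exceed a fixed budget.

    Assumption: privacy loss is accounted for by summing epsilons. The paper's
    filters use sharper composition bounds that are not reproduced here.
    """

    def __init__(self, budget: float) -> None:
        self.budget = budget
        self.spent = 0.0

    def try_spend(self, epsilon: float) -> bool:
        """Return True and record the cost if it fits; otherwise signal a halt."""
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        if self.spent + epsilon > self.budget:
            return False  # halt: running this analysis would exceed the budget
        self.spent += epsilon
        return True


class AdditivePrivacyOdometer:
    """Toy privacy odometer: track the running total with no preset budget."""

    def __init__(self) -> None:
        self.spent = 0.0

    def record(self, epsilon: float) -> float:
        """Record one analysis and return the running total so far."""
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        self.spent += epsilon
        return self.spent
```

In both classes the analyst may pick each `epsilon` after seeing earlier results, which is the adaptive-parameter setting the abstract describes; the filter answers "may I continue?", while the odometer answers "how much have I used?".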
DS Apr 19, 2016
Locating a Small Cluster Privately
Kobbi Nissim, Uri Stemmer, Salil Vadhan
We present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim, and Smith, 2006]. Our algorithm has implications for private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova, and Smith, 2007], which allows compiling of "off the shelf" (non-private) analyses into analyses that preserve differential privacy.
ST Feb 7, 2016
Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing
Marco Gaboardi, Hyun woo Lim, Ryan Rogers et al.
Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables. We propose new tests for goodness of fit and independence testing that, like the classical versions, can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance. We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables, which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.
CR Apr 28, 2015
Differentially Private Release and Learning of Threshold Functions
Mark Bun, Kobbi Nissim, Uri Stemmer et al.
We prove new upper and lower bounds on the sample complexity of $(ε, δ)$ differentially private algorithms for releasing approximate answers to threshold functions. A threshold function $c_x$ over a totally ordered domain $X$ evaluates to $c_x(y) = 1$ if $y \le x$, and evaluates to $0$ otherwise. We give the first nontrivial lower bound for releasing thresholds with $(ε,δ)$ differential privacy, showing that the task is impossible over an infinite domain $X$, and moreover requires sample complexity $n \ge Ω(\log^*|X|)$, which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with $n \le 2^{(1+ o(1))\log^*|X|}$ samples. This improves the previous best upper bound of $8^{(1 + o(1))\log^*|X|}$ (Beimel et al., RANDOM '13). Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with $(ε,δ)$ differential privacy and learning without privacy. For properly learning thresholds in $\ell$ dimensions, this lower bound extends to $n \ge Ω(\ell \cdot \log^*|X|)$. To obtain our results, we give reductions in both directions between releasing and properly learning thresholds and the simpler interior point problem. Given a database $D$ of elements from $X$, the interior point problem asks for an element between the smallest and largest elements in $D$. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.
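The two objects defined in this abstract, threshold functions and the interior point problem, can be made concrete with a short non-private sketch. The function names are hypothetical, the domain $X$ is taken to be the integers for simplicity, and none of the paper's private algorithms are reproduced.

```python
from typing import Callable, Sequence


def threshold(x: int) -> Callable[[int], int]:
    """Return the threshold function c_x, where c_x(y) = 1 if y <= x, else 0."""
    def c_x(y: int) -> int:
        return 1 if y <= x else 0
    return c_x


def exact_threshold_answer(x: int, database: Sequence[int]) -> float:
    """Exact (non-private) average of c_x over the database."""
    c_x = threshold(x)
    return sum(c_x(y) for y in database) / len(database)


def is_interior_point(candidate: int, database: Sequence[int]) -> bool:
    """Check that candidate lies between the smallest and largest database elements."""
    return min(database) <= candidate <= max(database)
```

Note that a valid interior point need not be an element of the database; any value within the range qualifies, which is what makes the problem "simpler" than releasing all thresholds.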
CR Nov 13, 2013
Fingerprinting Codes and the Price of Approximate Differential Privacy
Mark Bun, Jonathan Ullman, Salil Vadhan
We show new lower bounds on the sample complexity of $(\varepsilon, δ)$-differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $D \in (\{0,1\}^d)^n$ has the form "What fraction of the individual records in the database satisfy the property $q$?" We show that in order to answer an arbitrary set $\mathcal{Q}$ of $\gg nd$ counting queries on $D$ to within error $\pm α$ it is necessary that $$ n \geq \tilde{Ω}\Bigg(\frac{\sqrt{d} \log |\mathcal{Q}|}{α^2 \varepsilon} \Bigg). $$ This bound is optimal up to poly-logarithmic factors, as demonstrated by the Private Multiplicative Weights algorithm (Hardt and Rothblum, FOCS'10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and $(\varepsilon, δ)$-differential privacy is asymptotically larger than what is required merely for accuracy, which is $O(\log |\mathcal{Q}| / α^2)$. In addition, we show that our lower bound holds for the specific case of $k$-way marginal queries (where $|\mathcal{Q}| = 2^k \binom{d}{k}$) when $α$ is not too small compared to $d$ (e.g. when $α$ is any fixed constant). Our results rely on the existence of short \emph{fingerprinting codes} (Boneh and Shaw, CRYPTO'95, Tardos, STOC'03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample complexity lower bounds into stronger lower bounds.
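To make the definitions in this abstract concrete, the sketch below evaluates a counting query exactly (with no privacy noise) and computes the number of $k$-way marginal queries, $2^k \binom{d}{k}$. The function names and the tuple representation of records are assumptions made for illustration.

```python
from math import comb
from typing import Callable, Sequence, Tuple

Record = Tuple[int, ...]  # one individual's d binary attributes


def counting_query(database: Sequence[Record], q: Callable[[Record], bool]) -> float:
    """Exact fraction of records in the database that satisfy the property q."""
    return sum(1 for record in database if q(record)) / len(database)


def num_k_way_marginals(d: int, k: int) -> int:
    """Number of k-way marginal queries: choose k attributes and a value for each."""
    return 2 ** k * comb(d, k)
```

As a usage example, with `database = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]` and `q = lambda r: r[0] == 1 and r[1] == 1`, the query is a 2-way marginal over the first two attributes, and two of the four records satisfy it.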