Ashwin Machanavajjhala

h-index40

18papers

778citations

Novelty47%

AI Score32

Ranked #123,889 of 194,257 authors (top 64%)#254 in DB (top 57%)

18 Papers

2.3CRMay 2, 2025

Slowly Scaling Per-Record Differential Privacy

Brian Finley, Anthony M Caruso, Justin C Doty et al.

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.

36.2CRDec 16, 2021

Benchmarking Differentially Private Synthetic Data Generation Algorithms

Yuchao Tao, Ryan McKenna, Michael Hay et al.

This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluation we identify the top performing algorithms and those that consistently fail to beat baseline approaches.

4.3CYNov 8, 2021

Equity and Privacy: More Than Just a Tradeoff

David Pujol, Ashwin Machanavajjhala

While the entire field of privacy preserving data analytics is focused on the privacy-utility tradeoff, recent work has shown that privacy preserving data publishing can introduce different levels of utility across different population groups. It is important to understand this new tradeoff between privacy and equity as privacy technology is being deployed in situations where the data products will be used for research and policy making. Will marginal populations see disproportionately less utility from privacy technology? If there is an inequity how can we address it?

13.6CRJul 22, 2021

Differentially Private Algorithms for 2020 Census Detailed DHC Race \& Ethnicity

Sam Haney, William Sexton, Ashwin Machanavajjhala et al.

This article describes a proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided Geometric distribution that satisfies "pure"-DP, and another based on adding noise from a Discrete Gaussian distribution that satisfied a well studied variant of differential privacy, called Zero Concentrated Differential Privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics.

10.8DBJun 23, 2021

HDMM: Optimizing error of high-dimensional statistical queries under differential privacy

Ryan McKenna, Gerome Miklau, Michael Hay et al.

In this work we describe the High-Dimensional Matrix Mechanism (HDMM), a differentially private algorithm for answering a workload of predicate counting queries. HDMM represents query workloads using a compact implicit matrix representation and exploits this representation to efficiently optimize over (a subset of) the space of differentially private algorithms for one that is unbiased and answers the input query workload with low expected error. HDMM can be deployed for both $ε$-differential privacy (with Laplace noise) and $(ε, δ)$-differential privacy (with Gaussian noise), although the core techniques are slightly different for each. We demonstrate empirically that HDMM can efficiently answer queries with lower expected error than state-of-the-art techniques, and in some cases, it nearly matches existing lower bounds for the particular class of mechanisms we consider.

1.2DSJun 9, 2021

Prior-Aware Distribution Estimation for Differential Privacy

Yuchao Tao, Johes Bater, Ashwin Machanavajjhala

Joint distribution estimation of a dataset under differential privacy is a fundamental problem for many privacy-focused applications, such as query answering, machine learning tasks and synthetic data generation. In this work, we examine the joint distribution estimation problem given two data points: 1) differentially private answers of a workload computed over private data and 2) a prior empirical distribution from a public dataset. Our goal is to find a new distribution such that estimating the workload using this distribution is as accurate as the differentially private answer, and the relative entropy, or KL divergence, of this distribution is minimized with respect to the prior distribution. We propose an approach based on iterative optimization for solving this problem. An application of our solution won second place in the NIST 2020 Differential Privacy Temporal Map Challenge, Sprint 2.

19.6CRApr 19, 2020

DP-Cryptography: Marrying Differential Privacy and Cryptography in Emerging Applications

Sameer Wagh, Xi He, Ashwin Machanavajjhala et al.

Differential privacy (DP) has arisen as the state-of-the-art metric for quantifying individual privacy when sensitive data are analyzed, and it is starting to see practical deployment in organizations such as the US Census Bureau, Apple, Google, etc. There are two popular models for deploying differential privacy - standard differential privacy (SDP), where a trusted server aggregates all the data and runs the DP mechanisms, and local differential privacy (LDP), where each user perturbs their own data and perturbed data is analyzed. Due to security concerns arising from aggregating raw data at a single server, several real world deployments in industry have embraced the LDP model. However, systems based on the LDP model tend to have poor utility - "a gap" in the utility achieved as compared to systems based on the SDP model. In this work, we survey and synthesize emerging directions of research at the intersection of differential privacy and cryptography. First, we survey solutions that combine cryptographic primitives like secure computation and anonymous communication with differential privacy to give alternatives to the LDP model that avoid a trusted server as in SDP but close the gap in accuracy. These primitives introduce performance bottlenecks and necessitate efficient alternatives. Second, we synthesize work in an area we call "DP-Cryptography" - cryptographic primitives that are allowed to leak differentially private outputs. These primitives have orders of magnitude better performance than standard cryptographic primitives. DP-cryptographic primitives are perfectly suited for implementing alternatives to LDP, but are also applicable to scenarios where standard cryptographic primitives do not have practical implementations. Through this unique lens of research taxonomy, we survey ongoing research in these directions while also providing novel directions for future research.

2.3DBAug 27, 2019

Answering Summation Queries for Numerical Attributes under Differential Privacy

Yikai Wu, David Pujol, Ios Kotsogiannis et al.

In this work we explore the problem of answering a set of sum queries under Differential Privacy. This is a little understood, non-trivial problem especially in the case of numerical domains. We show that traditional techniques from the literature are not always the best choice and a more rigorous approach is necessary to develop low error algorithms.

11.5LGJul 3, 2019

Capacity Bounded Differential Privacy

Kamalika Chaudhuri, Jacob Imola, Ashwin Machanavajjhala

Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded -- i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. We model adversaries in terms of restricted f-divergences between probability distributions, and study properties of the definition and algorithms that satisfy them.

17.0CRFeb 20, 2019

Crypt$ε$: Crypto-Assisted Differential Privacy on Untrusted Servers

Amrita Roy Chowdhury, Chenghong Wang, Xi He et al.

Differential privacy (DP) has steadily become the de-facto standard for achieving privacy in data analysis, which is typically implemented either in the "central" or "local" model. The local model has been more popular for commercial deployments as it does not require a trusted data collector. This increased privacy, however, comes at a cost of utility and algorithmic expressibility as compared to the central model. In this work, we propose, Crypt$ε$, a system and programming framework that (1) achieves the accuracy guarantees and algorithmic expressibility of the central model (2) without any trusted data collector like in the local model. Crypt$ε$ achieves the "best of both worlds" by employing two non-colluding untrusted servers that run DP programs on encrypted data from the data owners. Although straightforward implementations of DP programs using secure computation tools can achieve the above goal theoretically, in practice they are beset with many challenges such as poor performance and tricky security proofs. To this end, Crypt$ε$ allows data analysts to author logical DP programs that are automatically translated to secure protocols that work on encrypted data. These protocols ensure that the untrusted servers learn nothing more than the noisy outputs, thereby guaranteeing DP (for computationally bounded adversaries) for all Crypt$ε$ programs. Crypt$ε$ supports a rich class of DP programs that can be expressed via a small set of transformation and measurement operators followed by arbitrary post-processing. Further, we propose performance optimizations leveraging the fact that the output is noisy. We demonstrate Crypt$ε$'s feasibility for practical DP analysis with extensive empirical evaluations on real datasets.

21.6DBAug 10, 2018

Optimizing error of high-dimensional statistical queries under differential privacy

Ryan McKenna, Gerome Miklau, Michael Hay et al.

Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets.

21.3CRDec 16, 2017

One-sided Differential Privacy

Stelios Doudalis, Ios Kotsogiannis, Samuel Haney et al.

In this paper, we study the problem of privacy-preserving data sharing, wherein only a subset of the records in a database are sensitive, possibly based on predefined privacy policies. Existing solutions, viz, differential privacy (DP), are over-pessimistic and treat all information as sensitive. Alternatively, techniques, like access control and personalized differential privacy, reveal all non-sensitive records truthfully, and they indirectly leak information about sensitive records through exclusion attacks. Motivated by the limitations of prior work, we introduce the notion of one-sided differential privacy (OSDP). We formalize the exclusion attack and we show how OSDP protects against it. OSDP offers differential privacy like guarantees, but only to the sensitive records. OSDP allows the truthful release of a subset of the non-sensitive records. The sample can be used to support applications that must output true data, and is well suited for publishing complex types of data, e.g. trajectories. Though some non-sensitive records are suppressed to avoid exclusion attacks, our experiments show that the suppression results in a small loss in utility in most cases. Additionally, we present a recipe for turning DP mechanisms for answering counting queries into OSDP techniques for the same task. Our OSDP algorithms leverage the presence of non-sensitive records and are able to offer up to a 25x improvement in accuracy over state-of-the-art DP-solutions.

12.2DBFeb 2, 2017

Composing Differential Privacy and Secure Computation: A case study on scaling private record linkage

Xi He, Ashwin Machanavajjhala, Cheryl Flynn et al.

Private record linkage (PRL) is the problem of identifying pairs of records that are similar as per an input matching rule from databases held by two parties that do not trust one another. We identify three key desiderata that a PRL solution must ensure: 1) perfect precision and high recall of matching pairs, 2) a proof of end-to-end privacy, and 3) communication and computational costs that scale subquadratically in the number of input records. We show that all of the existing solutions for PRL - including secure 2-party computation (S2PC), and their variants that use non-private or differentially private (DP) blocking to ensure subquadratic cost - violate at least one of the three desiderata. In particular, S2PC techniques guarantee end-to-end privacy but have either low recall or quadratic cost. In contrast, no end-to-end privacy guarantee has been formalized for solutions that achieve subquadratic cost. This is true even for solutions that compose DP and S2PC: DP does not permit the release of any exact information about the databases, while S2PC algorithms for PRL allow the release of matching records. In light of this deficiency, we propose a novel privacy model, called output constrained differential privacy, that shares the strong privacy protection of DP, but allows for the truthful release of the output of a certain function applied to the data. We apply this to PRL, and show that protocols satisfying this privacy model permit the disclosure of the true matching records, but their execution is insensitive to the presence or absence of a single non-matching record. We find that prior work that combine DP and S2PC techniques even fail to satisfy this end-to-end privacy model. Hence, we develop novel protocols that provably achieve this end-to-end privacy guarantee, together with the other two desiderata of PRL.

2.3CYJan 3, 2017

Privacy-Preserving Data Analysis for the Federal Statistical Agencies

John Abowd, Lorenzo Alvisi, Cynthia Dwork et al.

Government statistical agencies collect enormously valuable data on the nation's population and business activities. Wide access to these data enables evidence-based policy making, supports new research that improves society, facilitates training for students in data science, and provides resources for the public to better understand and participate in their society. These data also affect the private sector. For example, the Employment Situation in the United States, published by the Bureau of Labor Statistics, moves markets. Nonetheless, government agencies are under increasing pressure to limit access to data because of a growing understanding of the threats to data privacy and confidentiality. "De-identification" - stripping obvious identifiers like names, addresses, and identification numbers - has been found inadequate in the face of modern computational and informational resources. Unfortunately, the problem extends even to the release of aggregate data statistics. This counter-intuitive phenomenon has come to be known as the Fundamental Law of Information Recovery. It says that overly accurate estimates of too many statistics can completely destroy privacy. One may think of this as death by a thousand cuts. Every statistic computed from a data set leaks a small amount of information about each member of the data set - a tiny cut. This is true even if the exact value of the statistic is distorted a bit in order to preserve privacy. But while each statistical release is an almost harmless little cut in terms of privacy risk for any individual, the cumulative effect can be to completely compromise the privacy of some individuals.

18.1DBDec 15, 2015

Principled Evaluation of Differentially Private Algorithms using DPBench

Michael Hay, Ashwin Machanavajjhala, Gerome Miklau et al.

Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.

10.3DBAug 28, 2015

On the Privacy Properties of Variants on the Sparse Vector Technique

Yan Chen, Ashwin Machanavajjhala

The sparse vector technique is a powerful differentially private primitive that allows an analyst to check whether queries in a stream are greater or lesser than a threshold. This technique has a unique property -- the algorithm works by adding noise with a finite variance to the queries and the threshold, and guarantees privacy that only degrades with (a) the maximum sensitivity of any one query in stream, and (b) the number of positive answers output by the algorithm. Recent work has developed variants of this algorithm, which we call {\em generalized private threshold testing}, and are claimed to have privacy guarantees that do not depend on the number of positive or negative answers output by the algorithm. These algorithms result in a significant improvement in utility over the sparse vector technique for a given privacy budget, and have found applications in frequent itemset mining, feature selection in machine learning and generating synthetic data. In this paper we critically analyze the privacy properties of generalized private threshold testing. We show that generalized private threshold testing does not satisfy ε-differential privacy for any finite ε. We identify a subtle error in the privacy analysis of this technique in prior work. Moreover, we show an adversary can use generalized private threshold testing to recover counts from the datasets (especially small counts) exactly with high accuracy, and thus can result in individuals being reidentified. We demonstrate our attacks empirically on real datasets.

14.0LGNov 20, 2014

Differentially Private Algorithms for Empirical Machine Learning

Ben Stoddard, Yan Chen, Ashwin Machanavajjhala

An important use of private data is to build machine learning classifiers. While there is a burgeoning literature on differentially private classification algorithms, we find that they are not practical in real applications due to two reasons. First, existing differentially private classifiers provide poor accuracy on real world datasets. Second, there is no known differentially private algorithm for empirically evaluating the private classifier on a private test dataset. In this paper, we develop differentially private algorithms that mirror real world empirical machine learning workflows. We consider the private classifier training algorithm as a blackbox. We present private algorithms for selecting features that are input to the classifier. Though adding a preprocessing step takes away some of the privacy budget from the actual classification process (thus potentially making it noisier and less accurate), we show that our novel preprocessing techniques significantly increase classifier accuracy on three real-world datasets. We also present the first private algorithms for empirically constructing receiver operating characteristic (ROC) curves on a private test set.

4.3DBApr 14, 2014

Design of Policy-Aware Differentially Private Algorithms

Samuel Haney, Ashwin Machanavajjhala, Bolin Ding

The problem of designing error optimal differentially private algorithms is well studied. Recent work applying differential privacy to real world settings have used variants of differential privacy that appropriately modify the notion of neighboring databases. The problem of designing error optimal algorithms for such variants of differential privacy is open. In this paper, we show a novel transformational equivalence result that can turn the problem of query answering under differential privacy with a modified notion of neighbors to one of query answering under standard differential privacy, for a large class of neighbor definitions. We utilize the Blowfish privacy framework that generalizes differential privacy. Blowfish uses a {\em policy graph} to instantiate different notions of neighboring databases. We show that the error incurred when answering a workload $\mathbf{W}$ on a database $\mathbf{x}$ under a Blowfish policy graph $G$ is identical to the error required to answer a transformed workload $f_G(\mathbf{W})$ on database $g_G(\mathbf{x})$ under standard differential privacy, where $f_G$ and $g_G$ are linear transformations based on $G$. Using this result, we develop error efficient algorithms for releasing histograms and multidimensional range queries under different Blowfish policies. We believe the tools we develop will be useful for finding mechanisms to answer many other classes of queries with low error under other policy graphs.