Marco Gaboardi

h-index13

18papers

1,223citations

Novelty51%

AI Score32

Ranked #126,048 of 194,257 authors (top 65%)#27,744 in LG (top 69%)

18 Papers

26.1LGMar 22, 2023

Stability is Stable: Connections between Replicability, Privacy, and Adaptive Generalization

Mark Bun, Marco Gaboardi, Max Hopkins et al.

The notion of replicable algorithms was introduced in Impagliazzo et al. [STOC '22] to describe randomized algorithms that are stable under the resampling of their inputs. More precisely, a replicable algorithm gives the same output with high probability when its randomness is fixed and it is run on a new i.i.d. sample drawn from the same distribution. Using replicable algorithms for data analysis can facilitate the verification of published results by ensuring that the results of an analysis will be the same with high probability, even when that analysis is performed on a new data set. In this work, we establish new connections and separations between replicability and standard notions of algorithmic stability. In particular, we give sample-efficient algorithmic reductions between perfect generalization, approximate differential privacy, and replicability for a broad class of statistical problems. Conversely, we show any such equivalence must break down computationally: there exist statistical problems that are easy under differential privacy, but that cannot be solved replicably without breaking public-key cryptography. Furthermore, these results are tight: our reductions are statistically optimal, and we show that any computational separation between DP and replicability must imply the existence of one-way functions. Our statistical reductions give a new algorithmic framework for translating between notions of stability, which we instantiate to answer several open questions in replicability and privacy. This includes giving sample-efficient replicable algorithms for various PAC learning, distribution estimation, and distribution testing problems, algorithmic amplification of $δ$ in approximate DP, conversions from item-level to user-level privacy, and the existence of private agnostic-to-realizable learning reductions under structured distributions.

2.3LONov 24, 2024

Kleene algebra with commutativity conditions is undecidable

Arthur Azevedo de Amorim, Cheng Zhang, Marco Gaboardi

We prove that the equational theory of Kleene algebra with commutativity conditions on primitives (or atomic terms) is undecidable, thereby settling a longstanding open question in the theory of Kleene algebra. While this question has also been recently solved independently by Kuznetsov, our results hold even for weaker theories that do not support the induction axioms of Kleene algebra.

2.3PLAug 17, 2021

On Incorrectness Logic and Kleene Algebra with Top and Tests

Cheng Zhang, Arthur Azevedo de Amorim, Marco Gaboardi

Kleene algebra with tests (KAT) is a foundational equational framework for reasoning about programs, which has found applications in program transformations, networking and compiler optimizations, among many other areas. In his seminal work, Kozen proved that KAT subsumes propositional Hoare logic, showing that one can reason about the (partial) correctness of while programs by means of the equational theory of KAT. In this work, we investigate the support that KAT provides for reasoning about incorrectness, instead, as embodied by Ohearn's recently proposed incorrectness logic. We show that KAT cannot directly express incorrectness logic. The main reason for this limitation can be traced to the fact that KAT cannot express explicitly the notion of codomain, which is essential to express incorrectness triples. To address this issue, we study Kleene Algebra with Top and Tests (TopKAT), an extension of KAT with a top element. We show that TopKAT is powerful enough to express a codomain operation, to express incorrectness triples, and to prove all the rules of incorrectness logic sound. This shows that one can reason about the incorrectness of while-like programs by means of the equational theory of TopKAT.

5.5LGJul 22, 2021

Multiclass versus Binary Differentially Private PAC Learning

Mark Bun, Marco Gaboardi, Satchit Sivakumar

We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields an exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of $Ψ$-dimension defined in work of Ben-David et al. [JCSS '95] to the online setting and explores its general properties.

21.0LGJun 24, 2021

Covariance-Aware Private Mean Estimation Without Private Covariance Estimation

Gavin Brown, Marco Gaboardi, Adam Smith et al.

We present two sample-efficient differentially private mean estimators for $d$-dimensional (sub)Gaussian distributions with unknown covariance. Informally, given $n \gtrsim d/α^2$ samples from such a distribution with mean $μ$ and covariance $Σ$, our estimators output $\tildeμ$ such that $\| \tildeμ- μ\|_Σ \leq α$, where $\| \cdot \|_Σ$ is the Mahalanobis distance. All previous estimators with the same guarantee either require strong a priori bounds on the covariance matrix or require $Ω(d^{3/2})$ samples. Each of our estimators is based on a simple, general approach to designing differentially private mechanisms, but with novel technical steps to make the estimator private and sample-efficient. Our first estimator samples a point with approximately maximum Tukey depth using the exponential mechanism, but restricted to the set of points of large Tukey depth. Its accuracy guarantees hold even for data sets that have a small amount of adversarial corruption. Proving that this mechanism is private requires a novel analysis. Our second estimator perturbs the empirical mean of the data set with noise calibrated to the empirical covariance, without releasing the covariance itself. Its sample complexity guarantees hold more generally for subgaussian distributions, albeit with a slightly worse dependence on the privacy parameter. For both estimators, careful preprocessing of the data is required to satisfy differential privacy.

10.1LGNov 11, 2020

Empirical Risk Minimization in the Non-interactive Local Model of Differential Privacy

Di Wang, Marco Gaboardi, Adam Smith et al.

In this paper, we study the Empirical Risk Minimization (ERM) problem in the non-interactive Local Differential Privacy (LDP) model. Previous research on this problem \citep{smith2017interaction} indicates that the sample complexity, to achieve error $α$, needs to be exponentially depending on the dimensionality $p$ for general loss functions. In this paper, we make two attempts to resolve this issue by investigating conditions on the loss functions that allow us to remove such a limit. In our first attempt, we show that if the loss function is $(\infty, T)$-smooth, by using the Bernstein polynomial approximation we can avoid the exponential dependency in the term of $α$. We then propose player-efficient algorithms with $1$-bit communication complexity and $O(1)$ computation cost for each player. The error bound of these algorithms is asymptotically the same as the original one. With some additional assumptions, we also give an algorithm which is more efficient for the server. In our second attempt, we show that for any $1$-Lipschitz generalized linear convex loss function, there is an $(ε, δ)$-LDP algorithm whose sample complexity for achieving error $α$ is only linear in the dimensionality $p$. Our results use a polynomial of inner product approximation technique. Finally, motivated by the idea of using polynomial approximation and based on different types of polynomial approximations, we propose (efficient) non-interactive locally differentially private algorithms for learning the set of k-way marginal queries and the set of smooth queries.

5.1MEJul 24, 2020

Controlling Privacy Loss in Sampling Schemes: an Analysis of Stratified and Cluster Sampling

Mark Bun, Jörg Drechsler, Marco Gaboardi et al.

Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.

10.3CCNov 8, 2019

The Complexity of Verifying Loop-Free Programs as Differentially Private

Marco Gaboardi, Kobbi Nissim, David Purser

We study the problem of verifying differential privacy for loop-free programs with probabilistic choice. Programs in this class can be seen as randomized Boolean circuits, which we will use as a formal model to answer two different questions: first, deciding whether a program satisfies a prescribed level of privacy; second, approximating the privacy parameters a program realizes. We show that the problem of deciding whether a program satisfies $\varepsilon$-differential privacy is $coNP^{\#P}$-complete. In fact, this is the case when either the input domain or the output range of the program is large. Further, we show that deciding whether a program is $(\varepsilon,δ)$-differentially private is $coNP^{\#P}$-hard, and in $coNP^{\#P}$ for small output domains, but always in $coNP^{\#P^{\#P}}$. Finally, we show that the problem of approximating the level of differential privacy is both $NP$-hard and $coNP$-hard. These results complement previous results by Murtagh and Vadhan showing that deciding the optimal composition of differentially private components is $\#P$-complete, and that approximating the optimal composition of differentially private components is in $P$.

4.3DSOct 26, 2019

Facility Location Problem in Differential Privacy Model Revisited

Yunus Esencayi, Marco Gaboardi, Shi Li et al.

In this paper we study the uncapacitated facility location problem in the model of differential privacy (DP) with uniform facility cost. Specifically, we first show that, under the hierarchically well-separated tree (HST) metrics and the super-set output setting that was introduced in Gupta et. al., there is an $ε$-DP algorithm that achieves an $O(\frac{1}ε)$(expected multiplicative) approximation ratio; this implies an $O(\frac{\log n}ε)$ approximation ratio for the general metric case, where $n$ is the size of the input metric. These bounds improve the best-known results given by Gupta et. al. In particular, our approximation ratio for HST-metrics is independent of $n$, and the ratio for general metrics is independent of the aspect ratio of the input metric. On the negative side, we show that the approximation ratio of any $ε$-DP algorithm is lower bounded by $Ω(\frac{1}{\sqrtε})$, even for instances on HST metrics with uniform facility cost, under the super-set output setting. The lower bound shows that the dependence of the approximation ratio for HST metrics on $ε$ can not be removed or greatly improved. Our novel methods and techniques for both the upper and lower bound may find additional applications.

14.3LGOct 1, 2019

Estimating Smooth GLM in Non-interactive Local Differential Privacy Model with Public Unlabeled Data

Di Wang, Lijie Hu, Huanyu Zhang et al.

In this paper, we study the problem of estimating smooth Generalized Linear Models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Different from its classical setting, our model allows the server to access some additional public but unlabeled data. In the first part of the paper we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by the Stein's lemma, we present an $(ε, δ)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $α$ (with high probability) is ${O}(p α^{-2})$ and $\tilde{O}(p^3α^{-2}ε^{-2})$ respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known exponential or quasi-polynomial in $α^{-1}$, or exponential in $p$ sample complexities of GLMs with no public data. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(ε, δ)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $α$ is ${O}(p^2α^{-2})$ and $\tilde{O}(p^2α^{-2}ε^{-2})$ respectively, under some mild assumptions and if $α$ is not too small ({\em i.e.,} $α\geq Ω(\frac{1}{\sqrt{p}})$). In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets.

17.0CRSep 10, 2019

A Programming Framework for Differential Privacy with Accuracy Concentration Bounds

Elisabet Lobo-Vesga, Alejandro Russo, Marco Gaboardi

Differential privacy offers a formal framework for reasoning about privacy and accuracy of computations on private data. It also offers a rich set of building blocks for constructing data analyses. When carefully calibrated, these analyses simultaneously guarantee privacy of the individuals contributing their data, and accuracy of their results for inferring useful properties about the population. The compositional nature of differential privacy has motivated the design and implementation of several programming languages aimed at helping a data analyst in programming differentially private analyses. However, most of the programming languages for differential privacy proposed so far provide support for reasoning about privacy but not for reasoning about the accuracy of data analyses. To overcome this limitation, in this work we present DPella, a programming framework providing data analysts with support for reasoning about privacy, accuracy and their trade-offs. The distinguishing feature of DPella is a novel component which statically tracks the accuracy of different data analyses. In order to make tighter accuracy estimations, this component leverages taint analysis for automatically inferring statistical independence of the different noise quantities added for guaranteeing privacy. We show the flexibility of our approach by not only implementing classical counting queries (e.g., CDFs) but also by analyzing hierarchical counting queries (like those done by Census Bureaus), where accuracy have different constraints per level and data analysts should figure out the best manner to calibrate privacy to meet the accuracy requirements.

12.8LGMay 29, 2019

Privacy Amplification by Mixing and Diffusion Mechanisms

Borja Balle, Gilles Barthe, Marco Gaboardi et al.

A fundamental result in differential privacy states that the privacy guarantees of a mechanism are preserved by any post-processing of its output. In this paper we investigate under what conditions stochastic post-processing can amplify the privacy of a mechanism. By interpreting post-processing as the application of a Markov operator, we first give a series of amplification results in terms of uniform mixing properties of the Markov process defined by said operator. Next we provide amplification bounds in terms of coupling arguments which can be applied in cases where uniform mixing is not available. Finally, we introduce a new family of mechanisms based on diffusion processes which are closed under post-processing, and analyze their privacy via a novel heat flow argument. On the applied side, we generalize the analysis of "privacy amplification by iteration" in Noisy SGD and show it admits an exponential improvement in the strongly convex case, and study a mechanism based on the Ornstein-Uhlenbeck diffusion process which contains the Gaussian mechanism with optimal post-processing on bounded inputs as a special case.

21.7LGMay 24, 2019

Hypothesis Testing Interpretations and Renyi Differential Privacy

Borja Balle, Gilles Barthe, Marco Gaboardi et al.

Differential privacy is a de facto standard in data privacy, with applications in the public and private sectors. A way to explain differential privacy, which is particularly appealing to statistician and social scientists is by means of its statistical hypothesis testing interpretation. Informally, one cannot effectively test whether a specific individual has contributed her data by observing the output of a private mechanism---any test cannot have both high significance and high power. In this paper, we identify some conditions under which a privacy definition given in terms of a statistical divergence satisfies a similar interpretation. These conditions are useful to analyze the distinguishability power of divergences and we use them to study the hypothesis testing interpretation of some relaxations of differential privacy based on Renyi divergence. This analysis also results in an improved conversion rule between these definitions and differential privacy.

33.4LGJul 4, 2018

Privacy Amplification by Subsampling: Tight Analyses via Couplings and Divergences

Borja Balle, Gilles Barthe, Marco Gaboardi

Differential privacy comes equipped with multiple analytical tools for the design of private data analyses. One important tool is the so-called "privacy amplification by subsampling" principle, which ensures that a differentially private mechanism run on a random subsample of a population provides higher privacy guarantees than when run on the entire population. Several instances of this principle have been studied for different random subsampling methods, each with an ad-hoc analysis. In this paper we present a general method that recovers and improves prior analyses, yields lower bounds and derives new instances of privacy amplification by subsampling. Our method leverages a characterization of differential privacy as a divergence which emerged in the program verification community. Furthermore, it introduces new tools, including advanced joint convexity and privacy profiles, which might be of independent interest.

9.8LGFeb 12, 2018

Empirical Risk Minimization in Non-interactive Local Differential Privacy: Efficiency and High Dimensional Case

Di Wang, Marco Gaboardi, Jinhui Xu

In this paper, we study the Empirical Risk Minimization problem in the non-interactive local model of differential privacy. In the case of constant or low dimensionality ($p\ll n$), we first show that if the ERM loss function is $(\infty, T)$-smooth, then we can avoid a dependence of the sample complexity, to achieve error $α$, on the exponential of the dimensionality $p$ with base $1/α$ (i.e., $α^{-p}$), which answers a question in [smith 2017 interaction]. Our approach is based on polynomial approximation. Then, we propose player-efficient algorithms with $1$-bit communication complexity and $O(1)$ computation cost for each player. The error bound is asymptotically the same as the original one. Also with additional assumptions we show a server efficient algorithm. Next we consider the high dimensional case ($n\ll p$), we show that if the loss function is Generalized Linear function and convex, then we could get an error bound which is dependent on the Gaussian width of the underlying constrained set instead of $p$, which is lower than that in [smith 2017 interaction].

21.8STSep 21, 2017

Local Private Hypothesis Testing: Chi-Square Tests

Marco Gaboardi, Ryan Rogers

The local model for differential privacy is emerging as the reference model for practical applications collecting and sharing sensitive information while satisfying strong privacy guarantees. In the local model, there is no trusted entity which is allowed to have each individual's raw data as is assumed in the traditional curator model for differential privacy. So, individuals' data are usually perturbed before sharing them. We explore the design of private hypothesis tests in the local model, where each data entry is perturbed to ensure the privacy of each participant. Specifically, we analyze locally private chi-square tests for goodness of fit and independence testing, which have been studied in the traditional, curator model for differential privacy.

28.1CRSep 14, 2016

PSI (Ψ): a Private data Sharing Interface

Marco Gaboardi, James Honaker, Gary King et al.

We provide an overview of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.

20.0STFeb 7, 2016

Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing

Marco Gaboardi, Hyun woo Lim, Ryan Rogers et al.

Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables. We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance. We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.