Jacob Imola

LG
8papers
270citations
Novelty60%
AI Score45

8 Papers

LGJan 31, 2023
Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees

Jacob Imola, Alessandro Epasto, Mohammad Mahdian et al.

Hierarchical Clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by (Dasgupta, 2016). We show strong lower bounds for the problem: that any $ε$-DP algorithm must exhibit $O(|V|^2/ ε)$-additive error for an input dataset $V$. Then, we exhibit a polynomial-time approximation algorithm with $O(|V|^{2.5}/ ε)$-additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, with a separation assumption on the blocks, propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.

DSJan 11, 2023
Private estimation algorithms for stochastic block models and mixture models

Hongjie Chen, Vincent Cohen-Addad, Tommaso d'Orsi et al.

We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(ε, δ)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(ε, δ)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.

CRMay 7
Privacy by Postprocessing the Discrete Laplace Mechanism

Quentin Hillebrand, Jacob Imola, Rasmus Pagh et al.

We show that an "old dog", the classical discrete Laplace (aka.~geometric) mechanism, can "perform new tricks": 1. It can be post-processed to yield a simple, unbiased estimator of any subexponential function $f$ of the original data, giving a simple, discrete, multivariate version of the recent unbiasing result for the Laplace mechanism by Calmon et al. (FORC '25). 2. It can be post-processed to output the same distribution as the Laplace mechanism or the Staircase mechanism with identical privacy parameters. Thus, the discrete Laplace mechanism is a versatile mechanism that should be preferred over the Laplace and Staircase mechanisms whenever the data is discrete (or can be made discrete while controlling $\ell_1$-sensitivity). We show bounds on the variance of our estimator, compared to the mean square error of the biased estimator that simply evaluates the $f$ on the output of the mechanism. Though our unbiased estimator has exponential running time for worst-case functions, we show that it can often be computed in linear or polynomial time for some common functions exhibiting structure. We showcase the properties of our methods empirically with several use cases including profile and entropy estimation, as well as distributed/federated data analysis applications in which unbiasedness is key to accuracy.

CROct 13, 2021
Communication-Efficient Triangle Counting under Local Differential Privacy

Jacob Imola, Takao Murakami, Kamalika Chaudhuri

Triangle counting in networks under LDP (Local Differential Privacy) is a fundamental task for analyzing connection patterns or calculating a clustering coefficient while strongly protecting sensitive friendships from a central server. In particular, a recent study proposes an algorithm for this task that uses two rounds of interaction between users and the server to significantly reduce estimation error. However, this algorithm suffers from a prohibitively high communication cost due to a large noisy graph each user needs to download. In this work, we propose triangle counting algorithms under LDP with a small estimation error and communication cost. We first propose two-rounds algorithms consisting of edge sampling and carefully selecting edges each user downloads so that the estimation error is small. Then we propose a double clipping technique, which clips the number of edges and then the number of noisy triangles, to significantly reduce the sensitivity of each user's query. Through comprehensive evaluation, we show that our algorithms dramatically reduce the communication cost of the existing algorithm, e.g., from 6 hours to 8 seconds or less at a 20 Mbps download rate, while keeping a small estimation error.

LGMay 21, 2021
Privacy Amplification Via Bernoulli Sampling

Jacob Imola, Kamalika Chaudhuri

Balancing privacy and accuracy is a major challenge in designing differentially private machine learning algorithms. One way to improve this tradeoff for free is to leverage the noise in common data operations that already use randomness. Such operations include noisy SGD and data subsampling. The additional noise in these operations may amplify the privacy guarantee of the overall algorithm, a phenomenon known as privacy amplification. In this paper, we analyze the privacy amplification of sampling from a multidimensional Bernoulli distribution family given the parameter from a private algorithm. This setup has applications to Bayesian inference and to data compression. We provide an algorithm to compute the amplification factor, and we establish upper and lower bounds on this factor.

LGFeb 18, 2021
Online $k$-means Clustering on Arbitrary Data Streams

Robi Bhattacharjee, Jacob Imola, Michal Moshkovitz et al.

We consider online $k$-means clustering where each new point is assigned to the nearest cluster center, after which the algorithm may update its centers. The loss incurred is the sum of squared distances from new points to their assigned cluster centers. The goal over a data stream $X$ is to achieve loss that is a constant factor of $L(X, OPT_k)$, the best possible loss using $k$ fixed points in hindsight. We propose a data parameter, $Λ(X)$, such that for any algorithm maintaining $O(k\text{poly}(\log n))$ centers at time $n$, there exists a data stream $X$ for which a loss of $Ω(Λ(X))$ is inevitable. We then give a randomized algorithm that achieves clustering loss $O(Λ(X) + L(X, OPT_k))$. Our algorithm uses $O(k\text{poly}(\log n))$ memory and maintains $O(k\text{poly}(\log n))$ cluster centers. Our algorithm also enjoys a running time of $O(k\text{poly}(\log n))$ and is the first algorithm to achieve polynomial space and time complexity in this setting. It also is the first to have provable guarantees without making any assumptions on the input data.

CROct 17, 2020
Locally Differentially Private Analysis of Graph Statistics

Jacob Imola, Takao Murakami, Kamalika Chaudhuri

Differentially private analysis of graphs is widely used for releasing statistics from sensitive graphs while still preserving user privacy. Most existing algorithms however are in a centralized privacy model, where a trusted data curator holds the entire graph. As this model raises a number of privacy and security issues -- such as, the trustworthiness of the curator and the possibility of data breaches, it is desirable to consider algorithms in a more decentralized local model where no server holds the entire graph. In this work, we consider a local model, and present algorithms for counting subgraphs -- a fundamental task for analyzing the connection patterns in a graph -- with LDP (Local Differential Privacy). For triangle counts, we present algorithms that use one and two rounds of interaction, and show that an additional round can significantly improve the utility. For $k$-star counts, we present an algorithm that achieves an order optimal estimation error in the non-interactive local model. We provide new lower-bounds on the estimation error for general graph statistics including triangle counts and $k$-star counts. Finally, we perform extensive experiments on two real datasets, and show that it is indeed possible to accurately estimate subgraph counts in the local differential privacy model.

LGJul 3, 2019
Capacity Bounded Differential Privacy

Kamalika Chaudhuri, Jacob Imola, Ashwin Machanavajjhala

Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded -- i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. We model adversaries in terms of restricted f-divergences between probability distributions, and study properties of the definition and algorithms that satisfy them.