DS CR LG · Jul 14, 2023

Differentially Private Clustering in Data Streams

Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong

arXiv:2307.07449v33.32 citations · h-index: 22

Originality: Highly original

AI Analysis

This work addresses privacy concerns in streaming clustering for applications such as real-time data analysis, providing the first differentially private streaming clustering algorithms that use sublinear space in the continual release setting.

The paper tackles the problem of performing differentially private k-means and k-median clustering on streaming data, achieving sublinear space usage with bounded multiplicative and additive error guarantees.

Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been studied extensively. However, as data privacy has become a central concern in many real-world applications, non-private clustering algorithms may not be applicable in many scenarios. In this work, we provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$ that use space sublinear in $T$ in the continual release setting, where the algorithm is required to output a clustering at every timestep. We achieve (1) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot \mathrm{poly}(d,\log(T)))$ space and $\mathrm{poly}(k,d,\log(T))$ additive error, or (2) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(\mathrm{poly}(k,2^{O_\gamma(d)},\log(T)))$ space and $\mathrm{poly}(k,2^{O_\gamma(d)},\log(T))$ additive error, for any $\gamma>0$. Our main technical contribution is a differentially private clustering framework for data streams that requires only an offline DP coreset or clustering algorithm as a black box.
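
For reference, the multiplicative and additive error guarantees quoted above are usually read as follows. This is a sketch of the standard formulation, not the paper's exact statement; the symbols $P_t$, $C_t$, $\alpha$, $\beta$, $z$, and $\mathrm{OPT}_t$ are notation assumed here for illustration. Let $P_t$ be the set of points that have arrived by timestep $t \le T$, and let $C_t$ be the set of $k$ centers released at time $t$. With $z = 2$ for $k$-means and $z = 1$ for $k$-median:

$$
\mathrm{cost}(P_t, C) = \sum_{x \in P_t} \min_{c \in C} \lVert x - c \rVert_2^{\,z},
\qquad
\mathrm{OPT}_t = \min_{C \subset \mathbb{R}^d,\ |C| = k} \mathrm{cost}(P_t, C),
$$

$$
\mathrm{cost}(P_t, C_t) \;\le\; \alpha \cdot \mathrm{OPT}_t + \beta
\quad \text{for every timestep } t .
$$

Under this reading, result (1) corresponds to $\alpha = O(1)$ with $\beta = \mathrm{poly}(k,d,\log(T))$, and result (2) to $\alpha = 1+\gamma$ with $\beta = \mathrm{poly}(k,2^{O_\gamma(d)},\log(T))$. Because differentially private algorithms are randomized, a bound of this kind is typically stated to hold with high probability.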
