DS AI LG MLSep 3, 2024

Differentially Private Kernel Density Estimation

Erzhi Liu, Jerry Yao-Chieh Hu, Alex Reneau, Zhao Song, Han Liu

arXiv:2409.01688v35.98 citationsh-index: 14

Originality Incremental advance

AI Analysis

This work addresses the challenge of efficient and accurate private data analysis for machine learning and statistics, representing an incremental improvement over existing methods.

The paper tackles the problem of differentially private kernel density estimation by introducing a refined data structure that improves privacy-utility tradeoffs and efficiency, achieving reductions in query time by a factor of α^{-1} log n, improving the approximation ratio from α to 1, and reducing error dependence by a factor of α^{-0.5} compared to prior work.

We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the mathematical problem: given a similarity function $f$ (or DP KDE) and a private dataset $X \subset \mathbb{R}^d$, our goal is to preprocess $X$ so that for any query $y\in\mathbb{R}^d$, we approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) =\| x - y \|_1$ is the node-contaminated balanced binary tree by [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing with $n=|X|$. For any query point, the query time is $d \log n$, with an error guarantee of $(1+α)$-approximation and $ε^{-1} α^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects: - We reduce query time by a factor of $α^{-1} \log n$. - We improve the approximation ratio from $α$ to 1. - We reduce the error dependence by a factor of $α^{-0.5}$. From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into $α^{-1} \log n$ numbers, each derived from the summation of $\log n$ values in interval tree countings. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, where each is a smart combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.

View on arXiv PDF

Similar