Covariance-Aware Private Mean Estimation Without Private Covariance Estimation
The result offers a more sample-efficient tool for privacy-preserving statistical analysis in machine learning and data science. The contribution is incremental in that it builds on existing differential privacy frameworks, adding novel technical steps to make them private and sample-efficient in this setting.
The paper tackles differentially private mean estimation for high-dimensional Gaussian distributions with unknown covariance. Its estimators need only n ≳ d/α² samples to output an estimate whose error in Mahalanobis distance is at most α, whereas previous estimators with the same guarantee required either strong a priori bounds on the covariance matrix or Ω(d^{3/2}) samples.
We present two sample-efficient differentially private mean estimators for $d$-dimensional (sub)Gaussian distributions with unknown covariance. Informally, given $n \gtrsim d/\alpha^2$ samples from such a distribution with mean $\mu$ and covariance $\Sigma$, our estimators output $\tilde{\mu}$ such that $\| \tilde{\mu} - \mu \|_{\Sigma} \leq \alpha$, where $\| \cdot \|_{\Sigma}$ is the Mahalanobis distance. All previous estimators with the same guarantee either require strong a priori bounds on the covariance matrix or require $\Omega(d^{3/2})$ samples. Each of our estimators is based on a simple, general approach to designing differentially private mechanisms, but with novel technical steps to make the estimator private and sample-efficient. Our first estimator samples a point with approximately maximum Tukey depth using the exponential mechanism, but restricted to the set of points of large Tukey depth. Its accuracy guarantees hold even for data sets that have a small amount of adversarial corruption. Proving that this mechanism is private requires a novel analysis. Our second estimator perturbs the empirical mean of the data set with noise calibrated to the empirical covariance, without releasing the covariance itself. Its sample complexity guarantees hold more generally for subgaussian distributions, albeit with a slightly worse dependence on the privacy parameter. For both estimators, careful preprocessing of the data is required to satisfy differential privacy.
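To make the error metric and the shape of the second estimator concrete, the following is a minimal Python sketch, not the paper's algorithm. It computes the Mahalanobis error $\sqrt{(\tilde{\mu} - \mu)^\top \Sigma^{-1} (\tilde{\mu} - \mu)}$ and perturbs the empirical mean with Gaussian noise whose covariance is a multiple of the empirical covariance. The function names and the `noise_scale` parameter are hypothetical, chosen here for illustration. The sketch omits the preprocessing that the abstract says is required and does not calibrate the noise to any privacy parameter, so it is not differentially private as written.

```python
import numpy as np


def mahalanobis_error(estimate, mu, sigma):
    """Return ||estimate - mu||_Sigma = sqrt(diff^T Sigma^{-1} diff)."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma z = diff rather than forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))


def covariance_shaped_perturbation(data, noise_scale, rng):
    """Empirical mean plus Gaussian noise shaped by the empirical covariance.

    Illustrative only: `noise_scale` is a hypothetical knob, and no
    preprocessing is performed, so this is NOT differentially private.
    Assumes `data` has shape (n, d) with d >= 2 and n > d.
    """
    data = np.asarray(data, dtype=float)
    empirical_mean = data.mean(axis=0)
    empirical_cov = np.cov(data, rowvar=False)  # d x d matrix
    noise = rng.multivariate_normal(
        np.zeros(data.shape[1]), (noise_scale ** 2) * empirical_cov
    )
    # Only the noisy mean is returned; the covariance is never output.
    return empirical_mean + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 2000
    a = rng.normal(size=(d, d))
    sigma = a @ a.T + np.eye(d)  # an arbitrary positive definite covariance
    mu = np.ones(d)
    data = rng.multivariate_normal(mu, sigma, size=n)
    estimate = covariance_shaped_perturbation(data, noise_scale=0.01, rng=rng)
    print(mahalanobis_error(estimate, mu, sigma))
```

Because the noise is shaped like the data, its size in Mahalanobis distance does not depend on how stretched or skewed $\Sigma$ is, which is the motivation for measuring error in $\| \cdot \|_{\Sigma}$ rather than in Euclidean distance.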