On Differentially Private Subspace Estimation in a Distribution-Free Setting
This work addresses the high sample cost of private data analysis for datasets with low-dimensional structure, offering a novel approach that reduces the dependence on the ambient dimension, although it builds incrementally on prior work.
The paper tackles the problem of differentially private subspace estimation by introducing measures of dataset 'easiness' based on singular-value gaps, enabling dimension-independent sample complexity for certain instances. It provides new upper and lower bounds and a practical algorithm that outperforms prior methods in high-dimensional settings.
Private data analysis faces a significant challenge known as the curse of dimensionality, which leads to increased costs. However, many datasets possess an inherent low-dimensional structure. For instance, during optimization via gradient descent, the gradients frequently reside near a low-dimensional subspace. If the low-dimensional structure could be privately identified using a small number of points, we could avoid paying for the high ambient dimension. On the negative side, Dwork, Talwar, Thakurta, and Zhang (STOC 2014) proved that privately estimating subspaces, in general, requires a number of points that depends polynomially on the dimension. However, their bounds do not rule out the possibility of reducing the number of points for "easy" instances. Yet, providing a measure that captures how "easy" a given dataset is for this task turns out to be challenging, and was not properly addressed in prior works. Inspired by the work of Singhal and Steinke (NeurIPS 2021), we provide the first measures that quantify "easiness" as a function of multiplicative singular-value gaps in the input dataset, and support them with new upper and lower bounds. In particular, our results determine the first types of gaps that are sufficient and necessary for estimating a subspace with a number of points that is independent of the dimension. Furthermore, we realize our upper bounds using a practical algorithm and demonstrate its advantage in high-dimensional regimes compared to prior approaches.
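To make the notion of a multiplicative singular-value gap concrete, the following is a minimal sketch and not the paper's definition. It assumes one natural reading of the term, namely the ratio sigma_{k+1} / sigma_k of consecutive singular values, where a ratio close to 0 suggests the data lie near a k-dimensional subspace. The function names, the example spectrum, and the choice of k are all hypothetical. The singular values are taken as given and could come from any SVD routine.

```python
def multiplicative_gap(singular_values, k):
    """Return sigma_{k+1} / sigma_k (assumed reading of a multiplicative gap).

    Values near 0 indicate a sharp drop after the k-th singular value.
    """
    s = sorted(singular_values, reverse=True)
    if not 1 <= k < len(s):
        raise ValueError("k must satisfy 1 <= k < number of singular values")
    if s[k - 1] == 0:
        raise ValueError("sigma_k is zero, so the ratio is undefined")
    return s[k] / s[k - 1]


def additive_gap(singular_values, k):
    """Return sigma_k - sigma_{k+1}, shown only for contrast."""
    s = sorted(singular_values, reverse=True)
    if not 1 <= k < len(s):
        raise ValueError("k must satisfy 1 <= k < number of singular values")
    return s[k - 1] - s[k]


# Hypothetical spectrum: two dominant directions, then a sharp drop.
spectrum = [8.0, 4.0, 0.25, 0.125]
# The same dataset with every point multiplied by 100.
scaled = [100 * v for v in spectrum]

ratio = multiplicative_gap(spectrum, k=2)
ratio_scaled = multiplicative_gap(scaled, k=2)
diff = additive_gap(spectrum, k=2)
diff_scaled = additive_gap(scaled, k=2)

print(ratio, ratio_scaled)  # the ratio is unchanged by rescaling
print(diff, diff_scaled)    # the difference grows with the scale
```

Under this reading, the ratio is unaffected by rescaling the dataset, whereas the additive difference scales with the data. This is one reason a multiplicative quantity is a plausible candidate for describing how "easy" an instance is.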