Sofya Raskhodnikova

DS
h-index: 29
6 papers
146 citations
Novelty: 62%
AI Score: 46

6 Papers

LG Nov 15, 2022
Differentially Private Sampling from Distributions

Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith et al.

We initiate an investigation of private sampling from distributions. Given a dataset with $n$ independent observations from an unknown distribution $P$, a sampling algorithm must output a single observation from a distribution that is close in total variation distance to $P$ while satisfying differential privacy. Sampling abstracts the goal of generating small amounts of realistic-looking data. We provide tight upper and lower bounds for the dataset size needed for this task for three natural families of distributions: arbitrary distributions on $\{1,\ldots ,k\}$, arbitrary product distributions on $\{0,1\}^d$, and product distributions on $\{0,1\}^d$ with bias in each coordinate bounded away from 0 and 1. We demonstrate that, in some parameter regimes, private sampling requires asymptotically fewer observations than learning a description of $P$ nonprivately; in other regimes, however, private sampling proves to be as difficult as private learning. Notably, for some classes of distributions, the overhead in the number of observations needed for private learning compared to non-private learning is completely captured by the number of observations needed for private sampling.
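The accuracy requirement above is stated in total variation distance. As a minimal sketch of that notion only (not of the paper's private samplers), the function below computes the distance between two distributions on $\{1,\ldots,k\}$ given as probability vectors. The function name and the list-based representation are illustrative assumptions.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions on {1, ..., k}.

    p[i] and q[i] are the probabilities assigned to outcome i + 1.
    Illustrative helper; the name and representation are assumptions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support size")
    # For discrete distributions, TV distance is half the L1 distance.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

For example, `total_variation([0.5, 0.5], [0.75, 0.25])` evaluates half of `0.25 + 0.25`.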

DS Apr 1
Local Node Differential Privacy

Sofya Raskhodnikova, Adam Smith, Connor Wagaman et al.

We initiate an investigation of node differential privacy for graphs in the local model of private data analysis. In our model, dubbed LNDP*, each node sees its own edge list and releases the output of a local randomizer on this input. These outputs are aggregated by an untrusted server to obtain a final output. We develop a novel algorithmic framework for this setting that allows us to accurately answer arbitrary linear queries about the input graph's degree distribution. Our framework is based on a new object, called the blurry degree distribution, which closely approximates the degree distribution and has lower sensitivity. Instead of answering queries about the degree distribution directly, our algorithms answer queries about the blurry degree distribution. This framework yields accurate LNDP* algorithms for the edge count, PMF and CDF of the degree distribution, and other graph statistics. For some natural problems, our algorithms match the accuracy achievable with node privacy in the central model, where data are held and processed by a trusted server. We also prove lower bounds on the error required by LNDP* algorithms that imply the optimality of our framework for edge counting in sparse graphs and Erdős-Rényi parameter estimation. Our lower bounds apply even to interactive protocols with a constant number of rounds of interaction between the nodes and the server. Existing lower-bound techniques for related models either yield loose bounds or do not apply in our setting, as graph data results in inherently overlapping inputs to local randomizers. To prove our bounds, we develop a splicing argument that stitches together views from locally similar but globally different distributions on graphs to obtain hard instances. Finally, we prove structural results that reveal qualitative differences between local node privacy and the standard local model for tabular data.
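To make "linear queries about the degree distribution" concrete, here is a non-private sketch showing that the edge count and the CDF are both weighted sums over the degree histogram. It does not implement the blurry degree distribution or any local randomizer. The adjacency-dictionary input and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def degree_histogram(adj: Dict[int, List[int]]) -> List[int]:
    """hist[d] = number of nodes of degree d, for d = 0, ..., n - 1."""
    n = len(adj)
    counts = Counter(len(set(neighbors)) for neighbors in adj.values())
    return [counts.get(d, 0) for d in range(n)]


def linear_query(hist: Sequence[int], weights: Sequence[float]) -> float:
    """A linear query is a weighted sum of the histogram entries."""
    return sum(w * h for w, h in zip(weights, hist))


def edge_count(hist: Sequence[int]) -> float:
    # Each edge contributes to two degrees, so weight degree d by d / 2.
    return linear_query(hist, [d / 2 for d in range(len(hist))])


def degree_cdf(hist: Sequence[int], t: int) -> float:
    # Fraction of nodes with degree at most t.
    n = sum(hist)
    return linear_query(hist, [1 / n if d <= t else 0.0 for d in range(len(hist))])
```

On the path graph `{0: [1], 1: [0, 2], 2: [1]}`, the histogram has two nodes of degree 1 and one node of degree 2.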

DS Oct 20, 2025
Fast Agnostic Learners in the Plane

Talya Eden, Ludmila Glinskih, Sofya Raskhodnikova

We investigate the computational efficiency of agnostic learning for several fundamental geometric concept classes in the plane. While the sample complexity of agnostic learning is well understood, its time complexity has received much less attention. We study the class of triangles and, more generally, the class of convex polygons with $k$ vertices for small $k$, as well as the class of convex sets in a square. We present a proper agnostic learner for the class of triangles that has optimal sample complexity and runs in time $\tilde O(ε^{-6})$, improving on the algorithm of Dobkin and Gunopulos (COLT '95) that runs in time $\tilde O(ε^{-10})$. For 4-gons and 5-gons, we improve the running time from $O(ε^{-12})$, achieved by Fischer and Kwek (eCOLT '96), to $\tilde O(ε^{-8})$ and $\tilde O(ε^{-10})$, respectively. We also design a proper agnostic learner for convex sets under the uniform distribution over a square with running time $\tilde O(ε^{-5})$, improving on the previous $\tilde O(ε^{-8})$ bound at the cost of slightly higher sample complexity. Notably, agnostic learning of convex sets in $[0,1]^2$ under general distributions is impossible because this concept class has infinite VC-dimension. Our agnostic learners use data structures and algorithms from computational geometry and their analysis relies on tools from geometry and probabilistic combinatorics. Because our learners are proper, they yield tolerant property testers with matching running times. Our results raise a fundamental question of whether a gap between the sample and time complexity is inherent for agnostic learning of these and other natural concept classes.

DS Dec 1, 2021
The Price of Differential Privacy under Continual Observation

Palak Jain, Sofya Raskhodnikova, Satchit Sivakumar et al.

We study the accuracy of differentially private mechanisms in the continual release model. A continual release mechanism receives a sensitive dataset as a stream of $T$ inputs and produces, after receiving each input, an accurate output on the obtained inputs. In contrast, a batch algorithm receives the data as one batch and produces a single output. We provide the first strong lower bounds on the error of continual release mechanisms. In particular, for two fundamental problems that are widely studied and used in the batch model, we show that the worst-case error of every continual release algorithm is $\tilde Ω(T^{1/3})$ times larger than that of the best batch algorithm. Previous work shows only a polylogarithmic (in $T$) gap between the worst-case error achievable in these two models; further, for many problems, including the summation of binary attributes, the polylogarithmic gap is tight (Dwork et al., 2010; Chan et al., 2010). Our results show that problems closely related to summation -- specifically, those that require selecting the largest of a set of sums -- are fundamentally harder in the continual release model than in the batch model. Our lower bounds assume only that privacy holds for streams fixed in advance (the "nonadaptive" setting). However, we provide matching upper bounds that hold in a model where privacy is required even for adaptively selected streams. This model may be of independent interest.
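For context on the summation baseline mentioned above, the following is a minimal sketch of the binary-tree counter associated with the cited prior work (Dwork et al., 2010; Chan et al., 2010). It is not a mechanism from this paper. It releases a noisy running sum after every input by adding Laplace noise to sums over dyadic intervals. The function names, the noise calibration (`levels / epsilon` per dyadic sum), and the explicit `rng` argument are illustrative assumptions.

```python
import random
from typing import List, Sequence


def laplace(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def binary_mechanism(stream: Sequence[int], epsilon: float,
                     rng: random.Random) -> List[float]:
    """Noisy prefix sums of a 0/1 stream, one output per input."""
    levels = max(1, len(stream).bit_length())  # number of dyadic levels
    scale = levels / epsilon  # each input touches at most `levels` sums
    exact = [0] * levels      # exact dyadic partial sums
    noisy = [0.0] * levels    # their noisy counterparts
    outputs = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1  # index of the lowest set bit of t
        # Merge all lower-level sums and the new input into level i.
        exact[i] = sum(exact[:i]) + x
        for j in range(i):
            exact[j], noisy[j] = 0, 0.0
        noisy[i] = exact[i] + laplace(scale, rng)
        # The set bits of t name the dyadic intervals that cover [1, t].
        outputs.append(sum(noisy[j] for j in range(levels) if (t >> j) & 1))
    return outputs
```

Each output combines at most `levels` noisy sums, which is the source of the polylogarithmic error in $T$ for summation.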

CR Apr 29, 2015
Efficient Lipschitz Extensions for High-Dimensional Graph Statistics and Node Private Degree Distributions

Sofya Raskhodnikova, Adam Smith

Lipschitz extensions were recently proposed as a tool for designing node differentially private algorithms. However, efficiently computable Lipschitz extensions were known only for 1-dimensional functions (that is, functions that output a single real value). In this paper, we study efficiently computable Lipschitz extensions for multi-dimensional (that is, vector-valued) functions on graphs. We show that, unlike for 1-dimensional functions, Lipschitz extensions of higher-dimensional functions on graphs do not always exist, even with a non-unit stretch. We design Lipschitz extensions with small stretch for the sorted degree list and for the degree distribution of a graph. Crucially, our extensions are efficiently computable. We also develop new tools for employing Lipschitz extensions in the design of differentially private algorithms. Specifically, we generalize the exponential mechanism, a widely used tool in data privacy. The exponential mechanism is given a collection of score functions that map datasets to real values. It attempts to return the name of the function with nearly minimum value on the data set. Our generalized exponential mechanism provides better accuracy when the sensitivity of an optimal score function is much smaller than the maximum sensitivity of score functions. We use our Lipschitz extension and the generalized exponential mechanism to design a node-differentially private algorithm for releasing an approximation to the degree distribution of a graph. Our algorithm is much more accurate than algorithms from previous work.
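The standard exponential mechanism described above can be sketched as follows. This is the classical version with a single global sensitivity bound, not the generalized mechanism the paper introduces. It samples a name with probability proportional to $\exp(-ε \cdot \text{score} / (2Δ))$, so lower scores are favored. The dictionary input, the parameter names, and the explicit `rng` argument are assumptions made for illustration.

```python
import math
import random
from typing import Dict


def exponential_mechanism(scores: Dict[str, float], epsilon: float,
                          sensitivity: float, rng: random.Random) -> str:
    """Return the name of a score function with nearly minimum value.

    scores[name] is that score function's value on the dataset;
    sensitivity is an upper bound on every score function's sensitivity.
    """
    names = list(scores)
    best = min(scores.values())
    # Shifting by the minimum leaves the distribution unchanged and
    # keeps the largest weight equal to 1, which avoids overflow.
    weights = [math.exp(-epsilon * (scores[name] - best) / (2 * sensitivity))
               for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With a large `epsilon` the output concentrates on the minimizer; with a small `epsilon` it approaches a uniform choice over the names.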

LG Aug 10, 2012
Learning pseudo-Boolean k-DNF and Submodular Functions

Sofya Raskhodnikova, Grigory Yaroslavtsev

We prove that any submodular function f: {0,1}^n -> {0,1,...,k} can be represented as a pseudo-Boolean 2k-DNF formula. Pseudo-Boolean DNFs are a natural generalization of DNF representation for functions with integer range. Each term in such a formula has an associated integral constant. We show that an analog of Håstad's switching lemma holds for pseudo-Boolean k-DNFs if all constants associated with the terms of the formula are bounded. This allows us to generalize Mansour's PAC-learning algorithm for k-DNFs to pseudo-Boolean k-DNFs, and hence gives a PAC-learning algorithm with membership queries under the uniform distribution for submodular functions of the form f: {0,1}^n -> {0,1,...,k}. Our algorithm runs in time polynomial in n, k^{O(k \log k / ε)}, 1/ε and log(1/δ) and works even in the agnostic setting. The line of previous work on learning submodular functions [Balcan, Harvey (STOC '11), Gupta, Hardt, Roth, Ullman (STOC '11), Cheraghchi, Klivans, Kothari, Lee (SODA '12)] implies only n^{O(k)} query complexity for learning submodular functions in this setting, for fixed ε and δ. Our learning algorithm implies a property tester for submodularity of functions f: {0,1}^n -> {0,...,k} with query complexity polynomial in n for k=O((\log n / \log \log n)^{1/2}) and constant proximity parameter ε.
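A small sketch may help fix the idea of a pseudo-Boolean DNF. The abstract says only that each term carries an integral constant; the evaluation rule used below (the maximum constant among satisfied terms, and 0 when no term is satisfied) is an assumption about the definition, as are the term encoding and the function name.

```python
from typing import Dict, Sequence, Tuple

# A term is (constant, literals); literals maps a variable index to the
# bit value that variable must take for the term to be satisfied.
Term = Tuple[int, Dict[int, int]]


def eval_pseudo_boolean_dnf(terms: Sequence[Term], x: Sequence[int]) -> int:
    """Evaluate the formula at x under the assumed max-of-constants rule."""
    satisfied = (
        constant
        for constant, literals in terms
        if all(x[i] == bit for i, bit in literals.items())
    )
    return max(satisfied, default=0)
```

For instance, with terms `[(2, {0: 1, 1: 1}), (1, {2: 0})]`, the first term is a width-2 conjunction with constant 2 and the second is a single negated literal with constant 1.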