Albert Cheu

CR
h-index38
15papers
886citations
Novelty54%
AI Score44

15 Papers

32.5CRMar 23Code
Hardening Confidential Federated Compute against Side-channel Attacks

James Bell-Clark, Albert Cheu, Adria Gascon et al.

In this work, we identify a set of side-channels in our Confidential Federated Compute platform that a hypothetical insider could exploit to circumvent differential privacy (DP) guarantees. We show how DP can mitigate two of the side-channels, one of which has been implemented in our open-source library.

CRApr 16, 2024
Confidential Federated Computations

Hubert Eichner, Daniel Ramage, Kallista Bonawitz et al.

Federated Learning and Analytics (FLA) have seen widespread adoption by technology platforms for processing sensitive on-device data. However, basic FLA systems have privacy limitations: they do not necessarily require anonymization mechanisms like differential privacy (DP), and provide limited protections against a potentially malicious service provider. Adding DP to a basic FLA system currently requires either adding excessive noise to each device's updates, or assuming an honest service provider that correctly implements the mechanism and only uses the privatized outputs. Secure multiparty computation (SMPC) -based oblivious aggregations can limit the service provider's access to individual user updates and improve DP tradeoffs, but the tradeoffs are still suboptimal, and they suffer from scalability challenges and susceptibility to Sybil attacks. This paper introduces a novel system architecture that leverages trusted execution environments (TEEs) and open-sourcing to both ensure confidentiality of server-side computations and provide externally verifiable privacy properties, bolstering the robustness and trustworthiness of private federated computations.

LGFeb 10, 2025
Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?

Marika Swanberg, Ryan McKenna, Edo Roth et al.

Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. Recent advancements in large language models (LLMs) have inspired a number of algorithm techniques for improving DP synthetic data generation. One family of approaches uses DP finetuning on the foundation model weights; however, the model weights for state-of-the-art models may not be public. In this work we propose two DP synthetic tabular data algorithms that only require API access to the foundation model. We adapt the Private Evolution algorithm (Lin et al., 2023; Xie et al., 2024) -- which was designed for image and text data -- to the tabular data domain. In our extension of Private Evolution, we define a query workload-based distance measure, which may be of independent interest. We propose a family of algorithms that use one-shot API access to LLMs, rather than adaptive queries to the LLM. Our findings reveal that API-access to powerful LLMs does not always improve the quality of DP synthetic data compared to established baselines that operate without such access. We provide insights into the underlying reasons and propose improvements to LLMs that could make them more effective for this application.

CRDec 13, 2024
Differentially Private Multi-Sampling from Distributions

Albert Cheu, Debanuj Nayak

Many algorithms have been developed to estimate probability distributions subject to differential privacy (DP): such an algorithm takes as input independent samples from a distribution and estimates the density function in a way that is insensitive to any one sample. A recent line of work, initiated by Raskhodnikova et al. (Neurips '21), explores a weaker objective: a differentially private algorithm that approximates a single sample from the distribution. Raskhodnikova et al. studied the sample complexity of DP \emph{single-sampling} i.e., the minimum number of samples needed to perform this task. They showed that the sample complexity of DP single-sampling is less than the sample complexity of DP learning for certain distribution classes. We define two variants of \emph{multi-sampling}, where the goal is to privately approximate $m>1$ samples. This better models the realistic scenario where synthetic data is needed for exploratory data analysis. A baseline solution to \emph{multi-sampling} is to invoke a single-sampling algorithm $m$ times on independently drawn datasets of samples. When the data comes from a finite domain, we improve over the baseline by a factor of $m$ in the sample complexity. When the data comes from a Gaussian, Ghazi et al. (Neurips '23) show that \emph{single-sampling} can be performed under approximate differential privacy; we show it is possible to \emph{single- and multi-sample Gaussians with known covariance subject to pure DP}. Our solution uses a variant of the Laplace mechanism that is of independent interest. We also give sample complexity lower bounds, one for strong multi-sampling of finite distributions and another for weak multi-sampling of bounded-covariance Gaussians.

CRDec 19, 2021
Pure Differential Privacy from Secure Intermediaries

Albert Cheu, Chao Yan

Recent work in differential privacy has explored the prospect of combining local randomization with a secure intermediary. Specifically, there are a variety of protocols in the secure shuffle model (where an intermediary randomly permutes messages) as well as the secure aggregation model (where an intermediary adds messages). Most of these protocols are limited to approximate differential privacy. An exception is the shuffle protocol by Ghazi, Golowich, Kumar, Manurangsi, Pagh, and Velingker (arXiv:2002.01919): it computes bounded sums under pure differential privacy. Its additive error is $\tilde{O}(1/\varepsilon^{3/2})$, where $\varepsilon$ is the privacy parameter. In this work, we give a new protocol that ensures $O(1/\varepsilon)$ error under pure differential privacy. We also show how to use it to test uniformity of distributions over $[d]$. The tester's sample complexity has an optimal dependence on $d$. Our work relies on a novel class of secure intermediaries which are of independent interest.

CRJul 25, 2021
Differential Privacy in the Shuffle Model: A Survey of Separations

Albert Cheu

Differential privacy is often studied in one of two models. In the central model, a single analyzer has the responsibility of performing a privacy-preserving computation on data. But in the local model, each data owner ensures their own privacy. Although it removes the need to trust the analyzer, local privacy comes at a price: a locally private protocol is less accurate than a centrally private counterpart when solving many learning and estimation problems. Protocols in the shuffle model are designed to attain the best of both worlds: recent work has shown high accuracy is possible with only a mild trust assumption. This survey paper gives an overview of novel shuffle protocols, along with lower bounds that establish the limits of the new model. We also summarize work that show the promise of interactivity in the shuffle model.

LGJun 17, 2021
Shuffle Private Stochastic Convex Optimization

Albert Cheu, Matthew Joseph, Jieming Mao et al.

In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. We present interactive shuffle protocols for stochastic convex optimization. Our protocols rely on a new noninteractive protocol for summing vectors of bounded $\ell_2$ norm. By combining this sum subroutine with mini-batch stochastic gradient descent, accelerated gradient descent, and Nesterov's smoothing method, we obtain loss guarantees for a variety of convex loss functions that significantly improve on those of the local model and sometimes match those of the central model.

CRApr 6, 2021
Differentially Private Histograms in the Shuffle Model from Fake Users

Albert Cheu, Maxim Zhilyaev

There has been much recent work in the shuffle model of differential privacy, particularly for approximate $d$-bin histograms. While these protocols achieve low error, the number of messages sent by each user -- the message complexity -- has so far scaled with $d$ or the privacy parameters. The message complexity is an informative predictor of a shuffle protocol's resource consumption. We present a protocol whose message complexity is two when there are sufficiently many users. The protocol essentially pairs each row in the dataset with a fake row and performs a simple randomization on all rows. We show that the error introduced by the protocol is small, using rigorous analysis as well as experiments on real-world data. We also prove that corrupt users have a relatively low impact on our protocol's estimates.

DSSep 17, 2020
The Limits of Pan Privacy and Shuffle Privacy for Learning and Estimation

Albert Cheu, Jonathan Ullman

There has been a recent wave of interest in intermediate trust models for differential privacy that eliminate the need for a fully trusted central data collector, but overcome the limitations of local differential privacy. This interest has led to the introduction of the shuffle model (Cheu et al., EUROCRYPT 2019; Erlingsson et al., SODA 2019) and revisiting the pan-private model (Dwork et al., ITCS 2010). The message of this line of work is that, for a variety of low-dimensional problems -- such as counts, means, and histograms -- these intermediate models offer nearly as much power as central differential privacy. However, there has been considerably less success using these models for high-dimensional learning and estimation problems. In this work, we show that, for a variety of high-dimensional learning and estimation problems, both the shuffle model and the pan-private model inherently incur an exponential price in sample complexity relative to the central model. For example, we show that, private agnostic learning of parity functions over $d$ bits requires $Ω(2^{d/2})$ samples in these models, and privately selecting the most common attribute from a set of $d$ choices requires $Ω(d^{1/2})$ samples, both of which are exponential separations from the central model. Our work gives the first non-trivial lower bounds for these problems for both the pan-private model and the general multi-message shuffle model.

LGApr 23, 2020
Private Query Release Assisted by Public Data

Raef Bassily, Albert Cheu, Shay Moran et al.

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. First, we show that we can solve the problem for any query class $\mathcal{H}$ of finite VC-dimension using only $d/α$ public samples and $\sqrt{p}d^{3/2}/α^2$ private samples, where $d$ and $p$ are the VC-dimension and dual VC-dimension of $\mathcal{H}$, respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size $d/α^2$ is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on $p$ and $α$. For the class of decision stumps, we give a lower bound of $\sqrt{p}/α$ on the private sample complexity whenever the public sample size is less than $1/α^2$. Given our upper bounds, this shows that the dependence on $\sqrt{p}$ is necessary in the private sample complexity. We also give a lower bound of $1/α$ on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in $α$.

CRApr 20, 2020
Connecting Robust Shuffle Privacy and Pan-Privacy

Victor Balcer, Albert Cheu, Matthew Joseph et al.

In the \emph{shuffle model} of differential privacy, data-holding users send randomized messages to a secure shuffler, the shuffler permutes the messages, and the resulting collection of messages must be differentially private with regard to user data. In the \emph{pan-private} model, an algorithm processes a stream of data while maintaining an internal state that is differentially private with regard to the stream data. We give evidence connecting these two apparently different models. Our results focus on \emph{robustly} shuffle private protocols, whose privacy guarantees are not greatly affected by malicious users. First, we give robustly shuffle private protocols and upper bounds for counting distinct elements and uniformity testing. Second, we use pan-private lower bounds to prove robustly shuffle private lower bounds for both problems. Focusing on the dependence on the domain size $k$, we find that robust approximate shuffle privacy and approximate pan-privacy have additive error $Θ(\sqrt{k})$ for counting distinct elements. For uniformity testing, we give a robust approximate shuffle private protocol with sample complexity $\tilde O(k^{2/3})$ and show that an $Ω(k^{2/3})$ dependence is necessary for any robust pure shuffle private tester. Finally, we show that this connection is useful in both directions: we give a pan-private adaptation of recent work on shuffle private histograms and use it to recover further separations between pan-privacy and interactive local privacy.

CRNov 15, 2019
Separating Local & Shuffled Differential Privacy via Histograms

Victor Balcer, Albert Cheu

Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users' hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This implies an arbitrarily large gap in sample complexity between the shuffled and local models. On the other hand, the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.

DSSep 20, 2019
Manipulation Attacks in Local Differential Privacy

Albert Cheu, Adam Smith, Jonathan Ullman

Local differential privacy is a widely studied restriction on distributed algorithms that collect aggregates about sensitive user data, and is now deployed in several large systems. We initiate a systematic study of a fundamental limitation of locally differentially private protocols: they are highly vulnerable to adversarial manipulation. While any algorithm can be manipulated by adversaries who lie about their inputs, we show that any non-interactive locally differentially private protocol can be manipulated to a much greater extent. Namely, when the privacy level is high or the input domain is large, an attacker who controls a small fraction of the users in the protocol can completely obscure the distribution of the users' inputs. We also show that existing protocols differ greatly in their resistance to manipulation, even when they offer the same accuracy guarantee with honest execution. Our results suggest caution when deploying local differential privacy and reinforce the importance of efficient cryptographic techniques for emulating mechanisms from central differential privacy in distributed settings.

CRAug 4, 2018
Distributed Differential Privacy via Shuffling

Albert Cheu, Adam Smith, Jonathan Ullman et al.

We consider the problem of designing scalable, robust protocols for computing statistics about sensitive data. Specifically, we look at how best to design differentially private protocols in a distributed setting, where each user holds a private datum. The literature has mostly considered two models: the "central" model, in which a trusted server collects users' data in the clear, which allows greater accuracy; and the "local" model, in which users individually randomize their data, and need not trust the server, but accuracy is limited. Attempts to achieve the accuracy of the central model without a trusted server have so far focused on variants of cryptographic MPC, which limits scalability. In this paper, we initiate the analytic study of a shuffled model for distributed differentially private algorithms, which lies between the local and central models. This simple-to-implement model, a special case of the ESA framework of [Bittau et al., '17], augments the local model with an anonymous channel that randomly permutes a set of user-supplied messages. For sum queries, we show that this model provides the power of the central model while avoiding the need to trust a central server and the complexity of cryptographic secure function evaluation. More generally, we give evidence that the power of the shuffled model lies strictly between those of the central and local models: for a natural restriction of the model, we show that shuffled protocols for a widely studied selection problem require exponentially higher sample complexity than do central-model protocols.

LGNov 12, 2017
Skyline Identification in Multi-Armed Bandits

Albert Cheu, Ravi Sundaram, Jonathan Ullman

We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of $n$ arms $A[1],\dots,A[n]$, each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the $skyline$ of the set $A$, consisting of all arms $A[i]$ such that $A[i]$ has larger expected reward than all lower-numbered arms $A[1],\dots,A[i-1]$. We define a natural notion of an $\varepsilon$-approximate skyline and prove matching upper and lower bounds for identifying an $\varepsilon$-skyline. Specifically, we show that in order to identify an $\varepsilon$-skyline from among $n$ arms with probability $1-δ$, $$ Θ\bigg(\frac{n}{\varepsilon^2} \cdot \min\bigg\{ \log\bigg(\frac{1}{\varepsilon δ}\bigg), \log\bigg(\frac{n}δ\bigg) \bigg\} \bigg) $$ samples are necessary and sufficient. When $\varepsilon \gg 1/n$, our results improve over the naive algorithm, which draws enough samples to approximate the expected reward of every arm; the algorithm of (Auer et al., AISTATS'16) for Pareto-optimal arm identification is likewise superseded. Our results show that the sample complexity of the skyline problem lies strictly in between that of best arm identification (Even-Dar et al., COLT'02) and that of approximating the expected reward of every arm.