LGMar 22, 2023
Stability is Stable: Connections between Replicability, Privacy, and Adaptive GeneralizationMark Bun, Marco Gaboardi, Max Hopkins et al.
The notion of replicable algorithms was introduced in Impagliazzo et al. [STOC '22] to describe randomized algorithms that are stable under the resampling of their inputs. More precisely, a replicable algorithm gives the same output with high probability when its randomness is fixed and it is run on a new i.i.d. sample drawn from the same distribution. Using replicable algorithms for data analysis can facilitate the verification of published results by ensuring that the results of an analysis will be the same with high probability, even when that analysis is performed on a new data set. In this work, we establish new connections and separations between replicability and standard notions of algorithmic stability. In particular, we give sample-efficient algorithmic reductions between perfect generalization, approximate differential privacy, and replicability for a broad class of statistical problems. Conversely, we show any such equivalence must break down computationally: there exist statistical problems that are easy under differential privacy, but that cannot be solved replicably without breaking public-key cryptography. Furthermore, these results are tight: our reductions are statistically optimal, and we show that any computational separation between DP and replicability must imply the existence of one-way functions. Our statistical reductions give a new algorithmic framework for translating between notions of stability, which we instantiate to answer several open questions in replicability and privacy. This includes giving sample-efficient replicable algorithms for various PAC learning, distribution estimation, and distribution testing problems, algorithmic amplification of $δ$ in approximate DP, conversions from item-level to user-level privacy, and the existence of private agnostic-to-realizable learning reductions under structured distributions.
LGOct 23, 2025
Approximate Replicability in LearningMax Hopkins, Russell Impagliazzo, Christopher Ye
Replicability, introduced by (Impagliazzo et al. STOC '22), is the notion that algorithms should remain stable under a resampling of their inputs (given access to shared randomness). While a strong and interesting notion of stability, the cost of replicability can be prohibitive: there is no replicable algorithm, for instance, for tasks as simple as threshold learning (Bun et al. STOC '23). Given such strong impossibility results we ask: under what approximate notions of replicability is learning possible? In this work, we propose three natural relaxations of replicability in the context of PAC learning: (1) Pointwise: the learner must be consistent on any fixed input, but not across all inputs simultaneously, (2) Approximate: the learner must output hypotheses that classify most of the distribution consistently, (3) Semi: the algorithm is fully replicable, but may additionally use shared unlabeled samples. In all three cases, for constant replicability parameters, we obtain sample-optimal agnostic PAC learners: (1) and (2) are achievable for ``free" using $Θ(d/α^2)$ samples, while (3) requires $Θ(d^2/α^2)$ labeled samples.
MLJun 4, 2024
Replicability in High Dimensional StatisticsMax Hopkins, Russell Impagliazzo, Daniel Kane et al.
The replicability crisis is a major issue across nearly all areas of empirical science, calling for the formal study of replicability in statistics. Motivated in this context, [Impagliazzo, Lei, Pitassi, and Sorrell STOC 2022] introduced the notion of replicable learning algorithms, and gave basic procedures for $1$-dimensional tasks including statistical queries. In this work, we study the computational and statistical cost of replicability for several fundamental high dimensional statistical tasks, including multi-hypothesis testing and mean estimation. Our main contribution establishes a computational and statistical equivalence between optimal replicable algorithms and high dimensional isoperimetric tilings. As a consequence, we obtain matching sample complexity upper and lower bounds for replicable mean estimation of distributions with bounded covariance, resolving an open problem of [Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell, STOC2023] and for the $N$-Coin Problem, resolving a problem of [Karbasi, Velegkas, Yang, and Zhou, NeurIPS2023] up to log factors. While our equivalence is computational, allowing us to shave log factors in sample complexity from the best known efficient algorithms, efficient isoperimetric tilings are not known. To circumvent this, we introduce several relaxed paradigms that do allow for sample and computationally efficient algorithms, including allowing pre-processing, adaptivity, and approximate replicability. In these cases we give efficient algorithms matching or beating the best known sample complexity for mean estimation and the coin problem, including a generic procedure that reduces the standard quadratic overhead of replicability to linear in expectation.
LGJan 20, 2022
Reproducibility in LearningRussell Impagliazzo, Rex Lei, Toniann Pitassi et al.
We introduce the notion of a reproducible algorithm in the context of learning. A reproducible learning algorithm is resilient to variations in its samples -- with high probability, it returns the exact same output when run on two samples from the same underlying distribution. We begin by unpacking the definition, clarifying how randomness is instrumental in balancing accuracy and reproducibility. We initiate a theory of reproducible algorithms, showing how reproducibility implies desirable properties such as data reuse and efficient testability. Despite the exceedingly strong demand of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. First, we show that any statistical query algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces via a reproducible weak learner and a reproducible boosting algorithm. Finally, we initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight sample complexity upper and lower bounds for reproducible versus nonreproducible SQ algorithms.
LGJun 14, 2021
Boosting in the Presence of Massart NoiseIlias Diakonikolas, Russell Impagliazzo, Daniel Kane et al.
We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $η(x) \leq η$, where $η<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $η$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.
CCNov 12, 2020
Comparing computational entropies below majority (or: When is the dense model theorem false?)Russell Impagliazzo, Sam McGuire
Computational pseudorandomness studies the extent to which a random variable $\bf{Z}$ looks like the uniform distribution according to a class of tests $\cal{F}$. Computational entropy generalizes computational pseudorandomness by studying the extent which a random variable looks like a \emph{high entropy} distribution. There are different formal definitions of computational entropy with different advantages for different applications. Because of this, it is of interest to understand when these definitions are equivalent. We consider three notions of computational entropy which are known to be equivalent when the test class $\cal{F}$ is closed under taking majorities. This equivalence constitutes (essentially) the so-called \emph{dense model theorem} of Green and Tao (and later made explicit by Tao-Zeigler, Reingold et al., and Gowers). The dense model theorem plays a key role in Green and Tao's proof that the primes contain arbitrarily long arithmetic progressions and has since been connected to a surprisingly wide range of topics in mathematics and computer science, including cryptography, computational complexity, combinatorics and machine learning. We show that, in different situations where $\cal{F}$ is \emph{not} closed under majority, this equivalence fails. This in turn provides examples where the dense model theorem is \emph{false}.