Rocco A. Servedio

h-index43

20papers

450citations

Novelty58%

AI Score57

Ranked #5,766 of 194,257 authors (top 3%)#1,551 in LG (top 4%)

20 Papers

7.3CCMar 21

Halfspaces are hard to test with relative error

Xi Chen, Anindya De, Yizhi Huang et al.

Several recent works [DHLNSY25, CPPS25a, CPPS25b] have studied a model of property testing of Boolean functions under a \emph{relative-error} criterion. In this model, the distance from a target function $f: \{0,1\}^n \to \{0,1\}$ that is being tested to a function $g$ is defined relative to the number of inputs $x$ for which $f(x)=1$; moreover, testing algorithms in this model have access both to a black-box oracle for $f$ and to independent uniform satisfying assignments of $f$. The motivation for this model is that it provides a natural framework for testing \emph{sparse} Boolean functions that have few satisfying assignments, analogous to well-studied models for property testing of sparse graphs. The main result of this paper is a lower bound for testing \emph{halfspaces} (i.e., linear threshold functions) in the relative error model: we show that $\tildeÎ©(\log n)$ oracle calls are required for any relative-error halfspace testing algorithm over the Boolean hypercube $\{0,1\}^n$. This stands in sharp contrast both with the constant-query testability (independent of $n$) of halfspaces in the standard model [MORS10], and with the positive results for relative-error testing of many other classes given in [DHLNSY25, CPPS25a, CPPS25b]. Our lower bound for halfspaces gives the first example of a well-studied class of functions for which relative-error testing is provably more difficult than standard-model testing.

6.5DSMay 20

Model-agnostic super-resolution in high dimensions

Xi Chen, Anindya De, Yizhi Huang et al.

The problem of super-resolution, roughly speaking, is to reconstruct an unknown signal to high accuracy, given (potentially noisy) information about its low-degree Fourier coefficients. Prior results on super-resolution have imposed strong modeling assumptions on the signal, typically requiring that it is a linear combination of spatially separated point sources. In this work we analyze a very general version of the super-resolution problem by considering completely general non-negative signals (equivalently, distributions) over the $d$-dimensional torus $[0,1)^d$; we do not assume any spatial separation between point sources, or even that the distribution is a finite linear combination of point sources. The question naturally arises: what can be said about super-resolution in such a general setting? - As a warm-up, we first give a set of results for reconstructing distributions under the Wasserstein distance. We establish essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible: we show that for $d$-dimensional distributions, estimates of $\approx \exp(d)$ many Fourier coefficients are both necessary and sufficient for accurate Wasserstein reconstruction. - As our main result, we define a new notion of "heavy hitter" reconstruction for distributions, which essentially amounts to achieving high-accuracy reconstruction of all "sufficiently dense" regions of the distribution. We give essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible under this notion. Our results show that (in sharp contrast with Wasserstein reconstruction) accurate estimates of only $\approx \exp(\sqrt{d})$ many Fourier coefficients are both necessary and sufficient for heavy hitter reconstruction.

11.4LGNov 12, 2025

Is nasty noise actually harder than malicious noise?

Guy Blanc, Yizhi Huang, Tal Malkin et al.

We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner. We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings: For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class ${\cal C}$ of functions is efficiently learnable in the presence of $η$-rate malicious noise, then it is also efficiently learnable in the presence of $η$-rate nasty noise. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value $r$ there exists a concept class for which there is a ratio of $r$ between the rate $η_{malicious}$ of malicious noise that polynomial-time learning algorithms can tolerate, versus the rate $η_{nasty}$ of nasty noise that such learning algorithms can tolerate. To offset the negative result for the fixed-distribution setting, we define a broad and natural class of algorithms, namely those that ignore contradictory examples (ICE). We show that for these algorithms, malicious noise and nasty noise are equivalent up to a factor of two in the noise rate: Any efficient ICE learner that succeeds with $η$-rate malicious noise can be converted to an efficient learner that succeeds with $η/2$-rate nasty noise. We further show that the above factor of two is necessary, again under a standard cryptographic assumption.

9.0CCApr 28

Improved Bounds for Coin Flipping, Leader Election, and Random Selection

Eshan Chattopadhyay, Mohit Gurumukhani, Noam Ringach et al.

Random selection, leader election, and collective coin flipping are fundamental tasks in fault-tolerant distributed computing. We study these problems in the full-information model where despite decades of study, key gaps remain in our understanding of the trade-offs between round complexity, communication per player in each round, and adversarial resilience. We make progress by proving improved bounds for these problems. We first show that any $k$-round coin flipping protocol over $\ell$ players, each player sending one bit per round, can be biased by $O(\ell/\log^{(k)}(\ell))$ bad players. We obtain a similar lower bound for leader election. This strengthens prior best bounds [RSZ, SICOMP 2002] of $O(\ell/\log^{(2k-1)}(\ell))$ for coin flipping protocols and $O(\ell/\log^{(2k+1)}(\ell))$ for leader election protocols. Our result implies that any (1-bit per player) protocol tolerating linear fraction of bad players requires at least $\log^* \ell$ rounds, showing existing protocols [RZ, JCSS 2001; F, FOCS 1999] are near-optimal. We next initiate the study of one-round, (1-bit per player) random selection. For all $m\ge (\log(\ell))^2$, we obtain an optimal protocol (a first in the full information model for any task): We construct a protocol resilient to $O(\ell / m)$ bad players that outputs $m$ uniform random bits. And, we show that any protocol that outputs $m$ uniform random bits can be corrupted using $O(\ell / m)$ bad players. This also implies a one-round leader election protocol resilient to $\ell / (\log \ell)^2$ bad players, improving the prior best protocol [RZ, JCSS 2001] which was resilient to $\ell / (\log \ell)^3$ bad players. Our resilience matches that of the best one-round coin flipping protocol by Ajtai & Linial. To obtain our lower bound, we introduce multi-output influence, an extension of influence of boolean functions to the multi-output setting.

3.1MLNov 22, 2024

Sparsifying Suprema of Gaussian Processes

Anindya De, Shivam Nadimpalli, Ryan O'Donnell et al.

We give a dimension-independent sparsification result for suprema of centered Gaussian processes: Let $T$ be any (possibly infinite) bounded set of vectors in $\mathbb{R}^n$, and let $\{\boldsymbol{X}_t := t \cdot \boldsymbol{g} \}_{t\in T}$ be the canonical Gaussian process on $T$, where $\boldsymbol{g}\sim N(0, I_n)$. We show that there is an $O_\varepsilon(1)$-size subset $S \subseteq T$ and a set of real values $\{c_s\}_{s \in S}$ such that the random variable $\sup_{s \in S} \{\boldsymbol{X}_s + c_s\}$ is an $\varepsilon$-approximator\,(in $L^1$) of the random variable $\sup_{t \in T} {\boldsymbol{X}}_t$. Notably, the size of the sparsifier $S$ is completely independent of both $|T|$ and the ambient dimension $n$. We give two applications of this sparsification theorem: - A "Junta Theorem" for Norms: We show that given any norm $ν(x)$ on $\mathbb{R}^n$, there is another norm $ψ(x)$ depending only on the projection of $x$ onto $O_\varepsilon(1)$ directions, for which $ψ({\boldsymbol{g}})$ is a multiplicative $(1 \pm \varepsilon)$-approximation of $ν({\boldsymbol{g}})$ with probability $1-\varepsilon$ for ${\boldsymbol{g}} \sim N(0,I_n)$. - Sparsification of Convex Sets: We show that any intersection of (possibly infinitely many) halfspaces in $\mathbb{R}^n$ that are at distance $r$ from the origin is $\varepsilon$-close (under $N(0,I_n)$) to an intersection of only $O_{r,\varepsilon}(1)$ halfspaces. This yields new polynomial-time \emph{agnostic learning} and \emph{tolerant property testing} algorithms for intersections of halfspaces.

10.3MLNov 11, 2025

The Probably Approximately Correct Learning Model in Computational Learning Theory

Rocco A. Servedio

This survey paper gives an overview of various known results on learning classes of Boolean functions in Valiant's Probably Approximately Correct (PAC) learning model and its commonly studied variants.

8.7LGFeb 10, 2022

Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals

Daniel Hsu, Clayton Sanford, Rocco Servedio et al.

We consider the well-studied problem of learning intersections of halfspaces under the Gaussian distribution in the challenging \emph{agnostic learning} model. Recent work of Diakonikolas et al. (2021) shows that any Statistical Query (SQ) algorithm for agnostically learning the class of intersections of $k$ halfspaces over $\mathbb{R}^n$ to constant excess error either must make queries of tolerance at most $n^{-\tildeΩ(\sqrt{\log k})}$ or must make $2^{n^{Ω(1)}}$ queries. We strengthen this result by improving the tolerance requirement to $n^{-\tildeΩ(\log k)}$. This lower bound is essentially best possible since an SQ algorithm of Klivans et al. (2008) agnostically learns this class to any constant excess error using $n^{O(\log k)}$ queries of tolerance $n^{-O(\log k)}$. We prove two variants of our lower bound, each of which combines ingredients from Diakonikolas et al. (2021) with (an extension of) a different earlier approach for agnostic SQ lower bounds for the Boolean setting due to Dachman-Soled et al. (2014). Our approach also yields lower bounds for agnostically SQ learning the class of "convex subspace juntas" (studied by Vempala, 2010) and the class of sets with bounded Gaussian surface area; all of these lower bounds are nearly optimal since they essentially match known upper bounds from Klivans et al. (2008).

3.1LGDec 8, 2021

The perils of being unhinged: On the accuracy of classifiers minimizing a noise-robust convex loss

Philip M. Long, Rocco A. Servedio

Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.

18.9LGFeb 3, 2021

On the Approximation Power of Two-Layer Networks of Random ReLUs

Daniel Hsu, Clayton Sanford, Rocco A. Servedio et al.

This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper- and lower-bounds for $L_2$-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments.

3.3DSJul 12, 2019

Efficient average-case population recovery in the presence of insertions and deletions

Frank Ban, Xi Chen, Rocco A. Servedio et al.

Several recent works have considered the \emph{trace reconstruction problem}, in which an unknown source string $x\in\{0,1\}^n$ is transmitted through a probabilistic channel which may randomly delete coordinates or insert random bits, resulting in a \emph{trace} of $x$. The goal is to reconstruct the original string~$x$ from independent traces of $x$. While the best algorithms known for worst-case strings use $\exp(O(n^{1/3}))$ traces \cite{DOS17,NazarovPeres17}, highly efficient algorithms are known \cite{PZ17,HPP18} for the \emph{average-case} version, in which $x$ is uniformly random. We consider a generalization of this average-case trace reconstruction problem, which we call \emph{average-case population recovery in the presence of insertions and deletions}. In this problem, there is an unknown distribution $\cal{D}$ over $s$ unknown source strings $x^1,\dots,x^s \in \{0,1\}^n$, and each sample is independently generated by drawing some $x^i$ from $\cal{D}$ and returning an independent trace of $x^i$. Building on \cite{PZ17} and \cite{HPP18}, we give an efficient algorithm for this problem. For any support size $s \leq \smash{\exp(Θ(n^{1/3}))}$, for a $1-o(1)$ fraction of all $s$-element support sets $\{x^1,\dots,x^s\} \subset \{0,1\}^n$, for every distribution $\cal{D}$ supported on $\{x^1,\dots,x^s\}$, our algorithm efficiently recovers ${\cal D}$ up to total variation distance $ε$ with high probability, given access to independent traces of independent draws from $\cal{D}$. The algorithm runs in time poly$(n,s,1/ε)$ and its sample complexity is poly$(s,1/ε,\exp(\log^{1/3}n)).$ This polynomial dependence on the support size $s$ is in sharp contrast with the \emph{worst-case} version (when $x^1,\dots,x^s$ may be any strings in $\{0,1\}^n$), in which the sample complexity of the most efficient known algorithm \cite{BCFSS19} is doubly exponential in $s$.

7.3DSJul 2, 2019

Learning from satisfying assignments under continuous distributions

Clément L. Canonne, Anindya De, Rocco A. Servedio

What kinds of functions are learnable from their satisfying assignments? Motivated by this simple question, we extend the framework of De, Diakonikolas, and Servedio [DDS15], which studied the learnability of probability distributions over $\{0,1\}^n$ defined by the set of satisfying assignments to "low-complexity" Boolean functions, to Boolean-valued functions defined over continuous domains. In our learning scenario there is a known "background distribution" $\mathcal{D}$ over $\mathbb{R}^n$ (such as a known normal distribution or a known log-concave distribution) and the learner is given i.i.d. samples drawn from a target distribution $\mathcal{D}_f$, where $\mathcal{D}_f$ is $\mathcal{D}$ restricted to the satisfying assignments of an unknown low-complexity Boolean-valued function $f$. The problem is to learn an approximation $\mathcal{D}'$ of the target distribution $\mathcal{D}_f$ which has small error as measured in total variation distance. We give a range of efficient algorithms and hardness results for this problem, focusing on the case when $f$ is a low-degree polynomial threshold function (PTF). When the background distribution $\mathcal{D}$ is log-concave, we show that this learning problem is efficiently solvable for degree-1 PTFs (i.e.,~linear threshold functions) but not for degree-2 PTFs. In contrast, when $\mathcal{D}$ is a normal distribution, we show that this learning problem is efficiently solvable for degree-2 PTFs but not for degree-4 PTFs. Our hardness results rely on standard assumptions about secure signature schemes.

4.1LGNov 9, 2018

Density estimation for shift-invariant multidimensional distributions

Anindya De, Philip M. Long, Rocco A. Servedio

We study density estimation for classes of shift-invariant distributions over $\mathbb{R}^d$. A multidimensional distribution is "shift-invariant" if, roughly speaking, it is close in total variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness assumptions commonly used in non-parametric density estimation to allow jump discontinuities. The different classes of distributions that we consider correspond to different rates of tail decay. For each such class we give an efficient algorithm that learns any distribution in the class from independent samples with respect to total variation distance. As a special case of our general result, we show that $d$-dimensional shift-invariant distributions which satisfy an exponential tail bound can be learned to total variation distance error $ε$ using $\tilde{O}_d(1/ ε^{d+2})$ examples and $\tilde{O}_d(1/ ε^{2d+2})$ time. This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/ε^{2d+2})$ time using $\tilde{O}_d(1/ε^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016] All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-ε,ε)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(ε)$ from the target distribution. We show that our general results are close to best possible by proving a simple $Ω\left(1/ε^d\right)$ information-theoretic lower bound on sample complexity even for learning bounded distributions that are shift-invariant.

4.1LGNov 3, 2018

Learning sparse mixtures of rankings from noisy information

Anindya De, Ryan O'Donnell, Rocco Servedio

We study the problem of learning an unknown mixture of $k$ rankings over $n$ elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the "heat kernel" noise framework and the Mallows model. For each of these noise models we give an algorithm which, under mild assumptions, learns the unknown mixture to high accuracy and runs in $n^{O(\log k)}$ time. The best previous algorithms for closely related problems have running times which are exponential in $k$.

13.0DSJul 18, 2018

Learning Sums of Independent Random Variables with Sparse Collective Support

Anindya De, Philip M. Long, Rocco A. Servedio

We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For $\mathcal{A} \subset \mathbf{Z}_{+}$, a sum of independent random variables with collective support $\mathcal{A}$} (called an $\mathcal{A}$-sum in this paper) is a distribution $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$ where the $\mathbf{X}_i$'s are mutually independent (but not necessarily identically distributed) integer random variables with $\cup_i \mathsf{supp}(\mathbf{X}_i) \subseteq \mathcal{A}.$ We give two main algorithmic results for learning such distributions: 1. For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $ε$ that uses $\mathsf{poly}(1/ε)$ samples and runs in time $\mathsf{poly}(1/ε)$, independent of $N$ and of the elements of $\mathcal{A}$. 2. For an arbitrary constant $k \geq 4$, if $\mathcal{A} = \{ a_1,...,a_k\}$ with $0 \leq a_1 < ... < a_k$, we give an algorithm that uses $\mathsf{poly}(1/ε) \cdot \log \log a_k$ samples (independent of $N$) and runs in time $\mathsf{poly}(1/ε, \log a_k).$ We prove an essentially matching lower bound: if $|\mathcal{A}| = 4$, then any algorithm must use $Ω(\log \log a_4) $ samples even for learning to constant accuracy. We also give similar-in-spirit (but quantitatively very different) algorithmic results, and essentially matching lower bounds, for the case in which $\mathcal{A}$ is not known to the learner.

5.9DSMar 4, 2017

Sharp bounds for population recovery

Anindya De, Ryan O'Donnell, Rocco Servedio

The population recovery problem is a basic problem in noisy unsupervised learning that has attracted significant research attention in recent years [WY12,DRWY12, MS13, BIMP13, LZ15,DST16]. A number of different variants of this problem have been studied, often under assumptions on the unknown distribution (such as that it has restricted support size). In this work we study the sample complexity and algorithmic complexity of the most general version of the problem, under both bit-flip noise and erasure noise model. We give essentially matching upper and lower sample complexity bounds for both noise models, and efficient algorithms matching these sample complexity bounds up to polynomial factors.

8.6CCDec 9, 2016

Optimal mean-based algorithms for trace reconstruction

Anindya De, Ryan O'Donnell, Rocco Servedio

In the (deletion-channel) trace reconstruction problem, there is an unknown $n$-bit source string $x$. An algorithm is given access to independent traces of $x$, where a trace is formed by deleting each bit of~$x$ independently with probability~$δ$. The goal of the algorithm is to recover~$x$ exactly (with high probability), while minimizing samples (number of traces) and running time. Previously, the best known algorithm for the trace reconstruction problem was due to Holenstein~et~al.; it uses $\exp(\tilde{O}(n^{1/2}))$ samples and running time for any fixed $0 < δ< 1$. It is also what we call a "mean-based algorithm", meaning that it only uses the empirical means of the individual bits of the traces. Holenstein~et~al.~also gave a lower bound, showing that any mean-based algorithm must use at least $n^{\tildeΩ(\log n)}$ samples. In this paper we improve both of these results, obtaining matching upper and lower bounds for mean-based trace reconstruction. For any constant deletion rate $0 < δ< 1$, we give a mean-based algorithm that uses $\exp(O(n^{1/3}))$ time and traces; we also prove that any mean-based algorithm must use at least $\exp(Ω(n^{1/3}))$ traces. In fact, we obtain matching upper and lower bounds even for $δ$ subconstant and $ρ:= 1-δ$ subconstant: when $(\log^3 n)/n \ll δ\leq 1/2$ the bound is $\exp(-Θ(δn)^{1/3})$, and when $1/\sqrt{n} \ll ρ\leq 1/2$ the bound is $\exp(-Θ(n/ρ)^{1/3})$. Our proofs involve estimates for the maxima of Littlewood polynomials on complex disks. We show that these techniques can also be used to perform trace reconstruction with random insertions and bit-flips in addition to deletions. We also find a surprising result: for deletion probabilities $δ> 1/2$, the presence of insertions can actually help with trace reconstruction.

12.2LGNov 1, 2014

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio et al.

Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of {\em density estimation}, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $ε$, we give an algorithm that makes $\tilde{O}(k/ε^2)$ draws from $p$, runs in $\tilde{O}(k/ε^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/ε))$ pieces. With high probability the hypothesis $h$ satisfies $d_{\mathrm{TV}}(p,h) \leq C \cdot \mathrm{opt}_k(p) + ε$, where $d_{\mathrm{TV}}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The "approximation factor" $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $ε$ can achieve $C<2$ regardless of what kind of hypothesis distribution it uses.

26.0LGMay 14, 2013

Efficient Density Estimation via Piecewise Polynomial Approximation

Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio et al.

We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $τ$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t\new{(d+1)}/\eps^2)$ samples from $p$, runs in time $\poly(t,d,1/\eps)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(τ)+\eps)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $τ=0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $\eps$ must use $Ω({\frac {t(d+1)} {\poly(1 + \log(d+1))}} \cdot {\frac 1 {\eps^2}})$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming. We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains. These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.

23.1LGOct 2, 2012

Learning mixtures of structured distributions over discrete domains

Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio et al.

Let $\mathfrak{C}$ be a class of probability distributions over the discrete domain $[n] = \{1,...,n\}.$ We show that if $\mathfrak{C}$ satisfies a rather general condition -- essentially, that each distribution in $\mathfrak{C}$ can be well-approximated by a variable-width histogram with few bins -- then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of $k$ unknown distributions from $\mathfrak{C}.$ We analyze several natural types of distributions over $[n]$, including log-concave, monotone hazard rate and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems.

8.6CCJun 5, 2012

Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces

Anindya De, Ilias Diakonikolas, Vitaly Feldman et al.

The \emph{Chow parameters} of a Boolean function $f: \{-1,1\}^n \to \{-1,1\}$ are its $n+1$ degree-0 and degree-1 Fourier coefficients. It has been known since 1961 (Chow, Tannenbaum) that the (exact values of the) Chow parameters of any linear threshold function $f$ uniquely specify $f$ within the space of all Boolean functions, but until recently (O'Donnell and Servedio) nothing was known about efficient algorithms for \emph{reconstructing} $f$ (exactly or approximately) from exact or approximate values of its Chow parameters. We refer to this reconstruction problem as the \emph{Chow Parameters Problem.} Our main result is a new algorithm for the Chow Parameters Problem which, given (sufficiently accurate approximations to) the Chow parameters of any linear threshold function $f$, runs in time $\tilde{O}(n^2)\cdot (1/\eps)^{O(\log^2(1/\eps))}$ and with high probability outputs a representation of an LTF $f'$ that is $\eps$-close to $f$. The only previous algorithm (O'Donnell and Servedio) had running time $\poly(n) \cdot 2^{2^{\tilde{O}(1/\eps^2)}}.$ As a byproduct of our approach, we show that for any linear threshold function $f$ over $\{-1,1\}^n$, there is a linear threshold function $f'$ which is $\eps$-close to $f$ and has all weights that are integers at most $\sqrt{n} \cdot (1/\eps)^{O(\log^2(1/\eps))}$. This significantly improves the best previous result of Diakonikolas and Servedio which gave a $\poly(n) \cdot 2^{\tilde{O}(1/\eps^{2/3})}$ weight bound, and is close to the known lower bound of $\max\{\sqrt{n},$ $(1/\eps)^{Ω(\log \log (1/\eps))}\}$ (Goldberg, Servedio). Our techniques also yield improved algorithms for related problems in learning theory.