LGNov 22, 2017

Learning Discrete Distributions from Untrusted Batches

arXiv:1711.08113v135 citations
Originality Highly original
AI Analysis

This addresses a robustness issue in statistical learning for scenarios with untrusted data sources, offering theoretical guarantees for secure distribution estimation.

The paper tackles the problem of learning a discrete distribution from batches of data where some sources are malicious, providing algorithms that recover the distribution with error bounds that are information-theoretically optimal, achieving error O(η + ε/√k) with polynomial runtime in key parameters.

We consider the problem of learning a discrete distribution in the presence of an $ε$ fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, $p$, and each data source provides a batch of $\ge k$ samples, with the guarantee that at least a $(1-ε)$ fraction of the sources draw their samples from a distribution with total variation distance at most $η$ from $p$. We make no assumptions on the data provided by the remaining $ε$ fraction of sources--this data can even be chosen as an adversarial function of the $(1-ε)$ fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, $n$, but polynomial in $k$, $1/ε$ and $1/η$ that takes $O((n+k)/ε^2)$ batches and recovers $p$ to error $O(η+ε/\sqrt{k})$. This recovery accuracy is information theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the $η= 0$ setting and also achieves an $O(ε/\sqrt{k})$ recover guarantee, though it runs in $\mathrm{poly}((nk)^k)$ time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing $\ell_1$ distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes