Optimal Identity Testing with High Probability
This addresses a fundamental problem in distribution testing for researchers, providing optimal algorithms for high-confidence scenarios, though it is incremental as it builds on prior work in identity testing.
The paper tackles the problem of identity testing with high confidence, showing that black-box amplification is suboptimal for small failure probabilities and presenting a new tester that achieves optimal sample complexity, specifically Θ(1/ε²(√(n log(1/δ)) + log(1/δ))).
We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0< ε, δ< 1$, we wish to distinguish, {\em with probability at least $1-δ$}, whether the distributions are identical versus $\varepsilon$-far in total variation distance. Most prior work focused on the case that $δ= Ω(1)$, for which the sample complexity of identity testing is known to be $Θ(\sqrt{n}/ε^2)$. Given such an algorithm, one can achieve arbitrarily small values of $δ$ via black-box amplification, which multiplies the required number of samples by $Θ(\log(1/δ))$. We show that black-box amplification is suboptimal for any $δ= o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is \[ Θ\left( \frac{1}{ε^2}\left(\sqrt{n \log(1/δ)} + \log(1/δ) \right)\right) \] for any $n, \varepsilon$, and $δ$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $d_{\mathrm TV}(p, U_n) \geq \varepsilon$, we simply threshold $d_{\mathrm TV}(\widehat{p}, U_n)$, where $\widehat{p}$ is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant $δ$ case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of $\varepsilon$ and $δ$.