DS IT LG MLNov 19, 2023

Testing with Non-identically Distributed Samples

Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant

arXiv:2311.11194v22.31 citationsh-index: 10

Originality Incremental advance

AI Analysis

This addresses challenges in statistical inference for heterogeneous data sources, such as individuals or spatial data, but is incremental as it extends known i.i.d. results to non-identical settings.

The paper tackles the problem of property testing and estimation when samples are drawn from non-identically distributed sources, focusing on learning or testing properties of the average distribution. It shows that with one sample per distribution, linear samples in support size are necessary for testing uniformity or identity, but with two or more samples per distribution, sublinear sample guarantees similar to the i.i.d. setting are achievable, with specific sample complexities like O(√k/ε² + 1/ε⁴).

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $Θ(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d.\ setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d.\ case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $ρ> 0$ such that even in the linear regime with $ρk$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.

View on arXiv PDF

Similar