LG DS SP MLSep 25, 2023

Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity

Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman

arXiv:2309.13993v17.75 citationsh-index: 42

Originality Incremental advance

AI Analysis

This provides a near-optimal algorithm for a fundamental problem in statistical learning, with applications in areas like clustering and latent variable models, though it is incremental in improving complexity bounds.

The paper tackles the problem of identifying mixtures of discrete product distributions from statistics, achieving sample and time complexity of (1/ζ)^{O(k)} for any n ≥ 2k-1, which improves upon the previous best of (1/ζ)^{O(k^2 log k)} and matches a lower bound of e^{Ω(k)} across a broad range of ζ.

We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/ζ)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $ζ$). The best known lower bound was $\exp(Ω(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/ζ)^{O(k)}$. We also extend the known lower bound of $e^{Ω(k)}$ to match our upper bound across a broad range of $ζ$. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.

View on arXiv PDF

Similar