LG DSApr 10, 2015

Learning Arbitrary Statistical Mixtures of Discrete Distributions

Jian Li, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy

arXiv:1504.02526v19.021 citations

Originality Incremental advance

AI Analysis

This addresses a foundational challenge in unsupervised learning, such as for topic models and collaborative filtering, by providing the first efficient method for general mixtures, though it is incremental in extending existing mixture learning frameworks.

The paper tackles the problem of learning arbitrary statistical mixture models from unlabeled, noisy samples without structural assumptions, achieving efficient algorithms with bounds on solution quality based on sample size and number.

We study the problem of learning from unlabeled samples very general statistical mixture models on large finite sets. Specifically, the model to be learned, $\vartheta$, is a probability distribution over probability distributions $p$, where each such $p$ is a probability distribution over $[n] = \{1,2,\dots,n\}$. When we sample from $\vartheta$, we do not observe $p$ directly, but only indirectly and in very noisy fashion, by sampling from $[n]$ repeatedly, independently $K$ times from the distribution $p$. The problem is to infer $\vartheta$ to high accuracy in transportation (earthmover) distance. We give the first efficient algorithms for learning this mixture model without making any restricting assumptions on the structure of the distribution $\vartheta$. We bound the quality of the solution as a function of the size of the samples $K$ and the number of samples used. Our model and results have applications to a variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.

View on arXiv PDF

Similar