Generalized Identifiability Bounds for Mixture Models with Grouped Samples
This provides improved sample efficiency for mixture model identifiability, with applications in topic modeling, but is incremental as it builds on prior work.
The paper generalizes identifiability bounds for mixture models, showing that with linear independence of subsets of k components, identifiability can be achieved with (2m-1)/(k-1) samples per group, and proves this bound is tight.
Recent work has shown that finite mixture models with $m$ components are identifiable, while making no assumptions on the mixture components, so long as one has access to groups of samples of size $2m-1$ which are known to come from the same mixture component. In this work we generalize that result and show that, if every subset of $k$ mixture components of a mixture model are linearly independent, then that mixture model is identifiable with only $(2m-1)/(k-1)$ samples per group. We further show that this value cannot be improved. We prove an analogous result for a stronger form of identifiability known as "determinedness" along with a corresponding lower bound. This independence assumption almost surely holds if mixture components are chosen randomly from a $k$-dimensional space. We describe some implications of our results for multinomial mixture models and topic modeling.