Reliable Clustering of Bernoulli Mixture Models
This work addresses a foundational challenge in clustering for applications like population genetics and social network analysis, offering theoretical guarantees but is incremental in nature.
The paper tackles the problem of clustering Bernoulli Mixture Models (BMMs) with an unknown number of clusters by establishing theoretical conditions for PAC-clusterability, providing the first non-asymptotic bounds on sample complexity.
A Bernoulli Mixture Model (BMM) is a finite mixture of random binary vectors with independent dimensions. The problem of clustering BMM data arises in a variety of real-world applications, ranging from population genetics to activity analysis in social networks. In this paper, we analyze the clusterability of BMMs from a theoretical perspective, when the number of clusters is unknown. In particular, we stipulate a set of conditions on the sample complexity and dimension of the model in order to guarantee the Probably Approximately Correct (PAC)-clusterability of a dataset. To the best of our knowledge, these findings are the first non-asymptotic bounds on the sample complexity of learning or clustering BMMs.