Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data
This work addresses scalable Bayesian nonparametric modeling for distributed data and offers an incremental improvement in distributed DPMM estimation.
The paper tackles the challenge of handling new components efficiently and consistently in distributed estimation of Dirichlet Process Mixture Models (DPMMs). It proposes a method that allows components to be created locally and then merged probabilistically, achieving high scalability without compromising mixing performance in experiments on large real-world datasets.
We consider the estimation of Dirichlet Process Mixture Models (DPMMs) in distributed environments, where data are distributed across multiple computing nodes. A key advantage of Bayesian nonparametric models such as DPMMs is that they allow new components to be introduced on the fly as needed. This, however, poses an important challenge to distributed estimation: how to handle new components efficiently and consistently. To tackle this problem, we propose a new estimation method that allows new components to be created locally in individual computing nodes. Components corresponding to the same cluster are then identified and merged via a probabilistic consolidation scheme. In this way, we can maintain the consistency of estimation with very low communication cost. Experiments on large real-world data sets show that the proposed method can achieve high scalability in distributed and asynchronous environments without compromising the mixing performance.
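As a rough illustration of the two ingredients mentioned above, the following minimal sketch shows (a) the standard Chinese restaurant process prior, under which a new component is opened with probability proportional to the concentration parameter, and (b) how two locally created components can be combined by adding their sufficient statistics. It is not the paper's consolidation scheme: the rule for deciding which components to merge is not described in the abstract and is left out here. The names `Component`, `crp_prior_weights`, and `merge` and the one-dimensional Gaussian setting are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Sufficient statistics of a 1-D component (hypothetical layout)."""
    count: int = 0
    total: float = 0.0      # sum of assigned observations
    total_sq: float = 0.0   # sum of squared observations

    def add(self, x: float) -> None:
        """Assign one observation to this component."""
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        """Empirical mean of the assigned observations."""
        return self.total / self.count


def crp_prior_weights(counts, alpha):
    """CRP prior over assignments: one weight per existing component,
    followed by the weight of opening a new component (last entry).
    The data likelihood is deliberately ignored in this sketch."""
    n = sum(counts)
    return [c / (n + alpha) for c in counts] + [alpha / (n + alpha)]


def merge(a: Component, b: Component) -> Component:
    """Combine two components by adding their sufficient statistics.
    Deciding WHETHER to merge is the paper's probabilistic step and is
    not modeled here."""
    return Component(a.count + b.count,
                     a.total + b.total,
                     a.total_sq + b.total_sq)


# Two nodes each create a local component for what is really one cluster.
node_a, node_b = Component(), Component()
node_a.add(1.0)
node_a.add(3.0)
node_b.add(2.0)
merged = merge(node_a, node_b)
```

Because the statistics are additive, a node only needs to send a few numbers per component rather than its raw data, which is one plausible reading of how communication cost can stay low.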