CO ML · Aug 2, 2016

Blocking Collapsed Gibbs Sampler for Latent Dirichlet Allocation Models

arXiv:1608.00945v11.2

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in large-scale text analysis applications by improving sampling efficiency for LDA models, though it is incremental as it builds on existing methods.

The authors address the inefficiency of single-site collapsed Gibbs sampling in latent Dirichlet allocation (LDA) models by introducing a blocking scheme with two simulation procedures. The blocking scheme substantially improves chain mixing, and its nested-simulation variant significantly reduces computation time when the number of topics runs into the hundreds.

The latent Dirichlet allocation (LDA) model is a widely used latent variable model in machine learning for text analysis. Inference for this model typically involves a single-site collapsed Gibbs sampling step for the latent variables associated with observations. The efficiency of the sampling is critical to the success of the model in practical large-scale applications. In this article, we introduce a blocking scheme to the collapsed Gibbs sampler for the LDA model which can, with a theoretical guarantee, improve chain mixing efficiency. We develop two procedures, an O(K)-step backward simulation and an O(log K)-step nested simulation, to directly sample the latent variables within each block. We demonstrate that the blocking scheme achieves substantial improvements in chain mixing compared to the state-of-the-art single-site collapsed Gibbs sampler. We also show that when the number of topics is in the hundreds or more, the nested-simulation blocking scheme can achieve a significant reduction in computation time compared to the single-site sampler.
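For orientation, the single-site collapsed Gibbs sampler that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of the standard textbook update, not the blocking scheme or either simulation procedure from the paper, whose details the abstract does not give. The symmetric priors `alpha` and `beta`, the count tables `n_dk`, `n_kw`, `n_k`, and the function names `init_state` and `gibbs_sweep` are all assumptions made for this sketch.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Randomly assign a topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return z, n_dk, n_kw, n_k


def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, vocab_size, rng):
    """One single-site collapsed Gibbs sweep over every token (baseline only)."""
    num_topics = len(n_k)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the token's current assignment from the count tables.
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            # Unnormalised full conditional of z_i given all other assignments,
            # assuming symmetric Dirichlet priors alpha and beta.
            weights = [
                (n_dk[d][k] + alpha) * (n_kw[k][w] + beta)
                / (n_k[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            # Draw the new topic from the discrete distribution over K topics.
            k_new = rng.choices(range(num_topics), weights=weights)[0]
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1


# Toy corpus: each document is a list of word ids in a vocabulary of size 5.
docs = [[0, 1, 2, 1], [2, 3, 3], [0, 0, 4, 1, 2]]
rng = random.Random(0)
z, n_dk, n_kw, n_k = init_state(docs, num_topics=3, vocab_size=5, rng=rng)
for _ in range(5):
    gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha=0.1, beta=0.01, vocab_size=5, rng=rng)
```

Each token update in this sketch evaluates all K weights and then draws from them, so its cost grows linearly with the number of topics, and each draw changes only one latent variable at a time. Those are the two properties the abstract's blocking scheme and O(log K) nested simulation are meant to improve on.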
