Scaling up Dynamic Topic Models
This work addresses a bottleneck for researchers and practitioners who need to analyze large-scale time-series text data, and it represents a strong incremental improvement in efficiency.
The paper tackles the scalability problem of dynamic topic models by introducing a fast and parallelizable inference algorithm that uses Gibbs Sampling with Stochastic Gradient Langevin Dynamics. The algorithm learns 1,000 topics from 2.6 million documents in under half an hour, with lower perplexity than the baselines.
Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. For posterior inference in DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Because no more scalable inference algorithm has been available, DTMs, despite their usefulness, have not been applied to capture large topic dynamics. This paper fills that gap and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for the topic assignment of each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. To our knowledge, we learn the largest Dynamic Topic Model to date: the dynamics of 1,000 topics from 2.6 million documents in less than half an hour. Our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.
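To make the contrast with batch inference concrete, the following is a minimal sketch of a generic Stochastic Gradient Langevin Dynamics update on a toy problem. It is not the paper's DTM sampler. It assumes a one-dimensional Gaussian likelihood with unit variance and a zero-mean Gaussian prior, and it uses a constant step size for simplicity. The function names `sgld_posterior_mean` and `exact_posterior_mean` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sgld_posterior_mean(data, prior_std=10.0, step=1e-4, batch_size=50,
                        n_steps=6000, burn_in=1000, seed=0):
    """Estimate the posterior mean of theta with SGLD (toy model, assumed).

    Model: x_i ~ Normal(theta, 1), theta ~ Normal(0, prior_std^2).
    Each step touches only a minibatch instead of scanning the full dataset.
    """
    rng = random.Random(seed)
    n_total = len(data)
    theta = 0.0
    kept = []
    for t in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior at theta.
        grad_prior = -theta / prior_std ** 2
        # Minibatch gradient of the log likelihood, rescaled to the full data size.
        grad_lik = (n_total / batch_size) * sum(x - theta for x in batch)
        # Langevin step: half-step along the gradient plus Gaussian noise
        # whose variance equals the step size.
        noise = rng.gauss(0.0, math.sqrt(step))
        theta += 0.5 * step * (grad_prior + grad_lik) + noise
        if t >= burn_in:
            kept.append(theta)
    return sum(kept) / len(kept)


def exact_posterior_mean(data, prior_std=10.0):
    """Closed-form posterior mean for the same conjugate toy model."""
    return sum(data) / (len(data) + 1.0 / prior_std ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
    print("SGLD estimate:", sgld_posterior_mean(data))
    print("Exact value:  ", exact_posterior_mean(data))
```

Because each update reads only `batch_size` observations, the cost per step does not grow with the dataset size. In this toy setting the averaged samples can be checked against the closed-form posterior mean, which is not available for a DTM.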