ALBU: An approximate Loopy Belief message passing algorithm for LDA to improve performance on small data sets
This work addresses aspect modeling in text analysis for researchers and practitioners working with limited data. It appears incremental, since it builds on existing variational message passing methods.
The authors address the poor performance of latent Dirichlet allocation (LDA) on small datasets by developing ALBU, an approximate loopy belief message passing algorithm. Coherence measures and Kullback-Leibler divergence indicate that ALBU learns latent distributions more accurately than variational Bayes, particularly on smaller datasets.
Variational Bayes (VB) applied to latent Dirichlet allocation (LDA) has become the most popular algorithm for aspect modeling. While sufficiently successful in text topic extraction from large corpora, VB is less successful in identifying aspects in the presence of limited data. We present a novel variational message passing algorithm applied to LDA and compare it with the gold-standard VB and with collapsed Gibbs sampling. In situations where marginalisation leads to non-conjugate messages, we use ideas from sampling to derive approximate update equations. In cases where conjugacy holds, loopy belief update (LBU), also known as the Lauritzen-Spiegelhalter algorithm, is used. Our algorithm, ALBU (approximate LBU), has strong similarities with variational message passing (VMP), the message passing variant of VB. To compare the performance of the algorithms in the presence of limited data, we use data sets consisting of tweets and newsgroups. Additionally, to perform more fine-grained evaluations and comparisons, we use simulations that enable comparisons with the ground truth via Kullback-Leibler divergence (KLD). Using coherence measures for the text corpora and KLD for the simulations, we show that ALBU learns latent distributions more accurately than VB does, especially for smaller data sets.
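The KLD comparison against ground truth can be illustrated with a minimal sketch. It assumes discrete distributions over a shared vocabulary; the function name `kl_divergence`, the smoothing constant `eps`, and the toy distributions are hypothetical and chosen only for illustration, not taken from the authors' evaluation code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p is treated as the ground-truth distribution and q as the learned one.
    eps (an assumption of this sketch) guards against division by zero
    when q assigns zero probability to an outcome that p supports.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:  # terms with p_i == 0 contribute nothing
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Toy word distributions for a single topic (illustrative values only).
true_topic = [0.5, 0.3, 0.2]
learned_close = [0.48, 0.32, 0.20]
learned_far = [0.2, 0.3, 0.5]

kld_close = kl_divergence(true_topic, learned_close)
kld_far = kl_divergence(true_topic, learned_far)
print(f"KLD (close estimate): {kld_close:.4f}")
print(f"KLD (far estimate):   {kld_far:.4f}")
```

A lower value indicates that the learned distribution is closer to the ground truth. In a full simulation study the learned topics would presumably also need to be matched to the true topics before comparison, since topic labels are arbitrary; that step is omitted here.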