LG CL IR ML

Jan 22, 2014

Parsimonious Topic Models with Salient Word Discovery

arXiv:1401.6169v2

33 citations

AI Analysis

This paper addresses inefficiencies in topic modeling for text and image data. The contribution is incremental, since it builds on existing models such as LDA.

The authors address a limitation of topic models such as LDA, which model every word topic-specifically and allow every topic in every document. They propose a parsimonious model that identifies the salient words for each topic and a sparse set of topics for each document. On three text corpora and an image dataset, the model achieves higher test set likelihood and better agreement with ground-truth class labels than LDA and a model designed to incorporate sparsity.

We propose a parsimonious topic model for text corpora. In related models such as Latent Dirichlet Allocation (LDA), all words are modeled topic-specifically, even though many words occur with similar frequencies across different topics. Our modeling determines salient words for each topic, which have topic-specific probabilities, with the rest explained by a universal shared model. Further, in LDA all topics are in principle present in every document. By contrast our model gives sparse topic representation, determining the (small) subset of relevant topics for each document. We derive a Bayesian Information Criterion (BIC), balancing model complexity and goodness of fit. Here, interestingly, we identify an effective sample size and corresponding penalty specific to each parameter type in our model. We minimize BIC to jointly determine our entire model (the topic-specific words, document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. Results on three text corpora and an image dataset show that our model achieves higher test set likelihood and better agreement with ground-truth class labels, compared to LDA and to a model designed to incorporate sparsity.
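The two ideas in the abstract, salient words backed by a shared model and a BIC cost with a separate penalty per parameter type, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rescaling rule for non-salient words, and the simple penalty form (k_j/2) ln(n_j) are assumptions made for illustration, and the paper's actual criterion may contain further terms.

```python
import math


def topic_word_distribution(shared, salient):
    """Build one topic's word distribution (hypothetical helper).

    shared:  dict word -> probability under the shared model (sums to 1)
    salient: dict word -> topic-specific probability (total mass < 1)

    Salient words use their topic-specific probability. All other words
    keep their shared probability, rescaled so the result sums to 1.
    The rescaling rule is an assumption of this sketch.
    """
    salient_mass = sum(salient.values())
    shared_rest = sum(p for w, p in shared.items() if w not in salient)
    if not 0.0 <= salient_mass < 1.0 or shared_rest <= 0.0:
        raise ValueError("salient mass must be in [0, 1) with shared mass left over")
    scale = (1.0 - salient_mass) / shared_rest
    return {w: salient[w] if w in salient else p * scale
            for w, p in shared.items()}


def bic_cost(log_likelihood, param_counts, sample_sizes):
    """BIC-style cost with one penalty per parameter type.

    Computes sum_j (k_j / 2) * ln(n_j) - ln L, where k_j is the number of
    free parameters of type j and n_j is the sample size assumed to be
    effective for that type. Lower is better.
    """
    penalty = sum(0.5 * k * math.log(n)
                  for k, n in zip(param_counts, sample_sizes))
    return penalty - log_likelihood


# Toy usage: "gene" is the only salient word for this topic.
shared_model = {"the": 0.5, "of": 0.3, "gene": 0.1, "cell": 0.1}
topic = topic_word_distribution(shared_model, {"gene": 0.4})
```

In this toy example only one topic-specific parameter is needed for the topic. The remaining words are explained by the shared model, which is the kind of saving a BIC penalty rewards.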
