IR, CL, LG, ML | Jun 17, 2019

Analyses of Multi-collection Corpora via Compound Topic Modeling

arXiv:1907.01636v1
Originality: Incremental advance
AI Analysis

This paper addresses a deficiency in topic modeling for text mining by enabling analysis of partitioned corpora, though the contribution appears incremental because it builds on existing LDA methods.

The paper tackles comparative text analysis across multiple collections by proposing the compound latent Dirichlet allocation (cLDA) model. cLDA improves on previous work by exploring topics automatically and by modeling their connections and variations across collections, as demonstrated in usability studies on real-world corpora.

As electronically stored data grow in daily life, obtaining novel and relevant information becomes a challenge for text mining. People have therefore sought statistical methods for text mining based on term frequency, matrix algebra, or topic modeling. Popular topic models center on a single text collection, which is insufficient for comparative text analyses. We consider a setting in which the corpus can be partitioned into subcollections. The subcollections share a common set of topics, but the topic proportions vary from one collection to another. To incorporate any prior knowledge about the corpus (e.g., organization structure), we propose the compound latent Dirichlet allocation (cLDA) model, which improves on previous work, encourages generalizability, and depends less on user-input parameters. To identify the parameters of interest in cLDA, we study Markov chain Monte Carlo (MCMC) and variational inference approaches extensively, and suggest an efficient MCMC method. We evaluate cLDA qualitatively and quantitatively using both synthetic and real-world corpora. The usability study on real-world corpora illustrates that cLDA not only explores the underlying topics automatically but also models their connections and variations across multiple collections.
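The setting in the abstract (topics shared by all subcollections, with topic proportions that vary by collection) can be illustrated with a small generative sketch. The abstract does not spell out the generative process, so everything here is an assumption made for illustration: the two-level Dirichlet hierarchy, the hyperparameter names `alpha`, `gamma`, and `eta`, and the function name `sample_compound_corpus` are hypothetical. This is not the authors' implementation, and it does not include their inference method.

```python
import numpy as np


def sample_compound_corpus(n_collections=3, docs_per_collection=4,
                           doc_length=50, n_topics=5, vocab_size=30,
                           alpha=2.0, gamma=20.0, eta=0.5, seed=0):
    """Sample a toy multi-collection corpus (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Topics shared by every collection: one word distribution per topic.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    # Assumed: each collection has its own topic-proportion vector.
    collection_props = rng.dirichlet(np.full(n_topics, alpha),
                                     size=n_collections)

    corpus = []
    for j in range(n_collections):
        docs = []
        for _ in range(docs_per_collection):
            # Assumed: document proportions are centered on the collection's
            # proportions; larger gamma keeps documents closer to that center.
            theta = rng.dirichlet(gamma * collection_props[j])
            # Draw a topic for each word position, then a word from that topic.
            z = rng.choice(n_topics, size=doc_length, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            docs.append(words)
        corpus.append(docs)
    return topics, collection_props, corpus


if __name__ == "__main__":
    topics, collection_props, corpus = sample_compound_corpus()
    # Compare how strongly each collection weights each shared topic.
    print(np.round(collection_props, 2))
```

Under these assumptions, comparing the rows of `collection_props` is one way to read off how the same topics are weighted differently across collections, which is the kind of comparative question the abstract targets.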
