CL LG ML Dec 28, 2022

Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria

arXiv:2212.14074v2, 17 citations, h-index: 31
Originality: Synthesis-oriented
AI Analysis

This work addresses a task in topic modeling that researchers and practitioners find difficult, but its contribution is incremental: it focuses on comparing and refining existing model selection criteria.

The paper tackles the problem of selecting the number of topics in LDA models by evaluating the singular Bayesian information criterion (sBIC) against alternative criteria in Monte Carlo simulations across various settings. Performance is measured by whether the correct number of topics is selected and whether the relevant topics from the data-generating processes are identified. The results include practical recommendations for applications.

Selecting the number of topics in LDA models is considered a difficult task, for which several alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to that of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be applied to singular statistical models. The comparison is based on Monte Carlo simulations and is carried out for several alternative settings, which vary with respect to the number of topics, the number of documents, and the size of documents in the corpora. Performance is measured using different criteria, which take into account not only whether the correct number of topics is selected but also whether the relevant topics from the data-generating processes (DGPs) are identified. Practical recommendations for LDA model selection in applications are derived.
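
As a rough illustration of the simulation design described in the abstract, the following Python sketch draws synthetic corpora from an LDA data-generating process while varying the number of topics, the number of documents, and the document length. The function names (`sample_dirichlet`, `simulate_lda_corpus`), the symmetric Dirichlet parameters `alpha` and `eta`, the vocabulary size, and the grid values are illustrative assumptions, not the settings used in the paper. Fitting LDA and computing sBIC or any other criterion is deliberately left out.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one symmetric Dirichlet(alpha) vector via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_lda_corpus(n_topics, n_docs, doc_length, vocab_size,
                        alpha=0.5, eta=0.1, seed=0):
    """Simulate a corpus from an LDA DGP (all defaults are assumptions).

    Returns the true topic-word distributions and the documents,
    each document being a list of integer word ids.
    """
    rng = random.Random(seed)
    # One word distribution per topic: the "true" topics of the DGP.
    topics = [sample_dirichlet(eta, vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Document-specific topic proportions.
        theta = sample_dirichlet(alpha, n_topics, rng)
        # Draw a topic for every token, then a word from that topic.
        assignments = rng.choices(range(n_topics), weights=theta, k=doc_length)
        words = [
            rng.choices(range(vocab_size), weights=topics[k], k=1)[0]
            for k in assignments
        ]
        corpus.append(words)
    return topics, corpus


# Hypothetical settings grid: number of topics x documents x document size.
settings = [
    (k, d, n)
    for k in (2, 5)
    for d in (20, 50)
    for n in (30, 60)
]
simulated = {
    (k, d, n): simulate_lda_corpus(k, d, n, vocab_size=100, seed=42)
    for (k, d, n) in settings
}
```

In a full Monte Carlo study, each setting would be replicated many times with different seeds, candidate LDA models with different numbers of topics would be fitted to every simulated corpus, and each selection criterion would be scored on how often it recovers the true number of topics and the true topics returned by the simulator.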

Code Implementations: 1 repo