CL · Aug 24, 2018

Measuring LDA Topic Stability from Clusters of Replicated Runs

arXiv:1808.08098v1 · 55 citations
AI Analysis

This paper addresses the issue of unreplicable conclusions from non-deterministic algorithms like LDA in software engineering text analysis, though the contribution is incremental: it complements rather than replaces prior work on LDA parameter tuning.

The authors tackle instability in Latent Dirichlet Allocation (LDA) topic modeling, which can cause systematic errors. They propose a method that uses replicated LDA runs and clustering to measure topic stability, with an initial validation on 270,000 Mozilla Firefox commit messages showing how the stability metrics relate to topic contents.

Background: Unstructured and textual data is increasing rapidly and Latent Dirichlet Allocation (LDA) topic modeling is a popular data analysis method for it. Past work suggests that instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs, clustering, and providing a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times resulting in n*k topics. Then we use K-medoids to cluster the n*k topics to k clusters. The k clusters now represent the original LDA topics and we present them like normal LDA topics showing the ten most probable words. For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, showing the stability of the topics inside the clusters. Results: We provide an initial validation where our method is used for 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and is also complementary rather than alternative to many prior works that focus on LDA parameter tuning.
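The stability step in the abstract can be illustrated with a minimal sketch. It assumes each topic is represented by its ranked list of most probable words, and that a cluster's stability is the mean pairwise Rank-Biased Overlap (RBO) of the topics assigned to it. The abstract does not state the exact RBO variant, the persistence parameter, or the aggregation, so the extrapolated form for equal-length lists, `p=0.9`, the function names `rbo` and `cluster_stability`, and the toy word lists are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def rbo(list_a, list_b, p=0.9):
    """Extrapolated Rank-Biased Overlap of two ranked lists.

    Assumes no duplicate items within a list; compares only the first
    min(len(list_a), len(list_b)) ranks. p in (0, 1) controls how
    strongly the top ranks dominate (value chosen here is an assumption).
    """
    depth = min(len(list_a), len(list_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two prefixes
    weighted_sum = 0.0   # sum over d of (overlap_d / d) * p**d
    for d in range(1, depth + 1):
        a, b = list_a[d - 1], list_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other list already contains it.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        weighted_sum += (overlap / d) * p ** d
    # Extrapolate the agreement at the final depth to the unseen tail.
    return (overlap / depth) * p ** depth + (1 - p) / p * weighted_sum


def cluster_stability(topics, p=0.9):
    """Mean pairwise RBO of the topics in one cluster (an assumed aggregation)."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 1.0  # a single topic is trivially consistent with itself
    return sum(rbo(a, b, p) for a, b in pairs) / len(pairs)


# Toy data (hypothetical, not from the paper): top words of topics
# that replicated runs placed in the same cluster.
stable_cluster = [
    ["bug", "fix", "crash", "test", "build"],
    ["bug", "fix", "test", "crash", "build"],
    ["fix", "bug", "crash", "test", "build"],
]
unstable_cluster = [
    ["bug", "fix", "crash", "test", "build"],
    ["merge", "branch", "bug", "backout", "central"],
    ["css", "layout", "style", "fix", "frame"],
]

print(cluster_stability(stable_cluster))
print(cluster_stability(unstable_cluster))
```

A cluster whose replicated topics keep the same words in roughly the same order scores close to 1, while a cluster that merely collects leftover topics scores close to 0. In the paper's setting the lists would be the ten most probable words of each of the n*k topics, grouped by their K-medoids cluster.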

Code Implementations: 1 repo
