LG · May 22, 2025

Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics

arXiv:2505.16493v1 · 2 citations · h-index: 3 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of discovering minority topics in text data, a problem relevant to NLP researchers and practitioners. The advance is incremental, since it builds on existing constrained NMF methods.

The paper tackles the failure of topic models to capture low-prevalence, domain-critical themes, such as mental health in online comments. It proposes a constrained non-negative matrix factorization (NMF) method that uses a seed word list without requiring the division of seed words across topics to be specified in advance. On synthetic data, the method outperforms baselines on topic purity and normalized mutual information, and a case study on YouTube vlog comments shows that it identifies minority topics.
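The evaluation metrics named above can be illustrated with a short sketch. It assumes the standard clustering definition of purity and a base-2 Jensen-Shannon divergence; the paper's exact definitions may differ, and the function names `purity` and `js_divergence` are hypothetical. Normalized mutual information is omitted here; scikit-learn provides it as `normalized_mutual_info_score`.

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Fraction of items that share the majority true label of their predicted cluster.

    Assumes labels are non-negative integers (required by np.bincount).
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()  # size of the majority class in cluster c
    return total / len(true_labels)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two non-negative weight vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize to probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # terms with a == 0 contribute nothing; b > 0 wherever a > 0
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes values comparable across topic pairs.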

Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity and normalized mutual information, and we also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
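To make the idea of multiplicative updates concrete, here is a minimal sketch of standard Frobenius-norm NMF updates in NumPy. It is not the paper's algorithm: the paper's prevalence and seed-word constraints are not reproduced. As a simplified stand-in, the hypothetical `mask` argument pins seed-word weights to zero in the majority topics, which works because multiplicative updates keep zero entries at zero. The function name `masked_nmf`, the toy data, and the choice of seed-word indices are all assumptions made for illustration.

```python
import numpy as np

def masked_nmf(X, k, mask=None, n_iter=300, eps=1e-10, seed=0):
    """Factor X (docs x words) as W @ H with W, H >= 0 using multiplicative updates.

    mask (k x words, entries 0 or 1) is an illustrative assumption: entries of H
    where mask == 0 start at zero and therefore stay at zero.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_words)) + eps
    if mask is not None:
        H *= mask
    for _ in range(n_iter):
        # eps in the denominators avoids division by zero
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix: 40 documents, 12 words (synthetic, for illustration only).
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(40, 12)).astype(float)

seed_words = [0, 1, 2]   # hypothetical seed-word column indices
k, n_minority = 4, 2     # first 2 topics are treated as minority topics

# Majority topics (rows n_minority..k-1) may not place weight on seed words.
mask = np.ones((k, X.shape[1]))
mask[n_minority:, seed_words] = 0.0

W, H = masked_nmf(X, k, mask)
```

In this sketch the rows of `H` are topics over words and the rows of `W` are per-document topic weights; only the first `n_minority` topics can carry seed-word mass.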

Code Implementations: 1 repo
