LG, CL | Mar 31

PRISM: PRIor from corpus Statistics for topic Modeling

arXiv:2603.29406 | 55.6 | h-index: 2 | Has Code

AI Analysis

This work addresses the challenge of applying topic modeling to emerging or underexplored domains where external knowledge is unavailable, though it is incremental in that it builds on the existing LDA framework.

The paper tackles the problem of topic modeling in resource-constrained settings by introducing PRISM, a method that initializes LDA using corpus-intrinsic statistics, and shows it improves topic coherence and interpretability, rivaling models that rely on external knowledge.

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
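
To make the idea of a corpus-derived Dirichlet parameter concrete, here is a minimal sketch. It is not the PRISM formula, which the abstract does not state. It assumes a simple stand-in: each word's pseudo-count grows with its share of document-level co-occurrences. The function name `cooccurrence_prior` and the parameters `base` and `scale` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_prior(docs, base=0.01, scale=1.0):
    """Hypothetical helper: derive a per-word Dirichlet parameter from
    document-level co-occurrence counts. An illustrative assumption,
    NOT the formula used by PRISM."""
    vocab = sorted({w for doc in docs for w in doc})

    # For each word, count the (document, partner word) pairs it takes part in.
    degree = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            degree[a] += 1
            degree[b] += 1

    total = sum(degree.values())
    # Words with a larger share of co-occurrences get a larger pseudo-count.
    # `base` keeps every entry strictly positive, as a Dirichlet requires.
    prior = {
        w: base + (scale * degree[w] / total if total else 0.0)
        for w in vocab
    }
    return vocab, prior


# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["gene", "cell", "expression"],
    ["cell", "expression", "cluster"],
    ["topic", "model", "prior"],
]
vocab, eta = cooccurrence_prior(docs)
```

A vector like `eta`, ordered by `vocab`, could then be supplied to an LDA implementation that accepts a per-word topic-word prior (for example, the `eta` argument of gensim's `LdaModel`), which leaves the generative process itself unchanged.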

