CL · IR · LG · Oct 15, 2018

Improving Topic Models with Latent Feature Word Representations

arXiv:1810.06306v11154 citations
Originality: Incremental advance
AI Analysis

This work is aimed at researchers and practitioners who fit topic models to limited or sparse text data. It is an incremental improvement over existing methods.

The authors improve topic models by incorporating latent feature word representations trained on large external corpora. They report significant improvements in topic coherence, document clustering, and document classification, particularly for datasets with few or short documents.

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
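
The abstract does not spell out how the word vectors enter the model, so the following is only a minimal sketch of one way the idea could work, not the authors' implementation. It assumes that each topic's word distribution is a two-component mixture: a smoothed Dirichlet multinomial estimate from the small corpus, and a softmax over dot products between a topic vector and pre-trained word vectors. The function names, the mixture weight `lam`, the smoothing value `beta`, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_feature_word_dist(topic_vec, word_vecs):
    """Word distribution from dot products of a topic vector with word vectors."""
    scores = [sum(t * w for t, w in zip(topic_vec, vec)) for vec in word_vecs]
    return softmax(scores)


def mixed_word_dist(topic_word_counts, beta, topic_vec, word_vecs, lam):
    """Mix a smoothed count-based distribution with a vector-based one.

    lam = 0 gives the plain Dirichlet multinomial estimate;
    lam = 1 relies only on the word vectors (assumed mixture form).
    """
    vocab_size = len(topic_word_counts)
    total = sum(topic_word_counts) + vocab_size * beta
    dirichlet_part = [(c + beta) / total for c in topic_word_counts]
    feature_part = latent_feature_word_dist(topic_vec, word_vecs)
    return [(1 - lam) * d + lam * f
            for d, f in zip(dirichlet_part, feature_part)]


# Toy example: "dog" never occurs in the small corpus, but its (made-up)
# vector is close to "cat", so the vector component assigns it some mass.
vocab = ["cat", "dog", "stock"]
word_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical embeddings
counts = [5, 0, 0]                                # topic-word counts
topic_vec = [2.0, 0.0]                            # hypothetical topic vector

counts_only = mixed_word_dist(counts, 0.01, topic_vec, word_vecs, lam=0.0)
mixed = mixed_word_dist(counts, 0.01, topic_vec, word_vecs, lam=0.6)
for word, p_counts, p_mixed in zip(vocab, counts_only, mixed):
    print(f"{word}: counts only {p_counts:.3f}, mixed {p_mixed:.3f}")
```

Under these assumptions, a word that is unseen in the small corpus but close to the topic in vector space receives more probability than an equally unseen, unrelated word. That is the kind of effect the abstract attributes to using information from external corpora when documents are few or short.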

