CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural Topic Modeling
This work aims to improve topic modeling for researchers and practitioners through better word sense disambiguation and handling of unseen words; the contribution is incremental, as it builds on existing BERT embeddings.
The paper tackles the limitations of bag-of-words representations in topic modeling by introducing CWTM, a neural topic model that integrates contextualized word embeddings from BERT, resulting in more coherent topics and the ability to handle out-of-vocabulary words in new documents.
Most existing topic models rely on bag-of-words (BOW) representations, which cannot capture word order and struggle with out-of-vocabulary (OOV) words in new documents. Contextualized word embeddings, by contrast, perform well at word sense disambiguation and effectively address the OOV issue. In this work, we introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM), which integrates contextualized word embeddings from BERT. The model can learn the topic vector of a document without BOW information. It can also derive topic vectors for individual words within a document from their contextualized word embeddings. Experiments across various datasets show that CWTM generates more coherent and meaningful topics than existing topic models, while also accommodating unseen words in newly encountered documents.
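The two BOW limitations named in the abstract can be illustrated with a minimal sketch. This is not part of CWTM; the helper names `build_vocab` and `bow_vector` and the toy sentences are hypothetical, chosen only to show that a count-based representation discards word order and silently drops words absent from the training vocabulary.

```python
from collections import Counter


def build_vocab(docs):
    """Collect the sorted set of tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def bow_vector(doc, vocab):
    """Count vocabulary tokens in doc; tokens outside vocab are dropped."""
    counts = Counter(doc.lower().split())
    return [counts[tok] for tok in vocab]


# Toy training corpus (illustrative only).
train_docs = ["the dog bit the man", "the man bit the dog"]
vocab = build_vocab(train_docs)

# Limitation 1: two sentences with different meanings get the same vector,
# because BOW keeps counts but not word order.
vec_a = bow_vector(train_docs[0], vocab)
vec_b = bow_vector(train_docs[1], vocab)
same_bow = vec_a == vec_b

# Limitation 2: a word unseen at training time has no slot in the vector,
# so it contributes nothing to the representation of the new document.
new_doc = "the chatbot bit the man"
vec_new = bow_vector(new_doc, vocab)
oov_tokens = [tok for tok in new_doc.lower().split() if tok not in vocab]

print(vocab, vec_a, vec_b, same_bow)
print(vec_new, oov_tokens)
```

A model that works from subword-based contextualized embeddings, as the abstract describes for CWTM, would still produce a vector for the unseen token, which is the motivation stated above.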