Experiments on Generalizability of BERTopic on Multi-Domain Short Text
This work addresses topic modeling for short texts across domains. The contribution is incremental: it modifies an existing method to handle one specific issue.
The study evaluates BERTopic on short multi-domain texts and finds that it generalizes better than LDA, with higher topic coherence and diversity. It also finds that HDBSCAN, the clustering algorithm used by BERTopic, classifies most documents as outliers. Replacing HDBSCAN with k-Means achieves similar performance without outliers.
Topic modeling is widely used for analytically evaluating large collections of textual data. One of the most popular topic modeling techniques is Latent Dirichlet Allocation (LDA), which is flexible and adaptive, but not optimal for, e.g., short texts from various domains. We explore how the state-of-the-art BERTopic algorithm performs on short multi-domain text and find that it generalizes better than LDA in terms of topic coherence and diversity. We further analyze the performance of the HDBSCAN clustering algorithm utilized by BERTopic and find that it classifies a majority of the documents as outliers. This crucial, yet overlooked, problem excludes too many documents from further analysis. When we replace HDBSCAN with k-Means, we achieve similar performance, but without outliers.
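The two quantities discussed above, the share of documents flagged as outliers and topic diversity, can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: outliers are assumed to carry the label -1 (the convention HDBSCAN uses for noise points), topic diversity is assumed to be the fraction of unique words among the top words of all topics, and the label lists and toy topics are invented for illustration only. The function names outlier_share and topic_diversity are hypothetical helpers, not part of BERTopic.

```python
def outlier_share(labels, outlier_label=-1):
    """Return the fraction of documents assigned to the outlier label.

    Assumption: outliers are marked with -1, as HDBSCAN does for noise.
    """
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == outlier_label) / len(labels)


def topic_diversity(topics, top_n=10):
    """Return the share of unique words among the top_n words of all topics.

    Assumption: this is one common definition of topic diversity; the
    paper may compute it differently.
    """
    words = [word for topic in topics for word in topic[:top_n]]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Invented cluster assignments for ten documents (illustration only).
hdbscan_like = [-1, -1, 0, -1, 1, -1, -1, 0, -1, 1]  # density-based: many -1
kmeans_like = [0, 2, 0, 1, 1, 2, 0, 0, 2, 1]         # k-Means: every doc assigned

# Invented top words for two topics; "score" appears in both.
toy_topics = [["game", "team", "score"], ["bank", "loan", "score"]]

print(outlier_share(hdbscan_like))           # 6 of 10 documents are outliers
print(outlier_share(kmeans_like))            # no document is left out
print(topic_diversity(toy_topics, top_n=3))  # 5 unique words out of 6
```

Because k-Means assigns every document to some cluster, its outlier share is zero by construction. Whether the resulting topics remain comparably coherent and diverse is the empirical question the study examines.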