Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
The paper addresses reliability issues in clustering-based topic modelling and offers a more robust method, though the contribution appears incremental because it builds on prior clustering topic models.
The paper tackles the unreliability of existing clustering topic models such as Top2Vec and BERTopic at discovering natural clusters and estimating term importance. It introduces Topeax, which improves cluster recovery and cluster description while being less sensitive to sample size and hyperparameters.
Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite the apparent success of clustering topic models, I identify a number of issues in Top2Vec and BERTopic that remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, owing to their extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results, on the one hand, in less coherent topics due to the presence of stop words and junk words, and, on the other, in a lack of variety and trust. In this paper, I introduce a new approach, \textbf{Topeax}, which discovers the number of clusters from peaks in density estimates and combines lexical and semantic indices of term importance to obtain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.
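The two ideas named in the abstract can be illustrated with a small sketch. This is not the Topeax implementation, as the abstract does not give the formulas. It assumes a one-dimensional Gaussian kernel density estimate on a fixed grid, a count-share lexical index, a clipped cosine similarity as the semantic index, and a geometric mean to combine them. The function names `count_density_peaks` and `combined_term_importance`, and the parameters `bandwidth` and `grid_size`, are hypothetical.

```python
import numpy as np


def count_density_peaks(points, bandwidth=0.5, grid_size=200):
    """Count local maxima of a 1-D Gaussian kernel density estimate.

    Assumption: peaks are found on a regular grid; the paper may use a
    different estimator, dimensionality, or peak criterion.
    """
    points = np.asarray(points, dtype=float)
    grid = np.linspace(points.min() - 3 * bandwidth,
                       points.max() + 3 * bandwidth, grid_size)
    # Unnormalised density: sum of Gaussian kernels centred on each point.
    diffs = (grid[:, None] - points[None, :]) / bandwidth
    density = np.exp(-0.5 * diffs ** 2).sum(axis=1)
    # An interior grid cell is a peak if it is strictly above both neighbours.
    interior = density[1:-1]
    is_peak = (interior > density[:-2]) & (interior > density[2:])
    return int(is_peak.sum()), grid[1:-1][is_peak]


def combined_term_importance(cluster_counts, corpus_counts,
                             term_vectors, topic_vector):
    """Combine a lexical and a semantic index into one score per term.

    Assumption: the geometric mean is an illustrative choice, not
    necessarily the combination used by Topeax.
    """
    cluster_counts = np.asarray(cluster_counts, dtype=float)
    corpus_counts = np.asarray(corpus_counts, dtype=float)
    term_vectors = np.asarray(term_vectors, dtype=float)
    topic_vector = np.asarray(topic_vector, dtype=float)
    # Lexical index: share of a term's corpus occurrences inside the cluster.
    lexical = cluster_counts / np.maximum(corpus_counts, 1.0)
    # Semantic index: cosine similarity to the topic vector, clipped at zero.
    norms = np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(topic_vector)
    cosine = term_vectors @ topic_vector / np.maximum(norms, 1e-12)
    semantic = np.clip(cosine, 0.0, None)
    # A term needs both lexical and semantic support to score highly.
    return np.sqrt(lexical * semantic)
```

With a multiplicative combination, a stop word that is frequent everywhere is penalised by the lexical index, and a junk word that is semantically unrelated to the topic is penalised by the semantic index, which mirrors the two failure modes the abstract attributes to BERTopic and Top2Vec.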