Graph2topic: an opensource topic modeling framework based on sentence embedding and community detection
The work addresses the need for more robust topic modeling tools for researchers and practitioners, though the contribution appears incremental because it builds on existing embedding and clustering methods.
The authors tackled the problem of parameter selection and incomplete modeling in clustering-based topic models by proposing Graph2topic (G2T), a framework using sentence embeddings and community detection, which achieved state-of-the-art performance on English and Chinese documents.
It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings with an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches struggle with selecting appropriate parameters, and they are incomplete as models because they overlook the quantitative relations between words and topics and between topics and texts. To solve these issues, we propose graph to topic (G2T), a simple but effective framework for topic modeling. The framework is composed of four modules. First, document representations are obtained using pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in the document semantic graph are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word-topic distribution is computed based on a variant of TF-IDF. Automatic evaluation suggests that G2T achieves state-of-the-art performance on both English and Chinese documents of different lengths.
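The four modules can be illustrated with a minimal, standard-library-only sketch. It is not the G2T implementation, and everything in it is an assumption made for illustration: the function names, the similarity threshold of 0.8, and the hand-made two-dimensional vectors that stand in for embeddings from a pretrained language model. Connected components are swapped in as the simplest possible stand-in for a community detection algorithm, and the word scoring is one common class-based TF-IDF form, not necessarily the variant used by G2T.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(embeddings, threshold=0.8):
    """Module 2 (sketch): link documents whose similarity reaches the threshold."""
    adjacency = {i: set() for i in range(len(embeddings))}
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def find_communities(adjacency):
    """Module 3 (stand-in): connected components instead of a real
    community detection algorithm."""
    seen, communities = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        communities.append(sorted(members))
    return communities


def topic_words(docs, communities, top_k=3):
    """Module 4 (assumed variant): term frequency within a community,
    weighted by how few communities contain the word."""
    counts = [
        Counter(w for i in members for w in docs[i].lower().split())
        for members in communities
    ]
    community_freq = Counter(w for c in counts for w in c)
    n_topics = len(communities)
    topics = []
    for c in counts:
        total = sum(c.values())
        scores = {
            w: (tf / total) * math.log(1 + n_topics / community_freq[w])
            for w, tf in c.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        topics.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return topics


docs = [
    "cats purr and cats sleep",
    "dogs bark and dogs run",
    "cats chase mice",
    "dogs fetch sticks",
]
# Module 1 (placeholder): toy vectors in place of real sentence embeddings.
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

communities = find_communities(build_graph(embeddings))
topics = topic_words(docs, communities)
# Each document belongs to exactly one community in this simplified version.
doc_topic = {i: t for t, members in enumerate(communities) for i in members}
```

In this toy run the two "cats" documents and the two "dogs" documents fall into separate communities, and the shared word "and" is down-weighted because it occurs in both. A real pipeline would replace the toy vectors with sentence embeddings and the connected-components step with a proper community detection method.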