BERTopic: Neural topic modeling with a class-based TF-IDF procedure
This work addresses the need for more coherent and effective topic modeling in natural language processing. The proposed method is competitive with existing models, though it is an incremental extension of clustering-based approaches.
The authors tackled the problem of discovering latent topics in document collections by proposing BERTopic, a neural topic model that uses a class-based TF-IDF procedure to extract coherent topic representations, achieving competitive performance across various benchmarks.
Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
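The final step of the pipeline described above, class-based TF-IDF, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embedding and clustering steps have already produced one cluster label per document (the labels below are assigned by hand), and it assumes the weighting tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters. The abstract does not spell out this formula. The function names `class_tfidf` and `top_terms` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels):
    """Score terms per cluster as tf(t, c) * log(1 + A / f(t)) (assumed form)."""
    # Treat all documents in a cluster as one concatenated bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # f(t): frequency of term t summed over all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, n=3):
    """Return the n highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        label: [
            term
            for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        ]
        for label, s in scores.items()
    }


docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the stock market fell",
    "the market rallied on stock news",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments, not real model output
scores = class_tfidf(docs, labels)
print(top_terms(scores, n=2))
```

Because documents are pooled per cluster before weighting, a term scores highly when it is frequent inside one cluster and rare across the collection as a whole. On a corpus this small, very common words can still rank well, so a real pipeline would typically also filter stop words.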