CL · AI · Apr 10

A comparative study of transformer-based embeddings for topic coherence

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

arXiv:2605.28832 · 5.2 · h-index: 6

AI Analysis

For practitioners in topic modeling, this work provides evidence that smaller, more efficient transformer models can be used without sacrificing topic quality, reducing computational costs.

This study examines the effect of transformer model size on topic quality in BERTopic pipelines, finding that model size (22M to 13B parameters) has negligible impact on topic coherence and divergence, with smaller models achieving comparable performance to larger ones.

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that model size (in number of parameters) has a significant impact on the performance of language models on various predefined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve performance comparable to that of larger models.
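To make the evaluation step concrete, here is a minimal sketch of how a topic's top words might be scored. It assumes NPMI coherence with document-level co-occurrence (one of the measures covered by Röder et al., 2015) and uses a simple unique-word ratio as a stand-in for divergence. The abstract does not say which exact variants the authors use, so the function names `npmi_coherence` and `topic_diversity` and the toy corpus are illustrative assumptions, not the paper's implementation.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    Assumption: probabilities are estimated from document-level
    co-occurrence (a word "occurs" if it appears anywhere in a document).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: NPMI lower bound
        elif p12 == 1.0:
            scores.append(0.0)   # both words in every document: defined as 0 here
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


def topic_diversity(topics):
    """Share of unique words among all topics' top words (illustrative
    stand-in for a divergence measure; 1.0 means no overlap)."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)


# Toy tokenized corpus (hypothetical data, for illustration only).
docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
coherent = npmi_coherence(["cat", "dog"], docs)
incoherent = npmi_coherence(["cat", "car"], docs)
diversity = topic_diversity([["cat", "dog"], ["cat", "road"]])
```

In a comparison like the one described, the same scoring functions would be applied to the topics produced by each embedding model, so that differences in the scores reflect the embeddings rather than the metric.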
