CL AI · Feb 5, 2024

Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian

Darija Medvecki, Bojana Bašaragin, Adela Ljajić, Nikola Milošević

arXiv:2402.03067v14.821 citations · h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses topic modeling for morphologically rich, low-resource languages such as Serbian and offers insights for researchers working on similar languages. Its contribution is incremental, since it adapts an existing method to a new language context.

The paper applies BERTopic to short Serbian text and shows that it yields informative topics, with only a minimal performance drop on partially preprocessed data, and that it outperforms LDA and NMF in topic quality when the number of topics is not limited.

This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for researchers working with other morphologically rich low-resource languages and short text.
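As a rough illustration of the setup the abstract describes, the sketch below fits BERTopic on lightly cleaned tweets with a multilingual sentence embedding model. It is not the authors' pipeline. The cleanup steps in `partially_preprocess`, the model name `paraphrase-multilingual-MiniLM-L12-v2`, and the `min_topic_size` value are all assumptions chosen for the example, since the abstract names neither the three embedding models nor the exact preprocessing steps.

```python
import re

# Hypothetical "partial" preprocessing: the abstract does not list the
# actual steps, so this only strips URLs and @mentions as an assumption.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def partially_preprocess(tweet: str) -> str:
    """Drop URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def fit_topics(docs, embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
               min_topic_size=10):
    """Fit BERTopic without limiting the number of topics.

    The embedding model name and min_topic_size are illustrative
    assumptions, not values taken from the paper.
    """
    # Imported lazily so the preprocessing helper works without bertopic.
    from bertopic import BERTopic

    topic_model = BERTopic(
        embedding_model=embedding_model,  # sentence-transformers model name
        min_topic_size=min_topic_size,
        nr_topics=None,  # no topic reduction: the number of topics is not limited
    )
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

With a list of tweets in hand, one would call `fit_topics([partially_preprocess(t) for t in tweets])` and then inspect the topic keywords with `topic_model.get_topic_info()`.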
