An Analysis of Lemmatization on Topic Models of Morphologically Rich Language
This work addresses topic model interpretability for researchers working with morphologically rich languages. The contribution is incremental: it builds on prior work and calls for further investigation.
The study examines the effect of lemmatization on topic models of a morphologically rich language, using Russian Wikipedia articles as the corpus, and finds that in one configuration lemmatization significantly improves interpretability according to a word intrusion metric.
Topic models are typically represented by top-$m$ word lists for human interpretation. The corpus is often pre-processed with lemmatization (or stemming) so that those representations are not undermined by a proliferation of words with similar meanings, but there is little public work on the effects of that pre-processing. Recent work studied the effect of stemming on topic models of English texts and found no supporting evidence for the practice. We study the effect of lemmatization on topic models of Russian Wikipedia articles, finding in one configuration that it significantly improves interpretability according to a word intrusion metric. We conclude that lemmatization may benefit topic models on morphologically rich languages, but that further investigation is needed.
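To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a tiny hand-written lemma table (`LEMMAS`, hypothetical; a real setup would use a morphological analyzer), uses raw token counts as a stand-in for a fitted topic's word weights, and simplifies word intrusion to "top-$m$ words plus one word that ranks highly in another topic", scored as the fraction of annotators who pick the intruder. All names and data are illustrative.

```python
from collections import Counter

# Hypothetical toy lemma table; a real pipeline would use a morphological analyzer.
LEMMAS = {
    "книги": "книга", "книгу": "книга",
    "авторы": "автор",
}

def lemmatize(tokens):
    """Map each token to its lemma; tokens missing from the table pass through."""
    return [LEMMAS.get(t, t) for t in tokens]

def top_m(weights, m):
    """Return the m highest-weighted words, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:m]]

def intrusion_item(topic_weights, other_weights, m):
    """Build one intrusion item: the topic's top-m words plus an intruder.

    Simplifying assumption: the intruder is the highest-ranked word of another
    topic that does not appear in this topic's top-m list.
    """
    top = top_m(topic_weights, m)
    ranked_other = top_m(other_weights, len(other_weights))
    intruder = next(w for w in ranked_other if w not in top)
    return top, intruder

def model_precision(choices, intruder):
    """Fraction of annotator choices that correctly identify the intruder."""
    return sum(c == intruder for c in choices) / len(choices)

# Toy tokens standing in for the words assigned to one topic (made-up data).
tokens = (["книга"] * 3 + ["книги"] * 3 + ["книгу"] * 2 +
          ["автор"] * 2 + ["авторы"] * 2 + ["роман"] * 2 + ["издание"])

raw_weights = Counter(tokens)               # inflected forms compete for slots
lem_weights = Counter(lemmatize(tokens))    # forms collapsed onto their lemmas

raw_top = top_m(raw_weights, 3)
lem_top = top_m(lem_weights, 3)

# A second made-up topic that supplies the intruder word.
other_weights = {"река": 5, "озеро": 4, "книга": 1}
shown, intruder = intrusion_item(lem_weights, other_weights, 3)

# Made-up annotator answers for the item above.
precision = model_precision(["река", "река", "роман", "река"], intruder)

print(raw_top)
print(lem_top)
print(shown, intruder, precision)
```

In this toy setting the unlemmatized top-3 list spends two of its three slots on forms of the same noun, while the lemmatized list shows three distinct lemmas. This is the proliferation effect the abstract describes, and it says nothing about whether the effect carries over to real data.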