CL, LG · Feb 4, 2025

Topic Modeling in Marathi

arXiv:2502.02100v1 · 1 citation · h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper addresses a gap in NLP for Indic languages such as Marathi, but its contribution is incremental: it applies existing methods to a new domain rather than proposing a new one.

The paper studies topic modeling for Marathi by comparing BERT-based and non-BERT approaches, evaluated with topic coherence and topic diversity. It finds that BERTopic combined with BERT models trained on Indic languages outperforms LDA.

While topic modeling in English has become a prevalent and well-explored area, venturing into topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique challenges posed by Indic languages contribute to the scarcity of research and applications in this domain. Despite the growing interest in natural language processing and machine learning, there exists a noticeable gap in the comprehensive exploration of topic modeling methodologies tailored specifically for languages such as Hindi, Marathi, Tamil, and others. In this paper, we examine several topic modeling approaches applied to the Marathi language. Specifically, we compare various BERT and non-BERT approaches, including multilingual and monolingual BERT models, using topic coherence and topic diversity as evaluation metrics. Our analysis provides insights into the performance of these approaches for Marathi language topic modeling. The key finding of the paper is that BERTopic, when combined with BERT models trained on Indic languages, outperforms LDA in terms of topic modeling performance.
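The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which coherence variant the authors use, so the NPMI formulation below is an assumption, as is the common definition of topic diversity as the share of unique words among the top words of all topics. The function names and the toy Marathi corpus are hypothetical and exist only for illustration.

```python
from itertools import combinations
from math import log


def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of all topics."""
    top_words = [w for topic in topics for w in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def npmi_coherence(topics, documents, top_k=10):
    """Mean NPMI over word pairs in each topic, averaged across topics.

    Assumption: co-occurrence is counted per document (both words appear
    somewhere in the same tokenized document).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:top_k], 2):
            p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # words never co-occur: lower limit
            elif p12 == 1:
                pair_scores.append(1.0)  # words always co-occur: upper limit
            else:
                pair_scores.append(log(p12 / (p1 * p2)) / -log(p12))
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores) if topic_scores else 0.0


# Hypothetical toy data: two topics (sports, finance) and four tokenized documents.
toy_topics = [["क्रिकेट", "सामना"], ["बाजार", "शेअर"]]
toy_documents = [
    ["क्रिकेट", "सामना", "संघ"],
    ["क्रिकेट", "सामना"],
    ["बाजार", "शेअर"],
    ["बाजार", "शेअर", "किंमत"],
]

diversity = topic_diversity(toy_topics)
coherence = npmi_coherence(toy_topics, toy_documents)
print(f"diversity={diversity:.2f} coherence={coherence:.2f}")
```

In a real comparison, the topic word lists would come from each model under test (for example LDA and BERTopic), and the same tokenized corpus would be used to score all of them so the numbers are comparable.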

