CL | LG | Jun 24, 2023

L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models

arXiv:2306.13888v1130 citations | h-index: 21 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides a resource for sentiment analysis in Marathi, addressing a gap for researchers and practitioners in low-resource language processing. The contribution is incremental, since it applies existing methods to new data.

The authors tackled the lack of datasets for sentiment analysis in low-resource languages like Marathi by creating L3Cube-MahaSent-MD, a multi-domain dataset with around 60,000 manually tagged samples across four domains, and achieved the best accuracy using the MahaBERT model.

The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited by the lack of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset covering four domains: movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering three sentiments: positive, negative, and neutral. We create a sub-dataset of 15k samples for each domain. MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis, highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
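The in-domain and cross-domain analysis mentioned in the abstract can be pictured as a grid: train on one domain, then score on every domain. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline. The names `train_majority`, `accuracy`, and `cross_domain_matrix` are hypothetical, the toy samples are invented, and a majority-label predictor stands in for a fine-tuned BERT model so that the example runs with the standard library alone.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]          # (text, sentiment label)
Model = Callable[[str], str]      # maps text to a predicted label


def train_majority(samples: List[Sample]) -> Model:
    """Stand-in 'model' (assumption): always predicts the most common
    training label. A real experiment would fine-tune a BERT model here."""
    majority = Counter(label for _, label in samples).most_common(1)[0][0]
    return lambda text: majority


def accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    correct = sum(1 for text, label in samples if model(text) == label)
    return correct / len(samples)


def cross_domain_matrix(
    train_sets: Dict[str, List[Sample]],
    test_sets: Dict[str, List[Sample]],
    train_fn: Callable[[List[Sample]], Model] = train_majority,
) -> Dict[str, Dict[str, float]]:
    """Train one model per source domain and score it on every target domain.

    matrix[src][tgt] is in-domain accuracy when src == tgt and
    cross-domain accuracy otherwise.
    """
    matrix: Dict[str, Dict[str, float]] = {}
    for src, train in train_sets.items():
        model = train_fn(train)
        matrix[src] = {tgt: accuracy(model, test) for tgt, test in test_sets.items()}
    return matrix


# Toy data for two of the four domains (invented placeholder samples).
toy_train = {
    "movie_reviews": [("m1", "positive"), ("m2", "positive"), ("m3", "negative")],
    "political_tweets": [("p1", "negative"), ("p2", "negative"), ("p3", "neutral")],
}
toy_test = {
    "movie_reviews": [("m4", "positive"), ("m5", "negative")],
    "political_tweets": [("p4", "negative"), ("p5", "neutral")],
}

for src, row in cross_domain_matrix(toy_train, toy_test).items():
    print(src, row)
```

Passing a different `train_fn` is how a real classifier would be plugged in. The grid logic stays the same whichever model produces the predictions.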

Code Implementations: 1 repo
