CL Aug 16, 2024

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

arXiv:2408.08964v3 | 13.821 citations | h-index: 9

Originality: Synthesis-oriented

AI Analysis

BnSentMix provides a sentiment analysis resource for code-mixed Bengali, a low-resource setting, and so addresses a domain-specific gap. The contribution is incremental, however, because it builds on existing dataset creation practices.

The authors tackled the lack of a large-scale sentiment analysis dataset for code-mixed Bengali by introducing BnSentMix, a dataset of 20,000 samples from diverse sources, and achieved an overall accuracy of 69.8% and F1 score of 69.1% with baseline methods.

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
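The two headline metrics above, accuracy and F1, can be illustrated with a minimal sketch for a four-label classifier. This is not the authors' evaluation code. The label names below are hypothetical, since the abstract states only that there are 4 sentiment labels, and macro-averaging is an assumption, since the abstract does not say how the F1 score is averaged across labels.

```python
from collections import Counter

# Hypothetical label names: the abstract gives the count (4) but not the names.
LABELS = ["positive", "negative", "neutral", "mixed"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_label_f1(y_true, y_pred, labels=LABELS):
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but the gold label was different
            fn[t] += 1  # gold label t was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A label that never appears in gold or predictions scores 0.0 here.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    scores = per_label_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)
```

Reporting the per-label scores next to the macro average is one way to surface the variation across sentiment labels that the abstract mentions, because a single overall number can hide a label on which the model performs poorly.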

