CL · Aug 16, 2024

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

arXiv:2408.08964v3 · 21 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work provides a sentiment analysis resource for low-resource, code-mixed Bengali and fills a domain-specific gap, though the contribution is incremental because it builds on existing dataset creation practices.

The authors address the lack of a large-scale sentiment analysis dataset for code-mixed Bengali by introducing BnSentMix, a dataset of 20,000 samples drawn from diverse sources. Their baseline methods reach an overall accuracy of 69.8% and an F1 score of 69.1%.

The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
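To make the reported metrics concrete, here is a minimal sketch of how accuracy and an F1 score can be computed for a 4-label sentiment task. It rests on several assumptions: the label names are hypothetical (the abstract only states that there are 4 sentiment labels), the toy predictions are invented for illustration, and macro averaging is used because the abstract does not say which F1 variant the authors report.

```python
# Hypothetical label set: the abstract only says there are 4 sentiment labels.
LABELS = ["positive", "negative", "neutral", "mixed"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores (macro averaging is an assumption)."""
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A label that is never present and never predicted contributes 0.0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Toy gold labels and predictions, invented for illustration only.
y_true = ["positive", "negative", "neutral", "mixed", "positive", "negative"]
y_pred = ["positive", "negative", "mixed", "mixed", "negative", "negative"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Accuracy and F1 can diverge when labels are imbalanced or when one label is rarely predicted correctly, which is one reason both figures are usually reported together.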

