BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
BnSentMix provides a sentiment analysis resource for code-mixed Bengali-English, a low-resource setting, and fills a domain-specific gap. The contribution is incremental, since it builds on existing dataset creation practices.
The authors address the lack of a large-scale sentiment analysis dataset for code-mixed Bengali by introducing BnSentMix, a dataset of 20,000 samples with 4 sentiment labels drawn from diverse sources. Their baseline methods achieve an overall accuracy of 69.8% and an F1 score of 69.1%.
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, for which few datasets exist. Sentiment analysis is a fundamental text classification task that has been studied on code-mixed data in several languages. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods, including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
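To make the reported metrics concrete, the following is a minimal sketch of how accuracy and F1 can be computed for a 4-label sentiment classifier. It is illustrative only and is not the authors' evaluation code. The label names (`positive`, `negative`, `neutral`, `mixed`), the toy predictions, and the helper function names are assumptions, and the abstract does not state whether the 69.1% F1 score is macro-averaged or weighted, so both variants are shown.

```python
# Illustrative sketch only: label names and toy data are assumptions,
# not taken from the BnSentMix dataset or the authors' code.

LABELS = ["positive", "negative", "neutral", "mixed"]  # hypothetical label set


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_label_f1(y_true, y_pred, labels):
    """One-vs-rest F1 for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1; every label counts equally."""
    scores = per_label_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Mean of per-label F1 weighted by each label's gold-label count."""
    scores = per_label_f1(y_true, y_pred, labels)
    total = len(y_true)
    return sum(scores[l] * sum(t == l for t in y_true) / total for l in labels)


# Toy gold labels and predictions (two samples per label).
y_true = ["positive", "positive", "negative", "negative",
          "neutral", "neutral", "mixed", "mixed"]
y_pred = ["positive", "negative", "negative", "negative",
          "neutral", "mixed", "mixed", "positive"]

acc = accuracy(y_true, y_pred)
f1_by_label = per_label_f1(y_true, y_pred, LABELS)
f1_macro = macro_f1(y_true, y_pred, LABELS)
f1_weighted = weighted_f1(y_true, y_pred, LABELS)

print(f"accuracy    = {acc:.3f}")
print(f"macro F1    = {f1_macro:.3f}")
print(f"weighted F1 = {f1_weighted:.3f}")
for label in LABELS:
    print(f"  F1[{label}] = {f1_by_label[label]:.3f}")
```

The per-label breakdown is the kind of view that exposes the performance variation across sentiment labels mentioned in the abstract: an overall score can hide a label that the classifier handles poorly. In this toy example the labels are balanced, so the macro and weighted averages coincide; on an imbalanced label distribution they would generally differ.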