CL · Mar 11, 2018

Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages

Soumil Mandal, Sainik Kumar Mahata, Dipankar Das

arXiv:1803.04000v12.745 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses the poor performance of sentiment analysis on code-mixed social media data in Indian languages. The contribution is incremental: it focuses on data preparation rather than on novel methods.

The authors tackle the lack of resources for sentiment analysis of code-mixed Bengali-English data by building a gold-standard corpus from Twitter, with high inter-annotator agreement as measured by Kappa values.

Analysis of the informative content and sentiments of social media users has been attempted quite intensively in the recent past. Most systems are usable only for monolingual data and fail or give poor results when applied to code-mixed data. To draw attention to this problem and encourage researchers to work on it, we prepared gold-standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis. In this paper, we discuss the systems we built to collect and filter raw Twitter data. To reduce manual work during annotation, hybrid systems combining rule-based and supervised models were developed for both language and sentiment tagging. The final corpus was annotated by a group of annotators following a few guidelines. The resulting gold-standard corpus has impressive inter-annotator agreement, measured in terms of Kappa values. Various metrics such as the Code-Mixed Index (CMI) and Code-Mixed Factor (CF), along with various aspects (language and emotion), were also used to qualitatively assess the code-mixed and sentiment properties of the corpus.
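As a rough illustration of two measures named in the abstract, the sketch below computes an utterance-level Code-Mixed Index and a two-annotator Kappa. The abstract gives no formulas, so both are assumptions: the CMI follows the commonly cited formulation 100 × (1 − max(w_i) / (n − u)), where n is the token count, u the number of language-independent tokens, and max(w_i) the count of the dominant language; the Kappa is Cohen's variant, which may differ from the one the authors used. The function names and the tag labels ("bn", "en", "univ") are hypothetical.

```python
from collections import Counter


def code_mixed_index(lang_tags, neutral_tags=("univ",)):
    """Utterance-level CMI (assumed formulation, not taken from the paper).

    lang_tags: one language tag per token, e.g. ["bn", "en", "univ"].
    neutral_tags: tags treated as language-independent (hypothetical label).
    """
    n = len(lang_tags)
    lang_counts = Counter(t for t in lang_tags if t not in neutral_tags)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:
        # No language-specific tokens at all: CMI is defined as 0.
        return 0.0
    return 100.0 * (1 - max(lang_counts.values()) / (n - u))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(code_mixed_index(["bn", "bn", "en", "univ"]))
    print(cohen_kappa(["pos", "neg", "pos", "neg"],
                      ["pos", "neg", "neg", "neg"]))
```

A monolingual utterance scores 0 under this CMI, and the score grows as the non-dominant language takes a larger share of the language-specific tokens.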
