CL, May 30, 2020

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

arXiv:2006.00210v1, 1025 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of datasets for code-mixed sentiment analysis, specifically for Malayalam-English. The contribution is incremental: it extends existing work to a new language pair.

The paper tackles the lack of resources for sentiment analysis of code-mixed Malayalam-English text by creating a new gold standard corpus. The annotation reached a Krippendorff's alpha above 0.8, and the corpus is used to establish a benchmark for this task.

There is an increasing demand for sentiment analysis of social media text, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for building models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff's alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.
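The abstract reports annotation reliability as Krippendorff's alpha. As a rough illustration of what that number measures, the following is a minimal sketch of alpha for nominal labels, computed from a coincidence matrix as alpha = 1 - D_o / D_e (observed over expected disagreement). It is not the authors' script; the function name `krippendorff_alpha_nominal`, the input layout, and the example labels are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is a list of the labels assigned
    to it by different annotators (None marks a missing annotation).
    This input layout is an assumption of the sketch.
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # annotators on the same item contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per label and the overall number of pairable values n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no item has at least two annotations")

    # Observed disagreement mass (off-diagonal cells) and its chance-expected
    # counterpart; the common factor 1/n cancels in the ratio D_o / D_e.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Hypothetical example: three annotators, four comments, illustrative labels.
example = [
    ["Positive", "Positive", "Positive"],
    ["Negative", "Negative", "Positive"],
    ["Mixed", "Mixed", None],
    ["Negative", "Negative", "Negative"],
]
print(round(krippendorff_alpha_nominal(example), 3))
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so a value above 0.8 is conventionally read as reliable annotation.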

Code Implementations: 1 repo