CL May 30, 2018

A Corpus of English-Hindi Code-Mixed Tweets for Sarcasm Detection

Sahil Swami, Ankush Khandelwal, Vinay Singh, Syed Sarfaraz Akhtar, Manish Shrivastava

arXiv:1805.11869v13.670 citations

Originality Synthesis-oriented

AI Analysis

This work addresses sarcasm detection for social media analysis, particularly in code-mixed contexts, but it is incremental as it focuses on dataset creation and a baseline method.

The authors tackled the problem of sarcasm detection in English-Hindi code-mixed tweets by creating the first annotated dataset for this task, and their baseline supervised classification system achieved an average F-score of 78.4 using a random forest classifier with 10-fold cross-validation.

Social media platforms like Twitter and Facebook have become two of the largest mediums people use to express their views on different topics. The generation of such large volumes of user data has made NLP tasks like sentiment analysis and opinion mining much more important. Using sarcasm in texts on social media has become a popular trend lately. Sarcasm reverses the meaning and polarity of what the text implies, which poses a challenge for many NLP tasks. The task of sarcasm detection in text is gaining importance for both commercial and security services. We present the first English-Hindi code-mixed dataset of tweets marked for the presence of sarcasm and irony, in which each token is also annotated with a language tag. We also present a baseline supervised classification system developed using the same dataset, which achieves an average F-score of 78.4 using a random forest classifier with 10-fold cross-validation.
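As a rough illustration of how an average F-score over 10-fold cross-validation can be computed, here is a minimal pure-Python sketch. It is not the authors' code: the function names (`f_score`, `k_fold_indices`, `cross_validated_f_score`) and the `train_and_predict` callback are hypothetical, the folds are contiguous and unshuffled for simplicity, and the F-score is taken for a single positive (sarcastic) class, which may differ from the averaging used in the paper. Any classifier, such as a random forest, would be plugged in through the callback.

```python
from typing import Callable, List, Sequence, Tuple


def f_score(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds and return (train, test) index lists."""
    # The first n % k folds receive one extra item so every index is used once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


def cross_validated_f_score(
    features: Sequence,
    labels: Sequence[int],
    train_and_predict: Callable[[list, list, list], List[int]],
    k: int = 10,
) -> float:
    """Average the per-fold F-score of a classifier supplied as a callback.

    train_and_predict(train_x, train_y, test_x) is a hypothetical hook that
    fits a model on the training fold and returns predictions for test_x.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        pred = train_and_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        scores.append(f_score([labels[i] for i in test_idx], pred))
    return sum(scores) / len(scores)
```

In practice the data would normally be shuffled, and often stratified by label, before splitting, so that each fold contains both sarcastic and non-sarcastic tweets.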
