ML, LG | Aug 1, 2017

Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, Sune Lehmann

arXiv:1708.00524v2 | 1258 citations

AI Analysis

This work addresses data scarcity in social media NLP tasks, offering a more effective distant supervision approach for sentiment analysis and related domains, though it is incremental over previous methods.

The paper addresses the scarcity of annotated data in NLP by using a diverse set of emojis as distant supervision labels for learning representations. A single model pretrained on emoji prediction over 1246 million tweets achieves state-of-the-art performance on 8 benchmark datasets for sentiment, emotion, and sarcasm detection.

NLP tasks are often limited by scarcity of manually annotated data. In social media sentiment analysis and related tasks, researchers have therefore used binarized emoticons and specific hashtags as forms of distant supervision. Our paper shows that by extending the distant supervision to a more diverse set of noisy labels, the models can learn richer representations. Through emoji prediction on a dataset of 1246 million tweets containing one of 64 common emojis we obtain state-of-the-art performance on 8 benchmark datasets within sentiment, emotion and sarcasm detection using a single pretrained model. Our analyses confirm that the diversity of our emotional labels yields a performance improvement over previous distant supervision approaches.
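
To make the idea of emoji-based distant supervision concrete, here is a minimal sketch of how raw tweets could be turned into noisy (text, label) training pairs. It is an illustration under stated assumptions, not the authors' preprocessing: the four-emoji label set stands in for the 64 emojis mentioned in the abstract, the names `EMOJI_LABELS`, `make_example`, and `build_dataset` are hypothetical, and the rule of keeping only tweets with exactly one distinct label emoji is a simplification chosen for this example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical label set for illustration; the abstract mentions 64 common emojis.
EMOJI_LABELS = ["😂", "😍", "😭", "😡"]
LABEL_INDEX = {emoji: i for i, emoji in enumerate(EMOJI_LABELS)}


def make_example(tweet: str) -> Optional[Tuple[str, int]]:
    """Turn a raw tweet into (text_without_emoji, label_index).

    Returns None when the tweet is unusable under this sketch's
    simplifying rule: exactly one distinct label emoji must occur,
    and some text must remain once that emoji is removed.
    """
    found = {ch for ch in tweet if ch in LABEL_INDEX}
    if len(found) != 1:
        return None
    emoji = found.pop()
    # Remove the emoji so the model cannot read the label from the input,
    # then normalize whitespace.
    text = " ".join(tweet.replace(emoji, " ").split())
    if not text:
        return None
    return text, LABEL_INDEX[emoji]


def build_dataset(tweets: Iterable[str]) -> List[Tuple[str, int]]:
    """Collect all usable (text, label) pairs from an iterable of tweets."""
    examples = []
    for tweet in tweets:
        example = make_example(tweet)
        if example is not None:
            examples.append(example)
    return examples
```

A classifier trained to predict the label from the text on such pairs learns representations that can then be transferred to sentiment, emotion, or sarcasm benchmarks. The emoji is stripped from the input because it would otherwise leak the label.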

