ML, LG | Aug 1, 2017

Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm

arXiv:1708.00524v2 | 1258 citations
AI Analysis

This work addresses data scarcity in social media NLP tasks, offering a more effective distant supervision approach for sentiment analysis and related domains, though it is incremental over previous methods.

The paper tackled the problem of limited annotated data in NLP by using a diverse set of emojis as distant supervision to learn representations, achieving state-of-the-art performance on 8 benchmark datasets for sentiment, emotion, and sarcasm detection with a single model trained on 1246 million tweets.

NLP tasks are often limited by scarcity of manually annotated data. In social media sentiment analysis and related tasks, researchers have therefore used binarized emoticons and specific hashtags as forms of distant supervision. Our paper shows that by extending the distant supervision to a more diverse set of noisy labels, the models can learn richer representations. Through emoji prediction on a dataset of 1246 million tweets containing one of 64 common emojis, we obtain state-of-the-art performance on 8 benchmark datasets within sentiment, emotion and sarcasm detection using a single pretrained model. Our analyses confirm that the diversity of our emotional labels yields a performance improvement over previous distant supervision approaches.
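To make the idea of emoji-based distant supervision concrete, the following is a minimal sketch of how noisy labels could be derived from raw tweets. It is not the authors' pipeline. The four-emoji label set, the function name `make_distant_example`, and the rule of skipping tweets with zero or several distinct label emojis are all assumptions made for illustration (the paper uses 64 common emojis, which are not listed here).

```python
# Hypothetical label set for illustration only; the paper uses 64 common emojis.
EMOJI_LABELS = ["😂", "😍", "😭", "😡"]


def make_distant_example(tweet, labels=EMOJI_LABELS):
    """Turn a tweet into a (text, label_index) pair, or return None.

    The emoji acts as a noisy label and is removed from the input text,
    so a model has to predict it from the remaining words.
    """
    present = [i for i, emoji in enumerate(labels) if emoji in tweet]
    if len(present) != 1:
        # Assumption: keep only tweets with exactly one kind of label emoji.
        return None
    label = present[0]
    # Remove every occurrence of the label emoji and normalize whitespace.
    text = " ".join(tweet.replace(labels[label], " ").split())
    if not text:
        # Nothing left to learn from once the emoji is removed.
        return None
    return text, label


tweets = [
    "so funny 😂😂",
    "no emoji here",
    "I love this song 😍",
    "😂 and 😭",
]
examples = [make_distant_example(t) for t in tweets]
dataset = [ex for ex in examples if ex is not None]
```

A classifier pretrained to predict `label` from `text` on such pairs could then be fine-tuned on the smaller annotated sentiment, emotion, or sarcasm datasets.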

Code Implementations: 7 repos

Data from Papers with Code (CC-BY-SA-4.0)

