CL Sep 17, 2022

News Headlines Dataset For Sarcasm Detection

arXiv:2212.06035v11.628 citations h-index: 8 Has Code

Originality: Synthesis-oriented

AI Analysis

This dataset gives NLP researchers a less noisy, more accessible resource for sarcasm detection, though the contribution is incremental because it builds on existing data collection methods.

The authors address noisy, context-dependent sarcasm detection datasets by curating a new dataset of about 28K news headlines, 13K of them sarcastic, from TheOnion and HuffPost, aiming for cleaner labels and broader applicability.

Past studies in Sarcasm Detection mostly use Twitter datasets collected with hashtag-based supervision, but such datasets are noisy in both labels and language. Furthermore, many tweets are replies to other tweets, so detecting sarcasm in them requires the contextual tweets to be available. To overcome the limitations related to noise in Twitter datasets, we curate the News Headlines Dataset from two news websites: TheOnion, which aims at producing sarcastic versions of current events, and HuffPost, which publishes real news. The dataset contains about 28K headlines, of which 13K are sarcastic. To make it more useful, we have included the source links of the news articles so that more data can be extracted as needed. In this paper, we describe various details about the dataset and its potential use cases beyond Sarcasm Detection.
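As a rough illustration of how a dataset like the one described might be inspected, the sketch below counts labels and groups records by the domain of their source link. The record layout (one JSON object per line with the fields `headline`, `is_sarcastic`, and `article_link`) is an assumption and is not stated in the abstract; the sample records and URLs are placeholders written for this example, not entries from the dataset.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Placeholder records, NOT taken from the dataset. The field names
# (headline, is_sarcastic, article_link) are assumed for illustration.
SAMPLE_LINES = [
    '{"headline": "placeholder satirical headline", "is_sarcastic": 1, '
    '"article_link": "https://www.theonion.com/placeholder-1"}',
    '{"headline": "placeholder news headline a", "is_sarcastic": 0, '
    '"article_link": "https://www.huffpost.com/placeholder-2"}',
    '{"headline": "placeholder news headline b", "is_sarcastic": 0, '
    '"article_link": "https://www.huffpost.com/placeholder-3"}',
]


def load_records(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def label_counts(records):
    """Count records per value of the assumed is_sarcastic field."""
    return Counter(r["is_sarcastic"] for r in records)


def source_domains(records):
    """Count records per host name of the assumed article_link field."""
    return Counter(urlparse(r["article_link"]).netloc for r in records)


records = load_records(SAMPLE_LINES)
print(label_counts(records))
print(source_domains(records))
```

With a real file, `SAMPLE_LINES` could be replaced by the lines of an opened file handle, provided the file actually follows the assumed layout.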
