A Crowd-Annotated Spanish Corpus for Humor Analysis
This work provides a dataset for researchers in computational humor, but the contribution is incremental, since it covers a single language (Spanish) and a single domain (tweets).
The authors address the lack of human-curated data for computational humor tasks by building a crowd-annotated corpus of 27,000 Spanish tweets, with an inter-annotator agreement (Krippendorff's alpha) of 0.5710, to support humor detection and analysis.
Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish and crowd-annotated by their humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement Krippendorff's alpha value is 0.5710. The dataset is available for general use and can serve as a basis for humor detection and as a first step to tackle subjectivity.
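To make the agreement figure concrete, the following is a minimal sketch of how Krippendorff's alpha can be computed for nominal labels when each item has a varying number of annotations, as with the roughly four annotations per tweet described above. It is an illustration only, not the authors' code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per tweet), and the toy labels are assumptions, and the text does not state which variant of alpha (nominal, ordinal, or interval) or which annotated variable produced the reported 0.5710.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (hypothetical helper).

    units: list of lists, one inner list per item (e.g. per tweet),
    holding the labels assigned by whichever annotators rated it.
    """
    # Coincidence matrix: every ordered pair of labels within an item
    # contributes 1 / (m - 1), where m is that item's annotation count.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single annotation carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more annotations")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    return 1.0 - observed / expected


# Toy data (made up): 1 = humorous, 0 = not humorous.
toy_units = [[1, 1, 1, 0], [0, 0, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1]]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so a value such as 0.5710 sits between the two. For an ordinal variable such as a funniness score, a different distance function would replace the nominal one used in this sketch.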