CL · May 24, 2025

StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos

arXiv:2505.18903v1 · 7 citations · h-index: 4 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of improving computational humor detection for researchers and developers, but it is incremental as it builds on existing datasets and methods.

The authors tackled humor detection in stand-up comedy by creating a large, multilingual dataset of over 330 hours across seven languages, annotated with laughter and joke tags, and proposed a method for enhanced laughter detection using ASR errors.

Aiming to improve current computational models of humor detection, we propose a new multimodal dataset of stand-up comedy in seven languages: English, French, Spanish, Italian, Portuguese, Hungarian and Czech. Our dataset of more than 330 hours is, at the time of writing, the largest available for this type of task, and the most diverse. The whole dataset is automatically annotated for audience laughter, and the subset held out for model validation is manually annotated. Contrary to contemporary approaches, we do not frame humor detection as binary sequence classification but as word-level sequence labeling, in order to take the full context of the sequence into account and to capture the continuous joke-tagging mechanism that typically occurs in natural conversation. Alongside unimodal baseline results, we propose a method to enhance automatic laughter detection based on Automatic Speech Recognition (ASR) errors. Our code and data are available online: https://tinyurl.com/EMNLPHumourStandUpPublic
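To make the difference between the two task framings concrete, here is a minimal Python sketch. It is not the authors' annotation pipeline: the `Word` type, the `label_words` and `label_sequence` helpers, the `max_gap` threshold and the toy timestamps are all hypothetical, and the rule "a word is positive if laughter starts shortly after it ends" is only one assumed way of projecting laughter intervals onto words.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """A transcribed word with timestamps in seconds (hypothetical structure)."""
    text: str
    start: float
    end: float


def label_words(words: List[Word],
                laughs: List[Tuple[float, float]],
                max_gap: float = 1.0) -> List[int]:
    """Word-level sequence labeling: one 0/1 label per word.

    Assumed rule (not from the paper): a word is labeled 1 when some
    laughter interval starts between the word's end and `max_gap`
    seconds later.
    """
    labels = []
    for w in words:
        triggers = any(0.0 <= laugh_start - w.end <= max_gap
                       for laugh_start, _ in laughs)
        labels.append(1 if triggers else 0)
    return labels


def label_sequence(words: List[Word],
                   laughs: List[Tuple[float, float]],
                   max_gap: float = 1.0) -> int:
    """Binary sequence classification: a single label for the whole span."""
    return int(any(label_words(words, laughs, max_gap)))


# Toy example with made-up timestamps and one audience laugh.
words = [
    Word("so", 0.0, 0.2),
    Word("my", 0.2, 0.4),
    Word("cat", 0.4, 0.8),
    Word("left", 0.8, 1.2),
    Word("me", 1.2, 1.5),
]
laughs = [(1.7, 3.0)]

labels = label_words(words, laughs)
sequence_label = label_sequence(words, laughs)
print(list(zip([w.text for w in words], labels)), sequence_label)
```

The per-word labels keep the information about where in the utterance the laughter trigger falls, whereas the single sequence label collapses it. That is the distinction the abstract draws between the two framings.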
