MuSE-ing on the Impact of Utterance Ordering On Crowdsourced Emotion Annotations
This paper addresses the problem of annotation subjectivity in emotion recognition, which matters to both researchers and practitioners, but its contribution is incremental because it focuses on a single methodological aspect.
The study investigated how the order of presenting video clips affects emotion annotations and algorithm performance, finding that contextual ordering yields annotations more similar to speaker self-reports, while randomized ordering makes labels easier for automated systems to predict.
Emotion recognition algorithms rely on data annotated with high-quality labels. However, emotion expression and perception are inherently subjective, so there is generally no single annotation that can be unambiguously declared "correct". As a result, annotations are colored by the manner in which they were collected. In this paper, we conduct crowdsourcing experiments to investigate how the collection procedure affects both the annotations themselves and the performance of these algorithms. We focus on one critical question: the effect of context. We present a new emotion dataset, Multimodal Stressed Emotion (MuSE), and annotate it under two conditions: randomized, in which annotators are presented with clips in random order, and contextualized, in which annotators are presented with clips in their original order. We find that contextualized labeling results in annotations that are more similar to a speaker's own self-reported labels, whereas labels generated under the randomized scheme are more easily predicted by automated systems.
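As an illustration only, the two presentation conditions can be sketched in a few lines of Python. The `Clip` type, its `session_id` and `index` fields, and both function names are hypothetical and do not come from the MuSE release; the sketch also assumes that randomization shuffles clips across all sessions, which the abstract does not specify.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical stand-in for one utterance-level clip."""
    session_id: str  # assumed identifier of the recording the clip came from
    index: int       # assumed position of the utterance within that recording


def contextualized_order(clips):
    """Present each session's clips together, in their original order."""
    return sorted(clips, key=lambda c: (c.session_id, c.index))


def randomized_order(clips, seed=0):
    """Shuffle all clips so that neighbouring clips share no guaranteed context."""
    shuffled = list(clips)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled


# Toy data: two sessions with three utterances each.
clips = [Clip(s, i) for s in ("s1", "s2") for i in range(3)]
```

Under these assumptions, both functions return the same set of clips, and only the order in which an annotator would see them differs.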