MSP-Conversation: A Corpus for Naturalistic, Time-Continuous Emotion Recognition
This work addresses the shortage of naturalistic datasets available to researchers in affective computing and speech emotion recognition, though its contribution is incremental because it builds on existing annotation methods and corpora.
The authors address the need for large-scale, naturalistic emotional corpora in speech emotion recognition (SER) by introducing the MSP-Conversation corpus, which includes more than 70 hours of conversational audio with time-continuous emotional annotations and speaker diarizations. They present it as a resource for advancing research in dynamic SER.
Affective computing aims to understand and model human emotions for computational systems. Within this field, speech emotion recognition (SER) focuses on predicting emotions conveyed through speech. While early SER systems relied on limited datasets and traditional machine learning models, recent deep learning approaches demand large-scale, naturalistic emotional corpora. To address this need, we introduce the MSP-Conversation corpus: a dataset of more than 70 hours of conversational audio with time-continuous emotional annotations and detailed speaker diarizations. The time-continuous annotations capture the dynamic and context-dependent nature of emotional expression, and they include fine-grained temporal traces of valence, arousal, and dominance. The audio data is sourced from publicly available podcasts and overlaps with a subset of the isolated speaking turns in the MSP-Podcast corpus to facilitate direct comparisons between annotation methods (i.e., in-context versus out-of-context annotations). The paper outlines the development of the corpus, the annotation methodology, analyses of the annotations, and baseline SER experiments, establishing the MSP-Conversation corpus as a valuable resource for advancing research in dynamic SER in naturalistic settings.
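To make the in-context versus out-of-context comparison concrete, the following minimal sketch shows one way a time-continuous trace could be reduced to a single score for a speaking turn, so that it can be set beside a turn-level label. It is illustrative only: the function name `turn_level_score`, the sampling interval, the trace values, and the choice of a plain average are all assumptions, not the corpus format or the authors' procedure.

```python
from statistics import mean


def turn_level_score(times, values, start, end):
    """Average a time-continuous trace over one speaking turn.

    times  -- sample timestamps in seconds (hypothetical format)
    values -- trace values (e.g. arousal) at those timestamps
    start, end -- turn boundaries in seconds, inclusive
    Returns None when no sample falls inside the turn.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    # Keep only the samples whose timestamp lies inside the turn.
    window = [v for t, v in zip(times, values) if start <= t <= end]
    if not window:
        return None
    return mean(window)


# Hypothetical arousal trace sampled every 0.5 s (made-up values).
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
arousal = [0.1, 0.2, 0.4, 0.6, 0.5, 0.3]

# Score for a turn assumed to span 1.0 s to 2.0 s.
score = turn_level_score(times, arousal, 1.0, 2.0)
print(score)
```

A plain mean is the simplest aggregation; a median or a duration-weighted average would be equally plausible choices for such a comparison.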