SD CL AS Nov 25, 2024

A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning

arXiv:2411.19803v1

Originality: Incremental advance

AI Analysis

This work addresses poor generalization in SER, which matters for applications that need robust emotion detection across diverse data sources. The advance is incremental, since it builds on existing self-supervised models.

The paper tackles limited generalization in Speech Emotion Recognition (SER) across different datasets by proposing a cross-corpus method based on supervised contrastive learning. It reports unweighted accuracies of 77.41% on IEMOCAP and 96.49% on CASIA, which the authors state outperform state-of-the-art results on both datasets.
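Unweighted accuracy in SER is commonly computed as the mean of per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The sketch below illustrates that common definition; the function name and the toy labels are hypothetical, and the paper's own evaluation code may differ.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (illustrative definition of UA)."""
    correct = defaultdict(int)  # correctly predicted utterances per true class
    total = defaultdict(int)    # utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Average the per-class recalls so rare classes weigh as much as common ones.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example (made-up labels): a model that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
ua = unweighted_accuracy(y_true, y_pred)
```

On this toy input the plain accuracy is 0.8, while the per-class recalls are 1.0 and 0.0, so the unweighted accuracy is 0.5. This is why UA is a stricter measure than plain accuracy on imbalanced emotion corpora.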

Research on Speech Emotion Recognition (SER) often faces challenges such as the lack of large-scale public datasets and limited generalization capability when dealing with data from different distributions. To address these problems, this paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning. The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned using supervised contrastive learning on multiple speech emotion datasets; then, the classifier is fine-tuned on the target dataset. The experimental results show that the WavLM-based model achieved an unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset, outperforming the state-of-the-art results on both datasets.
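To make the first fine-tuning stage more concrete, here is a minimal sketch of a supervised contrastive loss in its widely used general form: utterances with the same emotion label are pulled together in embedding space and all others are pushed apart. It is written in plain Python for readability and is not the authors' implementation. The function names, the temperature value, and the toy embeddings are assumptions, and in practice the embeddings would come from the speech encoder (for example WavLM) rather than being hand-written.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average, over anchors with at least one positive, of the negative mean
    log-probability of the anchor's same-label samples (assumed general form)."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without same-label partners contributes nothing
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy 2-D embeddings (hypothetical): two tight clusters.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(emb, [0, 0, 1, 1])  # labels match clusters
loss_mixed = supervised_contrastive_loss(emb, [0, 1, 0, 1])    # labels cut across clusters
```

The loss is lower when the labels agree with the cluster structure than when they cut across it, which is the signal the first stage uses to shape the encoder. Because the loss depends only on labels and not on corpus identity, batches can mix utterances from several emotion datasets. The second stage would then train a classifier on the target dataset on top of the resulting encoder.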
