SD AI AS May 3, 2024

GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

arXiv:2405.02151v36.72 citations h-index: 11 SLT

Originality: Incremental advance

AI Analysis

This work addresses the challenge of capturing complex emotions within single utterances in SER, which is important for applications like human-computer interaction, but it appears incremental as it builds on existing pre-trained models and methods.

The paper tackled the problem of inadequate emotion capture in Speech Emotion Recognition (SER) by proposing GMP-TL, a framework using gender-augmented multi-scale pseudo-labels and transfer learning, achieving a WAR of 80.0% and UAR of 82.0% on IEMOCAP, outperforming state-of-the-art unimodal methods.

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL initially uses the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that our GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, achieving superior performance compared to state-of-the-art unimodal SER methods while also yielding comparable results to multimodal SER approaches.
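The two reported metrics are easy to confuse, so a minimal sketch of how they are commonly computed may help. It assumes the usual definitions: WAR (weighted average recall) is overall accuracy, and UAR (unweighted average recall) is the mean of the per-class recalls. The function name `war_uar` and the toy labels are hypothetical and are not taken from the paper or its evaluation code.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences.

    WAR: fraction of all utterances classified correctly.
    UAR: recall computed per class, then averaged with equal class weight.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = defaultdict(int)    # utterances per true class
    correct = defaultdict(int)  # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy, made-up labels (not IEMOCAP data): one majority class, two rare ones.
y_true = ["neu", "neu", "neu", "neu", "hap", "sad"]
y_pred = ["neu", "neu", "neu", "hap", "hap", "ang"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")
```

Because UAR gives every emotion class the same weight, it is less sensitive to class imbalance than WAR, which is presumably why SER papers tend to report both.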
