Adapting WavLM for Speech Emotion Recognition
This work addresses the challenge of fine-tuning pre-trained models for speech emotion recognition; the contribution is incremental, as it builds on existing WavLM methods.
The paper investigates fine-tuning strategies for the WavLM Large model in speech emotion recognition, achieving competitive results on the MSP Podcast Corpus, and describes a submission to the Speech Emotion Recognition Challenge 2024.
Recently, the use of self-supervised learning (SSL) speech models for downstream tasks has drawn considerable attention. While large pre-trained models commonly outperform smaller models trained from scratch, the question of which fine-tuning strategies are optimal remains open. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments focused on using gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.