Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
This work addresses speech emotion recognition for applications such as human-computer interaction. The contribution is incremental, since it adapts existing pre-trained models.
The authors tackle the limited size of speech emotion recognition datasets with a transfer learning method: features are extracted from pre-trained wav2vec 2.0 models and modeled with simple neural networks. They report performance on the IEMOCAP and RAVDESS databases that is superior to results in the literature.
Emotion recognition datasets are relatively small, which makes it challenging to use more sophisticated deep learning approaches. In this work, we propose a transfer learning method for speech emotion recognition in which features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the outputs of several layers of the pre-trained model using trainable weights, which are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without fine-tuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
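As a rough illustration of the layer combination described above, the following sketch computes a weighted sum of per-layer outputs in plain Python. The softmax normalization of the weights, the function names (`softmax`, `combine_layers`), and the toy shapes are assumptions made for this example and are not taken from the paper. In practice the weights would be parameters updated by gradient descent together with the downstream network, which this dependency-free sketch does not do.

```python
import math
from typing import List, Sequence

# One layer's output: a list of frames, each frame a feature vector.
Matrix = List[List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn unconstrained scores into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_outputs: Sequence[Matrix],
                   layer_logits: Sequence[float]) -> Matrix:
    """Weighted sum of per-layer features (hypothetical helper).

    layer_outputs: one (frames x dim) matrix per layer, all the same shape.
    layer_logits:  one scalar per layer; assumed here to be the trainable
                   quantity, normalized with a softmax (an assumption).
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    n_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(n_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(n_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


# Toy example: 2 layers, 1 frame, 2 feature dimensions.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
toy_combined = combine_layers(toy_layers, [0.0, 0.0])  # equal weights
```

With equal logits every layer contributes equally; if training raises one logit, the combination shifts toward that layer's features.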