Speech Emotion Recognition Using Quaternion Convolutional Neural Networks
This work addresses the challenge of inferring emotion from speech signals for applications such as human-computer interaction. The contribution is incremental: it builds on existing methods and adds a novel encoding approach.
The paper tackles speech emotion recognition by proposing a quaternion convolutional neural network (QCNN) model that encodes Mel-spectrogram features in an RGB quaternion domain. It reports state-of-the-art accuracy of 77.87% on RAVDESS and results comparable to the state of the art on IEMOCAP (70.46%) and EMO-DB (88.78%).
Although speech recognition has become a widespread technology, inferring emotion from speech signals remains a challenge. To address this problem, this paper proposes a quaternion convolutional neural network (QCNN) based speech emotion recognition (SER) model in which Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain. We show that our QCNN-based SER model outperforms real-valued methods on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 8 classes) dataset, achieving, to the best of our knowledge, state-of-the-art results. The QCNN also achieves results comparable to state-of-the-art methods on the Interactive Emotional Dyadic Motion Capture (IEMOCAP, 4 classes) and Berlin EMO-DB (7 classes) datasets. Specifically, the model achieves accuracies of 77.87%, 70.46%, and 88.78% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. In addition, our results show that the quaternion unit structure better encodes internal dependencies, which significantly reduces model size compared to other methods.
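To make the quaternion encoding and the model-size claim concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common convention: an RGB pixel of a Mel-spectrogram image is stored as a pure quaternion (zero real part, with R, G, B on the i, j, k axes), and quaternion weights act on it through the Hamilton product. The function names and the layer sizes are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# A quaternion q = r + x*i + y*j + z*k stored as (r, x, y, z).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Multiply two quaternions (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def rgb_to_quaternion(r: float, g: float, b: float) -> Quaternion:
    """Assumed encoding: pure quaternion 0 + R*i + G*j + B*k."""
    return (0.0, r, g, b)


def real_conv_params(in_ch: int, out_ch: int, k: int) -> int:
    """Weight count of a real-valued k x k convolution (bias ignored)."""
    return in_ch * out_ch * k * k


def quaternion_conv_params(in_ch: int, out_ch: int, k: int) -> int:
    """Weight count of a quaternion k x k convolution (bias ignored).

    in_ch and out_ch are counted in real channels and must be divisible
    by 4. Each quaternion weight holds 4 real numbers and links one
    quaternion input channel to one quaternion output channel.
    """
    assert in_ch % 4 == 0 and out_ch % 4 == 0
    return (in_ch // 4) * (out_ch // 4) * k * k * 4


if __name__ == "__main__":
    pixel = rgb_to_quaternion(0.2, 0.5, 0.9)  # hypothetical pixel value
    weight = (1.0, 0.0, 0.0, 0.0)             # identity quaternion
    print(hamilton_product(weight, pixel))
    print(real_conv_params(64, 128, 3), quaternion_conv_params(64, 128, 3))
```

Because the four real components of a quaternion weight are reused across all four components of the input, a quaternion layer in this sketch needs one quarter of the weights of a real-valued layer with the same channel counts. This sharing is one plausible reading of how the quaternion unit structure reduces model size.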