Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech
This work addresses the problem of early diagnosis of Major Depressive Disorder for healthcare applications, but it is incremental as it builds on existing deep learning approaches for speech-based affect recognition.
The authors tackle emotion and depression recognition from speech by introducing EmoAudioNet, a deep neural network that learns from both time-frequency and visual representations of audio signals, achieving performance similar to or better than state-of-the-art methods on the RECOLA and DAIC-WOZ datasets for predicting arousal, valence, and depression.
Intelligent monitoring systems and affective computing applications have emerged in recent years to enhance healthcare. One example is the assessment of affective disorders such as Major Depressive Disorder (MDD). MDD is characterized by the constant expression of certain emotions: negative emotions (low Valence) and lack of interest (low Arousal). High-performing intelligent systems would enhance MDD diagnosis in its early stages. In this paper, we present a new deep neural network architecture, called EmoAudioNet, for emotion and depression recognition from speech. EmoAudioNet learns from the time-frequency representation of the audio signal and the visual representation of its frequency spectrum. Our model shows very promising results in predicting affect and depression: it performs on par with or better than state-of-the-art methods according to several evaluation metrics on the RECOLA and DAIC-WOZ datasets in predicting arousal, valence, and depression. The code of EmoAudioNet is publicly available on GitHub: https://github.com/AliceOTHMANI/EmoAudioNet
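To make the idea of a time-frequency representation that can also be treated as an image more concrete, the following is a minimal sketch using only NumPy. It is not taken from the EmoAudioNet code: the function names `log_spectrogram` and `to_image`, the 400-sample Hann window, the 160-sample hop, and the 16 kHz sample rate in the usage note are all illustrative assumptions, and the abstract does not state which features or parameters the authors actually use.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Return a (frames, frame_len // 2 + 1) log-magnitude spectrogram.

    frame_len and hop are hypothetical defaults (25 ms / 10 ms at 16 kHz),
    not values reported in the source.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of each frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)


def to_image(spec):
    """Min-max scale a spectrogram to [0, 1] so it can be handled as a grayscale image."""
    lo, hi = spec.min(), spec.max()
    if hi == lo:
        return np.zeros_like(spec)
    return (spec - lo) / (hi - lo)
```

As a usage example under the same assumptions, a one-second 440 Hz tone sampled at 16 kHz can be passed through `log_spectrogram` and then `to_image`; the resulting 2D array has one row per frame and one column per frequency bin, which is the kind of input a convolutional network could consume as an image.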