CV · Feb 20, 2020

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

Xin Guo, Luisa F. Polanía, Kenneth E. Barner

arXiv:2002.09023v13.36 citations

Originality: Incremental advance

AI Analysis

This work addresses emotion recognition for applications like human-computer interaction, but it is incremental as it builds on existing deep learning and hybrid approaches.

The paper tackles emotion recognition from in-the-wild audiovisual data by proposing a hybrid network that combines multiple deep models for images and audio, and it reports outperforming the baseline method by a large margin.

This paper presents an audiovisual-based emotion recognition hybrid network. While most of the previous work focuses either on using deep models or hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neural networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps, which are built based on short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin.
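The abstract lists five component models but does not say how their outputs are combined. As a rough illustration only, the sketch below assumes score-level (late) fusion: each model emits a probability vector over the emotion classes, and the vectors are combined by a weighted average. The label set, the model names, the weights, and the functions `fuse_scores` and `predict` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def fuse_scores(
    scores: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-model class-probability vectors (late fusion).

    This is an assumed fusion rule, not the one used by the authors.
    """
    if not scores:
        raise ValueError("need the scores of at least one model")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(next(iter(scores.values())))
    fused = [0.0] * n_classes
    for name, probs in scores.items():
        if len(probs) != n_classes:
            raise ValueError(f"{name}: expected {n_classes} scores, got {len(probs)}")
        for i, p in enumerate(probs):
            # Normalise by the total weight so the result still sums to 1.
            fused[i] += weights[name] * p / total_weight
    return fused


def predict(scores: Dict[str, Sequence[float]], weights: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    fused = fuse_scores(scores, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Made-up scores for one video clip, one entry per component model.
example_scores = {
    "face_cnn": [0.10, 0.60, 0.20, 0.10],
    "face_rnn": [0.20, 0.50, 0.20, 0.10],
    "audio_svm": [0.40, 0.20, 0.30, 0.10],
    "audio_lstm": [0.30, 0.30, 0.30, 0.10],
    "audio_inception_lstm": [0.25, 0.35, 0.30, 0.10],
}
example_weights = {name: 1.0 for name in example_scores}
```

With equal weights, `predict(example_scores, example_weights)` picks the class that the face models favour most strongly. In practice the weights would typically be tuned on a validation set, so that more reliable modalities count for more.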
