CV, Feb 20, 2020

Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks

arXiv:2002.09023v1, 6 citations
AI Analysis

This work addresses emotion recognition for applications like human-computer interaction, but it is incremental as it builds on existing deep learning and hybrid approaches.

The paper tackles audiovisual emotion recognition in the wild with a hybrid network that combines multiple deep models trained on facial images and audio, and reports outperforming the baseline method by a large margin.

This paper presents an audiovisual emotion recognition hybrid network. While most previous work focuses on either deep models or hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neural networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps, which are built from short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin.
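The abstract does not say how the outputs of the individual branches are combined, so the following is only a minimal sketch of one common option, weighted late fusion of per-branch class probabilities. The branch names, the three-class toy scores, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def fuse_scores(
    branch_probs: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-branch class probabilities (late fusion).

    Assumption: every branch outputs a probability vector over the same
    ordered set of emotion classes. The paper's actual fusion rule may differ.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    names = list(branch_probs)
    n_classes = len(branch_probs[names[0]])
    if any(len(branch_probs[n]) != n_classes for n in names):
        raise ValueError("all branches must score the same number of classes")
    if weights is None:
        # Default: every branch contributes equally.
        weights = {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * n_classes
    for n in names:
        w = weights[n] / total  # normalise so the fused scores sum to 1
        for k, p in enumerate(branch_probs[n]):
            fused[k] += w * p
    return fused


def predict(fused: List[float]) -> int:
    """Index of the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores from four branches over three toy classes.
example = {
    "face_cnn_rnn": [0.6, 0.3, 0.1],
    "audio_svm": [0.2, 0.5, 0.3],
    "audio_lstm": [0.3, 0.4, 0.3],
    "audio_inception_lstm": [0.5, 0.25, 0.25],
}
uniform = fuse_scores(example)
svm_heavy = fuse_scores(
    example,
    {"face_cnn_rnn": 1, "audio_svm": 5, "audio_lstm": 1, "audio_inception_lstm": 1},
)
```

With uniform weights the face branch and the Inception-LSTM branch together pull the decision toward class 0, while up-weighting the SVM branch shifts it to class 1. In practice such weights would be tuned on a validation set.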

