AI · CL · CV · May 3, 2018

Framewise approach in multimodal emotion recognition in OMG challenge

arXiv:1805.01369v1
Originality: Synthesis-oriented
AI Analysis

This work addresses emotion recognition for applications like human-computer interaction, but it is incremental as it combines existing methods with minor improvements.

The paper tackles multimodal emotion recognition with an ensemble of single-modality models trained on voice and face data from video, achieving 53% unweighted accuracy over 7 emotions and mean squared errors of 0.05 and 0.09 for arousal and valence in the OMG challenge.

In this report we describe our approach, which achieves $53\%$ unweighted accuracy over $7$ emotions and mean squared errors of $0.05$ and $0.09$ for arousal and valence in the OMG emotion recognition challenge. Our results were obtained with an ensemble of single-modality models trained separately on voice and face data from video. We treat each stream as a sequence of frames, estimate features from the frames, and process them with a recurrent neural network. By an audio frame we mean a short $0.4$-second spectrogram interval. To estimate features from face images we used our own ResNet neural network pretrained on the AffectNet database. Each short spectrogram was treated as an image and likewise processed by a convolutional network. As the base audio model we used a ResNet pretrained on a speaker recognition task. Predictions from both modalities were fused at the decision level and improve on the single-channel approaches by a few percent.
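The two framewise ideas in the abstract, cutting a stream into short 0.4-second intervals and fusing the two modalities at the decision level, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: the function names `split_into_frames` and `fuse_decisions` are hypothetical, the frames are taken without overlap from a raw waveform (the paper works on spectrogram intervals), and the fusion rule is a plain weighted average of per-class probabilities.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_seconds: float = 0.4) -> List[Sequence[float]]:
    """Cut a signal into consecutive frames of `frame_seconds` each.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    frame_len = int(round(sample_rate * frame_seconds))
    if frame_len <= 0:
        raise ValueError("frame length must be positive")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def fuse_decisions(audio_probs: Sequence[float], face_probs: Sequence[float],
                   audio_weight: float = 0.5) -> List[float]:
    """Decision-level fusion as a weighted average of class probabilities.

    Assumption: the averaging rule and the default weight are illustrative
    choices, not the fusion scheme reported by the authors.
    """
    if len(audio_probs) != len(face_probs):
        raise ValueError("both models must score the same set of classes")
    w = audio_weight
    return [w * a + (1.0 - w) * f for a, f in zip(audio_probs, face_probs)]


if __name__ == "__main__":
    # One second of silence at 16 kHz yields two full 0.4 s frames.
    frames = split_into_frames([0.0] * 16000, sample_rate=16000)
    print(len(frames), len(frames[0]))

    # Toy two-class scores from the audio and face models.
    fused = fuse_decisions([0.2, 0.8], [0.6, 0.4])
    print(fused.index(max(fused)))  # index of the fused prediction
```

In this sketch each modality's model only needs to output per-class scores, so the single-modality models can be trained and swapped independently, which matches the separately trained ensemble described above.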

