CV · SD · AS · IV · Feb 7, 2020

$M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild

arXiv:2002.02957v1 · 29 citations
AI Analysis

This work addresses emotion recognition in-the-wild for applications like human-computer interaction, but it is incremental as it builds on existing multi-modal and multi-task approaches.

The paper tackles continuous valence-arousal estimation from multi-modal data in real-world settings by proposing the $M^3$T framework, which fuses visual and acoustic features and significantly outperforms the baseline on the ABAW validation set.

This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020. In the proposed $M^3$T framework, we fuse visual features from videos and acoustic features from the audio tracks to estimate valence and arousal. The spatio-temporal visual features are extracted with a 3D convolutional network and a bidirectional recurrent neural network. Considering the correlations between valence / arousal, emotions, and facial actions, we also explore mechanisms to benefit from other tasks. We evaluated the $M^3$T framework on the validation set provided by ABAW, where it significantly outperforms the baseline method.
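The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 3D convolutional and bidirectional recurrent feature extractors are replaced by random placeholder arrays, and the function name `fuse_and_predict`, the feature sizes, the concatenation-based fusion, and the `tanh` output range of [-1, 1] are all assumptions made for illustration.

```python
import numpy as np


def fuse_and_predict(visual, acoustic, weights, bias):
    """Hypothetical feature-level fusion followed by a linear head.

    visual:   (T, Dv) per-frame visual features (placeholder for 3D-conv + BiRNN output)
    acoustic: (T, Da) per-frame acoustic features
    weights:  (Dv + Da, 2) projection to two outputs
    bias:     (2,)
    Returns a (T, 2) array; column 0 is treated as valence, column 1 as arousal.
    """
    # Fuse the two modalities by concatenating along the feature axis.
    fused = np.concatenate([visual, acoustic], axis=1)
    # tanh bounds each prediction in [-1, 1] (assumed target range).
    return np.tanh(fused @ weights + bias)


# Random stand-ins for real features and learned parameters.
rng = np.random.default_rng(0)
T, Dv, Da = 8, 16, 4
visual = rng.normal(size=(T, Dv))
acoustic = rng.normal(size=(T, Da))
weights = rng.normal(scale=0.1, size=(Dv + Da, 2))
bias = np.zeros(2)

predictions = fuse_and_predict(visual, acoustic, weights, bias)
valence, arousal = predictions[:, 0], predictions[:, 1]
```

Concatenation is only one possible fusion strategy; the abstract does not say which one the $M^3$T framework uses.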

Code Implementations: 1 repo
