Transformer for Emotion Recognition
The paper offers an incremental improvement for emotion recognition in video-based challenges.
The paper tackles emotion recognition by predicting the arousal and valence of each utterance from its surrounding context, and reports improvements for both unimodal and multimodal predictions.
This paper describes the UMONS solution for the OMG-Emotion Challenge. We explore a context-dependent architecture where the arousal and valence of an utterance are predicted according to its surrounding context (i.e. the preceding and following utterances of the video). We report an improvement when taking into account context for both unimodal and multimodal predictions.
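To make the idea of context-dependent prediction concrete, here is a minimal sketch of how each utterance's features could be mixed with those of the preceding and following utterances in the same video using scaled dot-product self-attention. This is an illustration under stated assumptions, not the UMONS architecture: the function names `softmax` and `contextualize` are hypothetical, the feature vectors are made up, and learned query/key/value projections are omitted (identity projections are assumed) so that the example stays dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(utterances):
    """Return one context-aware vector per utterance.

    `utterances` is a list of equal-length feature vectors in video order.
    Each output is an attention-weighted average over ALL utterances of the
    video, so it reflects both preceding and following context.
    Assumption: identity query/key/value projections (no learned weights).
    """
    dim = len(utterances[0])
    scale = math.sqrt(dim)
    contextual = []
    for query in utterances:
        # Similarity of this utterance to every utterance in the video.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in utterances
        ]
        weights = softmax(scores)
        # Weighted average of the utterance features.
        mixed = [
            sum(w * u[d] for w, u in zip(weights, utterances))
            for d in range(dim)
        ]
        contextual.append(mixed)
    return contextual


# Hypothetical 2-dimensional features for a three-utterance video.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_features = contextualize(video)
```

In a full model, each context-aware vector would then be passed to a regression head producing two values (arousal and valence); that head is left out here because the abstract does not describe it.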