CV · AI · Jul 8, 2020

Temporal aggregation of audio-visual modalities for emotion recognition

arXiv:2007.04364v1 · 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses emotion recognition for affective computing and human-computer interaction, but the contribution appears incremental because it builds on existing multimodal fusion approaches.

The paper proposes a multimodal fusion technique for emotion recognition that combines audio and visual modalities from a temporal window, using a different temporal offset for each modality. The authors report that it outperforms other methods from the literature, as well as the human accuracy rating, on the CREMA-D dataset.

Emotion recognition has a pivotal role in affective computing and in human-computer interaction. The current technological developments lead to increased possibilities of collecting data about the emotional state of a person. In general, human perception regarding the emotion transmitted by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information seems to be the preferred choice in most of the current approaches towards emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature and human accuracy rating. The experiments are conducted over the open-access multimodal dataset CREMA-D.
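The idea of taking a temporal window with a different offset per modality can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature dimensions, frame rates, offset values, and the mean-pooling plus concatenation fusion step are all assumptions chosen only to make the windowing concrete.

```python
import numpy as np

def window_with_offset(features, rate_hz, start_s, length_s, offset_s=0.0):
    """Return the frames covering [start_s + offset_s, start_s + offset_s + length_s).

    `features` is a (num_frames, dim) array sampled at `rate_hz` frames per second.
    """
    first = max(int(round((start_s + offset_s) * rate_hz)), 0)
    count = int(round(length_s * rate_hz))
    window = features[first:first + count]
    if window.shape[0] == 0:
        raise ValueError("window falls outside the available frames")
    return window

def fuse_window(audio, video, audio_rate_hz, video_rate_hz,
                start_s, length_s, audio_offset_s=0.0, video_offset_s=0.0):
    """Hypothetical fusion: mean-pool each modality's window, then concatenate."""
    a = window_with_offset(audio, audio_rate_hz, start_s, length_s, audio_offset_s)
    v = window_with_offset(video, video_rate_hz, start_s, length_s, video_offset_s)
    return np.concatenate([a.mean(axis=0), v.mean(axis=0)])

# Toy data (assumed shapes): 3 s of audio features at 100 frames/s, 13 dims,
# and 3 s of video features at 30 frames/s, 8 dims.
audio = np.arange(300 * 13, dtype=float).reshape(300, 13)
video = np.arange(90 * 8, dtype=float).reshape(90, 8)

# One-second window starting at t = 1 s; the audio window is shifted by 0.2 s.
fused = fuse_window(audio, video, 100, 30, start_s=1.0, length_s=1.0,
                    audio_offset_s=0.2, video_offset_s=0.0)
```

In this sketch the fused vector would feed a downstream emotion classifier. A real system would more likely learn the per-modality representations instead of mean-pooling fixed features.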
