CV | SD | AS | Apr 17, 2023

Recursive Joint Attention for Audio-Visual Fusion in Regression based Emotion Recognition

arXiv:2304.07958v1 | 21 citations | h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the challenge of leveraging complementary audio-visual relationships for emotion recognition, which is important for applications like human-computer interaction, but it appears incremental as it builds on existing fusion techniques.

The paper tackles the problem of effectively fusing audio and visual modalities for emotion recognition in videos by proposing a recursive joint attention model with LSTMs, and it reports significant performance improvements over state-of-the-art methods on the Affwild2 and Fatigue datasets.

In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between the audio (A) and visual (V) modalities, while retaining the intra-modal characteristics of the individual modalities. In this paper, a recursive joint attention model is proposed along with long short-term memory (LSTM) modules for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigated the possibility of exploiting the complementary nature of the A and V modalities using a joint cross-attention model in a recursive fashion, with LSTMs to capture the temporal dependencies within each modality as well as among the A-V feature representations. By integrating LSTMs with recursive joint cross-attention, our model can efficiently leverage both intra- and inter-modal relationships for the fusion of A and V modalities. The results of extensive experiments performed on the challenging Affwild2 and Fatigue (private) datasets indicate that the proposed A-V fusion model can significantly outperform state-of-the-art methods.
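The abstract does not spell out the attention equations, so the following is only a minimal NumPy sketch of what a recursive joint cross-attention step might look like, not the authors' implementation. It assumes each modality is correlated with the concatenated (joint) A-V representation, that the attended features are added back residually, and that the same weights are reused across recursion steps. The function names, the parameter names (`w_j`, `w_x`, `w_c`, `w_h`), the weight shapes, and the number of recursion steps are all hypothetical, and the LSTM modules mentioned in the abstract are omitted for brevity.

```python
import numpy as np


def init_params(L, d, k, rng):
    """Random weights for one modality (hypothetical shapes, untrained)."""
    return {
        "w_j": rng.normal(scale=0.1, size=(L, L)),  # modality-to-joint correlation
        "w_x": rng.normal(scale=0.1, size=(k, L)),  # projects the modality features
        "w_c": rng.normal(scale=0.1, size=(k, d)),  # projects the correlation matrix
        "w_h": rng.normal(scale=0.1, size=(L, k)),  # maps attention back to L steps
    }


def attend(x, j, p):
    """Attend one modality x of shape (L, d_m) against the joint features j of shape (L, d)."""
    d = j.shape[1]
    c = np.tanh(x.T @ p["w_j"] @ j / np.sqrt(d))        # (d_m, d) correlation
    h = np.maximum(0.0, p["w_x"] @ x + p["w_c"] @ c.T)  # (k, d_m) attention map
    return p["w_h"] @ h + x                             # (L, d_m), residual connection


def recursive_joint_attention(x_a, x_v, p_a, p_v, steps=2):
    """Apply joint cross-attention repeatedly, feeding attended features back in."""
    for _ in range(steps):
        j = np.concatenate([x_a, x_v], axis=1)          # (L, d_a + d_v) joint features
        x_a, x_v = attend(x_a, j, p_a), attend(x_v, j, p_v)
    return np.concatenate([x_a, x_v], axis=1)           # fused A-V representation


rng = np.random.default_rng(0)
L, d_a, d_v, k = 8, 4, 6, 5                             # toy sizes, chosen arbitrarily
x_a = rng.normal(size=(L, d_a))                         # stand-in audio features
x_v = rng.normal(size=(L, d_v))                         # stand-in visual features
p_a = init_params(L, d_a + d_v, k, rng)
p_v = init_params(L, d_a + d_v, k, rng)
fused = recursive_joint_attention(x_a, x_v, p_a, p_v, steps=2)
print(fused.shape)  # (8, 10)
```

In a regression setting such as the one the abstract describes, the fused representation would then be passed to a prediction head; that head, and any LSTM layers placed before or after the attention steps, are left out of this sketch.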

Code Implementations: 1 repo