LG · AI · CV · MM · Mar 16, 2025

MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

arXiv:2503.12623v2 · 11 citations · h-index: 16 · Has Code · 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Incremental advance
AI Analysis

This work addresses the problem of recognizing transient emotions in real-world conversational videos for applications in affective computing, though it appears incremental as it builds on existing multi-modal approaches.

The paper tackles dynamic emotion recognition in the wild by proposing MAVEN, a multi-modal attention network that integrates visual, audio, and textual cues. MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061 on the Aff-Wild2 dataset, surpassing the baseline of 0.22.
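For readers unfamiliar with the metric, the concordance correlation coefficient can be sketched as below. This is a minimal illustration of Lin's standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), and is not taken from the MAVEN repository. The function name `concordance_cc` and the choice to return 0.0 for a zero denominator are assumptions made for this example.

```python
def concordance_cc(pred, target):
    """Lin's concordance correlation coefficient between two sequences.

    Uses population (divide-by-n) variance and covariance.
    Hypothetical helper for illustration only.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n

    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n

    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are the same constant; returning 0.0 here is an
        # assumed convention, not part of the metric's definition.
        return 0.0
    return 2.0 * cov / denom


if __name__ == "__main__":
    # Perfect agreement gives 1.0; a constant offset is penalised.
    print(concordance_cc([0.1, 0.4, -0.2], [0.1, 0.4, -0.2]))
    print(concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Unlike Pearson correlation, the squared mean difference in the denominator penalises predictions that follow the right trend but are shifted away from the ground truth, so a model must match both the shape and the scale of the annotations to score well.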

Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal and often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline model's CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
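To make the polar-coordinate view of the circumplex concrete, the sketch below converts a (valence, arousal) pair into an intensity and an angle and back again. It assumes the ordinary Cartesian-to-polar mapping with valence on the horizontal axis and arousal on the vertical axis; the abstract does not spell out MAVEN's exact parameterisation, so the function names `to_polar` and `to_valence_arousal` and the axis convention are assumptions for illustration, not the authors' implementation.

```python
import math


def to_polar(valence, arousal):
    """Map a (valence, arousal) point to (intensity, angle in radians).

    Assumed convention: valence is the x-axis, arousal is the y-axis.
    """
    intensity = math.hypot(valence, arousal)
    angle = math.atan2(arousal, valence)
    return intensity, angle


def to_valence_arousal(intensity, angle):
    """Inverse mapping from polar coordinates back to (valence, arousal)."""
    return intensity * math.cos(angle), intensity * math.sin(angle)


if __name__ == "__main__":
    # A high-arousal, mildly positive state.
    r, theta = to_polar(0.3, 0.8)
    v, a = to_valence_arousal(r, theta)
    print(f"intensity={r:.3f}, angle={math.degrees(theta):.1f} deg")
    print(f"recovered valence={v:.3f}, arousal={a:.3f}")
```

In this representation the angle encodes which kind of emotion is present and the radius encodes how strongly, so a single pair of outputs ties valence and arousal together rather than treating them as two unrelated regression targets.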

Code Implementations: 1 repo