CV · AI · Mar 12

Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

arXiv:2603.11971v13.7 · h-index: 2
Predicted impact: top 82% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses robust emotion recognition for applications in affective computing, though it is incremental as it builds on existing methods like CLIP and Wav2Vec 2.0.

The paper tackles emotion recognition in unconstrained video by proposing a multimodal framework that combines frozen pre-trained encoders, temporal modeling, and bi-directional cross-attention. It reports improved performance over unimodal approaches on the ABAW 10th EXPR benchmark.

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
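The bi-directional cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes both modalities have already been projected to a shared feature dimension, uses a single attention head with random (untrained) weights, and picks mean pooling, a residual connection, and an 8-way linear head purely for illustration. The names `cross_attention`, `bidirectional_fusion`, and `params` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `queries` attend to `context`.

    queries: (T_q, d), context: (T_c, d); all projection matrices are (d, d).
    Returns the residual-updated queries (T_q, d) and the weights (T_q, T_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return queries + weights @ v, weights


def bidirectional_fusion(visual, audio, params):
    """Symmetric fusion: visual attends to audio and audio attends to visual."""
    v_ctx, v_weights = cross_attention(visual, audio, *params["v2a"])
    a_ctx, a_weights = cross_attention(audio, visual, *params["a2v"])
    # Assumption: mean-pool each stream over time, then concatenate.
    fused = np.concatenate([v_ctx.mean(axis=0), a_ctx.mean(axis=0)])
    return fused, v_weights, a_weights


rng = np.random.default_rng(0)
dim = 16                                # assumed shared feature dimension
visual = rng.normal(size=(8, dim))      # stand-in for 8 frames of visual features
audio = rng.normal(size=(20, dim))      # stand-in for 20 steps of audio features

# Random projection weights (query, key, value) for each direction.
params = {
    name: [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]
    for name in ("v2a", "a2v")
}

fused, v_weights, a_weights = bidirectional_fusion(visual, audio, params)

num_classes = 8                         # assumed number of expression classes
w_head = rng.normal(scale=0.1, size=(2 * dim, num_classes))
probs = softmax(fused @ w_head)         # lightweight linear classification head
```

Each direction keeps its own projection weights, so the visual stream is contextualized by audio and the audio stream by visual features before pooling. In a trained model the inputs would come from the temporal visual branch and the audio encoder rather than random arrays.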
