MM | CL | CV | LG | SD | AS | Jul 26, 2024

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

arXiv:2407.18552v4 | 15 citations | h-index: 6 | Has Code
AI Analysis

This work improves emotion recognition for applications such as human-computer interaction, but the contribution is incremental because it builds on existing transformer and attention methods.

The paper tackles multimodal emotion recognition by addressing temporal misalignment and suboptimal fusion of audio-visual cues. The resulting model, AVT-CA, is reported to outperform state-of-the-art baselines on three benchmark datasets (CMU-MOSEI, RAVDESS, and CREMA-D) in both accuracy and F1-score.

Multimodal emotion recognition (MER) aims to infer human affect by jointly modeling audio and visual cues; however, existing approaches often struggle with temporal misalignment, weakly discriminative feature representations, and suboptimal fusion of heterogeneous modalities. To address these challenges, we propose AVT-CA, an Audio-Video Transformer architecture with cross attention for robust emotion recognition. The proposed model introduces a hierarchical video feature representation that combines channel attention, spatial attention, and local feature extraction to emphasize emotionally salient regions while suppressing irrelevant information. These refined visual features are integrated with audio representations through an intermediate transformer-based fusion mechanism that captures interlinked temporal dependencies across modalities. Furthermore, a cross-attention module selectively reinforces mutually consistent audio-visual cues, enabling effective feature selection and noise-aware fusion. Extensive experiments on three benchmark datasets, CMU-MOSEI, RAVDESS, and CREMA-D, demonstrate that AVT-CA consistently outperforms state-of-the-art baselines, achieving significant improvements in both accuracy and F1-score. Our source code is publicly available at https://github.com/shravan-18/AVTCA.
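The cross-attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the linked repository is the reference for that); it assumes plain scaled dot-product attention with no learned projections, and the names `cross_attention`, `audio`, and `video`, as well as the toy feature values, are hypothetical. Each time step of one modality queries the other modality and receives a weighted mix of its features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: time steps of the attending modality (list of feature vectors)
    keys, values: time steps of the attended modality
    Returns the attended features and the attention weights.
    Assumption: no learned Q/K/V projections, unlike a trained model.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(wj * v[c] for wj, v in zip(w, values))
             for c in range(len(values[0]))]
        )
    return attended, weights


# Toy features (hypothetical): 3 audio steps and 2 video steps, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0]]

# Audio attends to video, and video attends to audio.
audio_attended, audio_to_video = cross_attention(audio, video, video)
video_attended, video_to_audio = cross_attention(video, audio, audio)
```

The two sequences may have different lengths, since each query step produces its own distribution over the other modality's steps. A trained model would typically add learned projections, multiple heads, and residual connections on top of this core operation.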

Code Implementations: 1 repo