CV | Mar 28, 2024

Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition

arXiv:2403.19554v1 | 9 citations | h-index: 11 | ICME
Originality: Incremental advance
AI Analysis

This work addresses performance degradation in video-based emotion recognition systems, with applications such as affective computing. It is incremental in that it builds on existing cross-attention methods.

The paper tackles audio-visual emotion recognition, where weak complementary relationships between the audio and visual modalities can degrade performance. It proposes Dynamic Cross-Attention, which selects cross-attended or unattended features according to the strength of that relationship, and reports consistent improvements on the RECOLA and Aff-Wild2 datasets.

In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor representations of audio-visual features, thus degrading the performance of the system. To address this issue, we propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly based on their strong or weak complementary relationship with each other, respectively. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit a strong complementary relationship, otherwise unattended features. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare the proposed approach with other variants of cross-attention and show that the proposed model consistently improves the performance on both datasets.
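The gating idea described in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the function names, the 2 x d weight matrix `weights`, and the low softmax `temperature` are assumptions made for illustration. The sketch scores two options (keep the unattended features, or use the cross-attended ones) and blends them with a softmax that, at a low temperature, behaves close to a hard selection.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature (lower = sharper)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gate(unattended, attended, weights, temperature=0.1):
    """Blend unattended and cross-attended features for one time step.

    `weights` is a hypothetical 2 x d matrix (an assumption, not taken from
    the paper): row 0 scores the "unattended" option and row 1 scores the
    "cross-attended" option, both computed from the attended features.
    """
    logits = [sum(w * a for w, a in zip(row, attended)) for row in weights]
    p_unattended, p_attended = softmax(logits, temperature)
    fused = [
        p_unattended * u + p_attended * a
        for u, a in zip(unattended, attended)
    ]
    return fused, (p_unattended, p_attended)


# Illustrative inputs only; real features would come from audio/visual backbones.
unattended = [0.0, 0.0]
attended = [1.0, 1.0]

# Gate weights that favour the cross-attended features.
strong_fused, strong_probs = gate(unattended, attended, [[0.0, 0.0], [10.0, 10.0]])

# Gate weights that favour the unattended features.
weak_fused, weak_probs = gate(unattended, attended, [[10.0, 10.0], [0.0, 0.0]])
```

In a trained model the gate weights would be learned jointly with the rest of the network; here they are fixed by hand only to show both branches of the selection.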

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
