CV | Mar 28, 2024

Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition

arXiv:2403.19554v1 | 7.69 citations | h-index: 11 | ICME

Originality: Incremental advance

AI Analysis

This work addresses performance degradation in video-based emotion recognition, with applications such as affective computing. It is an incremental advance because it builds on existing cross-attention methods.

The paper addresses audio-visual emotion recognition, where weak complementary relationships between the audio and visual modalities can degrade performance. It proposes Dynamic Cross-Attention, which selects between cross-attended and unattended features according to the strength of that relationship, and reports consistent improvements on the RECOLA and Aff-Wild2 datasets.

In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor representations of audio-visual features, thus degrading the performance of the system. To address this issue, we propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly based on their strong or weak complementary relationship with each other, respectively. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit a strong complementary relationship, and unattended features otherwise. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare the proposed approach with other variants of cross-attention and show that the proposed model consistently improves the performance on both datasets.
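
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the layer's exact form, so the function names (`gate_logits`, `dynamic_select`), the two-way softmax over a linear projection of the cross-attended features, and the `temperature` parameter are all assumptions made for illustration. A low temperature pushes the gate toward a near-hard choice between the two feature vectors.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a low temperature sharpens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(attended, weights, bias):
    """Hypothetical gating layer: one linear score per candidate.

    weights has two rows (unattended score, attended score), each the same
    length as the attended feature vector. This form is an assumption.
    """
    return [
        sum(w * a for w, a in zip(row, attended)) + b
        for row, b in zip(weights, bias)
    ]


def dynamic_select(unattended, attended, weights, bias, temperature=0.1):
    """Blend unattended and cross-attended features using the gate.

    Returns the gated features and the gate probabilities
    (p_unattended, p_attended).
    """
    probs = softmax(gate_logits(attended, weights, bias), temperature)
    p_unatt, p_att = probs
    gated = [p_unatt * u + p_att * a for u, a in zip(unattended, attended)]
    return gated, probs
```

With gate weights that score the attended branch much higher, the output is effectively the cross-attended vector; with the scores reversed, it falls back to the unattended vector. In a trained model the weights would be learned rather than set by hand.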
