CV · MM · Oct 29, 2024

Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network

arXiv:2410.22023v3 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses speech emotion recognition, which is important for applications like human-computer interaction, but it appears incremental as it builds on existing transfer learning and multi-modal techniques.

The paper tackles multi-modal speech emotion recognition by proposing a feature distribution adaptation network that aligns visual and audio feature distributions to obtain consistent emotion representations, achieving excellent performance on two benchmark datasets compared to existing methods.

In this paper, we propose a novel deep inductive transfer learning framework, named the feature distribution adaptation network, to tackle the challenging problem of multi-modal speech emotion recognition. Our method uses deep transfer learning strategies to align visual and audio feature distributions and obtain a consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, a pre-trained ResNet-34 is used to extract features from facial expression images and from acoustic Mel spectrograms. A cross-attention mechanism is then introduced to model the intrinsic similarity relationships among the multi-modal features. Finally, multi-modal feature distribution adaptation is performed efficiently with a feed-forward network, which is extended using the local maximum mean discrepancy loss. Experiments are carried out on two benchmark datasets, and the results demonstrate that our model achieves excellent performance compared with existing methods.
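The abstract names a local maximum mean discrepancy (LMMD) loss but does not spell it out. The following is a minimal sketch of a class-conditional MMD in the style commonly used for subdomain adaptation, assuming the audio and visual features play the role of the two distributions and that hard emotion labels are available for both. The function names, the single Gaussian kernel, and the `bandwidth` parameter are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two feature vectors (sequences of floats)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def class_weights(labels, num_classes):
    """weights[c][i] = 1 / n_c if sample i has label c, else 0."""
    weights = []
    for c in range(num_classes):
        count = sum(1 for y in labels if y == c)
        weights.append([1.0 / count if y == c else 0.0 for y in labels])
    return weights


def weighted_kernel_sum(xs, wx, ys, wy, bandwidth):
    """Sum of wx[i] * wy[j] * k(xs[i], ys[j]) over all pairs with non-zero weight."""
    return sum(
        wx[i] * wy[j] * gaussian_kernel(xs[i], ys[j], bandwidth)
        for i in range(len(xs))
        for j in range(len(ys))
        if wx[i] and wy[j]
    )


def lmmd(audio_feats, audio_labels, visual_feats, visual_labels,
         num_classes, bandwidth=1.0):
    """Class-wise squared MMD between two feature sets, averaged over classes.

    Assumption: hard labels for both modalities. Classes missing from
    either modality contribute zero to the average.
    """
    wa = class_weights(audio_labels, num_classes)
    wv = class_weights(visual_labels, num_classes)
    total = 0.0
    for c in range(num_classes):
        if not any(wa[c]) or not any(wv[c]):
            continue  # class absent in one modality: skip it
        total += (
            weighted_kernel_sum(audio_feats, wa[c], audio_feats, wa[c], bandwidth)
            + weighted_kernel_sum(visual_feats, wv[c], visual_feats, wv[c], bandwidth)
            - 2.0 * weighted_kernel_sum(audio_feats, wa[c], visual_feats, wv[c], bandwidth)
        )
    return total / num_classes
```

In this sketch the loss is zero when the two modalities have identical per-class features and grows as the class-conditional distributions drift apart. In a training setup it would typically be added to the classification loss with a weighting factor, which is again an assumption about usage rather than something stated in the abstract.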

Code Implementations: 1 repo