SD, AI, CL, MM, AS | Mar 8, 2025

Bimodal Connection Attention Fusion for Speech Emotion Recognition

arXiv:2503.05858v3 | h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses speech emotion recognition, a domain-specific problem, with an incremental approach that improves accuracy for applications like human-computer interaction.

The paper addresses multi-modal emotion recognition by proposing the Bimodal Connection Attention Fusion (BCAF) method, which the authors report outperforms existing state-of-the-art baselines on the MELD and IEMOCAP datasets.

Multi-modal emotion recognition is challenging because it is difficult to extract features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which comprises three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
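To make the idea of inter-modal interaction concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between an audio sequence and a text sequence. It is not the BCAF architecture: the abstract does not specify the internals of the three modules, so the function names (`cross_attention`, `mean_pool`, `fuse`), the absence of learned projections, the mean pooling, and the toy feature values are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Each query frame attends over the other modality's frames.

    Hypothetical simplification: keys and values are the same vectors,
    and no learned projection matrices are applied.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                         for j in range(d)])
    return attended

def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    n = len(seq)
    return [sum(row[j] for row in seq) / n for j in range(len(seq[0]))]

def fuse(audio, text):
    """Assumed fusion: attend in both directions, pool, then concatenate."""
    audio_ctx = cross_attention(audio, text)   # audio queries, text keys/values
    text_ctx = cross_attention(text, audio)    # text queries, audio keys/values
    return mean_pool(audio_ctx) + mean_pool(text_ctx)  # list concatenation

# Toy features (made up): 3 audio frames and 2 text tokens, dimension 4.
audio = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
text = [[0.2, 0.1, 0.0, 0.4], [-0.3, 0.5, 0.1, 0.2]]
fused = fuse(audio, text)
```

The fused vector has twice the feature dimension, since it concatenates the pooled audio-to-text and text-to-audio contexts. A real system would add learned projections, multiple heads, and a classifier on top.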
