SDAIAS | Apr 21, 2024

MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention

arXiv:2404.13509v1 | 12 citations | h-index: 12 | ICME
Originality: Incremental advance
AI Analysis

This work addresses the challenge of extracting emotional cues from audio for human-computer interaction, representing an incremental advance in speech emotion recognition.

The paper tackles speech emotion recognition by proposing MFHCA, a method using multi-spatial fusion and hierarchical cooperative attention on spectrograms and raw audio, achieving improvements of 2.6% in weighted accuracy and 1.87% in unweighted accuracy on the IEMOCAP dataset.

Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate HuBERT features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6% and 1.87% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
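
The abstract reports two metrics without defining them. As they are conventionally used in speech emotion recognition, weighted accuracy (WA) is the overall fraction of correctly classified utterances, and unweighted accuracy (UA) is the mean of the per-class recalls, so that rare emotions count as much as frequent ones. The sketch below illustrates those conventional definitions only. The function names and the toy labels are hypothetical, and the paper's exact evaluation protocol (folds, class set) is not stated here.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class contributes equally."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, made-up labels (not from the paper): an imbalanced set where the
# single "hap" utterance is misclassified.
y_true = ["ang", "ang", "ang", "ang", "hap", "sad"]
y_pred = ["ang", "ang", "ang", "ang", "sad", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls: ang 1.0, hap 0.0, sad 1.0
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two metrics diverge: here a single missed minority-class utterance costs far more in UA than in WA, which is why results on datasets such as IEMOCAP are usually reported with both.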

