SDLGAS Jun 8, 2021

Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention

arXiv:2106.04133v1
114 citations
Originality: Incremental advance
AI Analysis

This work addresses emotion recognition from speech, which matters for applications such as human-computer interaction. It appears incremental, since it builds on existing deep learning methods with specific architectural changes.

The paper tackles speech emotion recognition by proposing a multi-scale CNN and attention architecture that exploits both acoustic and lexical information. On the IEMOCAP dataset it reports improvements over previous state-of-the-art methods of 5.0% in weighted accuracy and 5.2% in unweighted accuracy under the ASR setting.
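The two metrics differ in how they treat class imbalance. The sketch below assumes the convention commonly used in speech emotion recognition, where weighted accuracy is overall accuracy across all utterances and unweighted accuracy is the mean of per-class recalls; the page itself does not define them. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def _check(y_true, y_pred):
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equal in length")


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    _check(y_true, y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    _check(y_true, y_pred)
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels with a dominant "neutral" class (not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["happy", "sad"]
y_pred = ["neutral"] * 6 + ["angry", "neutral", "neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two numbers can diverge noticeably, which is one reason results are usually reported with both.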

Emotion recognition from speech is a challenging task. Recent advances in deep learning have established the bi-directional recurrent neural network (Bi-RNN) with an attention mechanism as a standard method for speech emotion recognition: multi-modal features (audio and text) are extracted and attended to, then fused for downstream emotion classification. In this paper, we propose a simple yet efficient neural network architecture that exploits both acoustic and lexical information from speech. The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations. Then, a statistical pooling unit (SPU) is used to further extract the features in each modality. In addition, an attention module can be built on top of the MSCNN-SPU (audio) and MSCNN (text) to further improve performance. Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset with four emotion categories (i.e., angry, happy, sad and neutral) in both weighted accuracy (WA) and unweighted accuracy (UA), with improvements of 5.0% and 5.2% respectively under the ASR setting.
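As a rough illustration of the idea of multi-scale convolution followed by statistical pooling, here is a toy sketch in plain Python. It is not the authors' implementation. It assumes one scalar feature per frame, one fixed filter per kernel width, and a pooling unit that keeps the mean, maximum and standard deviation over time; the abstract does not say which statistics the SPU uses, so that choice, the filter values and all function names are hypothetical.

```python
import math


def conv1d_relu(frames, kernel):
    """Valid 1-D cross-correlation over time, followed by ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(w * x for w, x in zip(kernel, frames[i:i + k])))
        for i in range(len(frames) - k + 1)
    ]


def statistical_pooling(seq):
    """Collapse a variable-length sequence into [mean, max, std] (assumed statistics)."""
    mean = sum(seq) / len(seq)
    var = sum((x - mean) ** 2 for x in seq) / len(seq)
    return [mean, max(seq), math.sqrt(var)]


def mscnn_spu(frames, kernels):
    """Apply one filter per scale and concatenate the pooled statistics."""
    pooled = []
    for kernel in kernels:
        pooled.extend(statistical_pooling(conv1d_relu(frames, kernel)))
    return pooled


# Toy input: six frames of a single made-up acoustic feature.
frames = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]
# Hypothetical fixed filters of width 1, 2 and 3; a real model would learn these.
kernels = [[1.0], [0.5, 0.5], [0.25, 0.5, 0.25]]

features = mscnn_spu(frames, kernels)
print(len(features), features)
```

Because pooling removes the time axis, the output length depends only on the number of scales and statistics, not on the utterance length, which is what lets a fixed-size classifier sit on top of variable-length speech.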

Code Implementations: 1 repo
