AS LG SD

Oct 20, 2019

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Jianyou Wang, Michael Xue, Ryan Culhane, Enmao Diao, Jie Ding, Vahid Tarokh

arXiv:1910.08874v415.2140 citations

h-index: 63

Originality: Incremental advance

AI Analysis

This work addresses emotion recognition for human-machine interfaces, representing an incremental improvement over existing methods.

The paper tackles speech emotion recognition by proposing a dual-level model that uses both MFCC features and mel-spectrograms. It reaches a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%, a 6% improvement over state-of-the-art unimodal models.
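The two reported metrics can be illustrated with a small sketch. It assumes the convention common in speech emotion recognition work, where weighted accuracy is the overall fraction of correctly classified utterances and unweighted accuracy is the mean of per-class recall; the summary above does not spell these definitions out. The function names and toy labels are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy labels (hypothetical, not data from the paper): an imbalanced set
# where the majority class is always right and one minority class is missed.
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "happy", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls 1, 1, 0 averaged
print(f"WA={wa:.3f}  UA={ua:.3f}")
```

On imbalanced data the two numbers diverge, which is one reason both tend to be reported together.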

Speech Emotion Recognition (SER) has emerged as a critical component of next-generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.
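The final averaging step described in the abstract might look roughly like the sketch below. It assumes each branch (the MFCC LSTM and the DS-LSTM) emits one score per emotion class and that the scores are class probabilities; the abstract does not say whether probabilities or raw logits are averaged. The label set, variable names, and numbers are hypothetical placeholders, and the LSTM branches themselves are not modelled here.

```python
def average_outputs(branch_a, branch_b):
    """Element-wise mean of two per-class score vectors of equal length."""
    if len(branch_a) != len(branch_b):
        raise ValueError("both branches must score the same set of classes")
    return [(a + b) / 2 for a, b in zip(branch_a, branch_b)]


def classify(scores, labels):
    """Return the label whose averaged score is highest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Hypothetical label set and branch outputs (not values from the paper).
LABELS = ["angry", "happy", "neutral", "sad"]
mfcc_lstm_out = [0.10, 0.20, 0.30, 0.40]  # stand-in for the MFCC LSTM branch
ds_lstm_out = [0.50, 0.20, 0.20, 0.10]    # stand-in for the DS-LSTM branch

fused = average_outputs(mfcc_lstm_out, ds_lstm_out)
prediction = classify(fused, LABELS)
print(prediction)
```

In this toy case the two branches disagree on their top class when taken alone, and the average settles on the class with the strongest combined support.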
