AS, LG, SD · Oct 20, 2019

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

arXiv:1910.08874v4 · 137 citations
Originality: Incremental advance
AI Analysis

This work addresses emotion recognition for human-machine interfaces, representing an incremental improvement over existing methods.

The paper tackles speech emotion recognition with a dual-level model that combines MFCC features and mel-spectrograms. It reports a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%, a 6% improvement over state-of-the-art unimodal models.
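The two reported metrics can be illustrated with a minimal sketch. It assumes the convention common in speech emotion recognition, where weighted accuracy is the overall fraction of correctly classified utterances and unweighted accuracy is the mean of the per-class recalls; the summary itself does not define them. The function names and toy labels are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall fraction of utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correct predictions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy labels, made up for illustration only.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 3 of 4 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 2/3 and 1/1
```

On imbalanced data the two numbers diverge, which is why both are usually reported together.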

Speech Emotion Recognition (SER) has emerged as a critical component of the next-generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%, a 6% improvement over current state-of-the-art unimodal models, and is comparable with multimodal models that leverage textual information as well as audio signals.
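The final averaging step described in the abstract can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes each branch (the standard LSTM on MFCC features and the DS-LSTM on the two mel-spectrograms) emits a vector of class probabilities, whereas the abstract does not say whether probabilities or raw scores are averaged. The label set, function names, and example numbers are hypothetical.

```python
# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def average_fusion(p_mfcc, p_mel):
    """Element-wise mean of the two branch outputs (assumed probabilities)."""
    if len(p_mfcc) != len(p_mel):
        raise ValueError("branch outputs must have the same number of classes")
    return [(a + b) / 2 for a, b in zip(p_mfcc, p_mel)]


def predict(p_mfcc, p_mel):
    """Return the label with the highest averaged score, plus the scores."""
    fused = average_fusion(p_mfcc, p_mel)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Made-up branch outputs for one utterance (not results from the paper).
p_mfcc = [0.10, 0.50, 0.30, 0.10]  # stand-in for the LSTM branch on MFCCs
p_mel = [0.05, 0.25, 0.60, 0.10]   # stand-in for the DS-LSTM branch

label, fused = predict(p_mfcc, p_mel)
```

In this toy case the MFCC branch alone would favor "happy", but the averaged scores favor "neutral", showing how the second branch can change the final decision.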

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
