AS, AI, CL, SD | Aug 26, 2025

Amplifying Emotional Signals: Data-Efficient Deep Learning for Robust Speech Emotion Recognition

arXiv:2509.00077v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in SER for human-computer interaction, but it is incremental as it applies existing methods like transfer learning and data augmentation to a known bottleneck.

The paper tackles the challenge of achieving high performance in Speech Emotion Recognition (SER) with limited data by developing and comparing SVM, LSTM, and CNN models. It shows that, with transfer learning and data augmentation, a ResNet34 model reaches 66.7% accuracy and an F1 score of 0.631 on the combined RAVDESS and SAVEE datasets.
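The summary reports accuracy and an F1 score side by side, and the two can diverge when some emotion classes are recognised better than others. The sketch below shows how each is computed on a toy set of labels. It assumes a macro-averaged F1 (the summary does not say which averaging the paper uses), and the labels and predictions are invented for illustration rather than taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, not data from the paper
y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "happy", "angry", "sad", "happy"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case four of six predictions are correct, while the per-class F1 scores differ (the "happy" class is confused most often), so the macro F1 lands slightly below the accuracy.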

Speech Emotion Recognition (SER) presents a significant yet persistent challenge in human-computer interaction. While deep learning has advanced spoken language processing, achieving high performance on limited datasets remains a critical hurdle. This paper confronts this issue by developing and evaluating a suite of machine learning models, including Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs), for automated emotion classification in human speech. We demonstrate that by strategically employing transfer learning and innovative data augmentation techniques, our models can achieve impressive performance despite the constraints of a relatively small dataset. Our most effective model, a ResNet34 architecture, establishes a new performance benchmark on the combined RAVDESS and SAVEE datasets, attaining an accuracy of 66.7% and an F1 score of 0.631. These results underscore the substantial benefits of leveraging pre-trained models and data augmentation to overcome data scarcity, thereby paving the way for more robust and generalizable SER systems.
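The abstract credits data augmentation for part of the gain but does not name the specific techniques. As a rough illustration of the general idea only, the sketch below applies two common waveform augmentations, a random circular time shift and additive Gaussian noise, to a list of audio samples. The function names, parameters, and the choice of these two transforms are assumptions made for this example, not the authors' method.

```python
import random


def time_shift(samples, max_shift, rng):
    """Circularly shift the waveform by a random offset in [-max_shift, max_shift]."""
    if not samples:
        return []
    shift = rng.randint(-max_shift, max_shift) % len(samples)
    if shift == 0:
        return list(samples)
    return samples[-shift:] + samples[:-shift]


def add_noise(samples, noise_std, rng):
    """Add zero-mean Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def augment(samples, n_copies=2, noise_std=0.005, seed=0):
    """Return n_copies perturbed variants of one utterance (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    max_shift = len(samples) // 10
    return [
        add_noise(time_shift(samples, max_shift, rng), noise_std, rng)
        for _ in range(n_copies)
    ]


# Hypothetical stand-in for a short utterance: 100 samples of a ramp signal
waveform = [i / 100 for i in range(100)]
variants = augment(waveform, n_copies=3)
print(len(variants), len(variants[0]))
```

Each variant keeps the original length and emotion label, so a small corpus can be expanded several times over before feature extraction (for example, before converting audio to spectrograms for a CNN).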

