SD AI HC LG
Jun 14, 2024

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

arXiv:2406.09933v1
10 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of real-world generalization in SER for enhancing human-computer interaction, though it is incremental as it builds on existing methods with new benchmarks and evaluations.

The paper tackles the problem of generalizing speech emotion recognition (SER) models across diverse datasets. It builds a comprehensive benchmark from 11 emotional speech datasets, addresses data imbalance with over-sampling methods, and reports improved evaluation protocols along with insights into using models such as Whisper for SER.

Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements on specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and present a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols to assess how well SER generalizes. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
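
The abstract mentions over-sampling to handle imbalanced label distributions when several SER datasets are pooled for training, but it does not say which over-sampling method is used. As a rough illustration only, the sketch below shows plain random over-sampling: minority emotion classes are duplicated (with replacement) until each class matches the largest one. The function name `random_oversample`, the `(utterance_id, emotion_label)` tuple layout, and the toy data are assumptions made for this example, not details from the paper.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, seed=0):
    """Duplicate minority-class items until every class matches the largest one.

    `samples` is a non-empty list of (utterance_id, emotion_label) pairs.
    This layout is an assumption for illustration, not the paper's format.
    """
    rng = random.Random(seed)

    # Group items by their emotion label.
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    # Every class is grown to the size of the most frequent class.
    target = max(len(items) for items in by_label.values())

    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing items with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical pooled training set from two corpora (ids and labels are made up).
pooled = (
    [(f"a_{i}", "neutral") for i in range(6)]
    + [(f"a_{i}", "angry") for i in range(6, 8)]
    + [(f"b_{i}", "happy") for i in range(3)]
)
balanced = random_oversample(pooled)
counts = Counter(label for _, label in balanced)
```

Over-sampling of this kind would normally be applied to the training portion only, so that duplicated utterances do not leak into validation or test sets.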

