CL · AI · Aug 14, 2024

SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

arXiv:2408.07851v1 · 7 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the limited generalization of SER models, a problem that matters to both researchers and practitioners. It is an incremental contribution, since it builds on existing benchmarks and methods.

The authors tackled the challenge of generalizing speech emotion recognition (SER) models across languages and emotional expressions by creating a large-scale benchmark for in-domain and out-of-domain evaluation, finding that the Whisper model outperforms dedicated self-supervised learning models in cross-lingual SER.
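A minimal sketch of how in-domain and out-of-domain scores can be separated, assuming (this is an assumption, not a description of the paper's exact protocol) that each model is trained on one dataset and tested on every dataset, giving a square train-by-test score matrix. Diagonal cells are in-domain and off-diagonal cells are out-of-domain. The function name `split_in_out_domain` and the toy numbers are hypothetical and purely illustrative, not results from the paper.

```python
import numpy as np


def split_in_out_domain(scores: np.ndarray) -> tuple[float, float]:
    """Average a train-by-test score matrix into in-domain and out-of-domain means.

    Hypothetical helper: scores[i, j] is the score of a model trained on
    dataset i and evaluated on dataset j.
    """
    if scores.ndim != 2 or scores.shape[0] != scores.shape[1]:
        raise ValueError("expected a square train-by-test matrix")
    n = scores.shape[0]
    if n < 2:
        raise ValueError("need at least two datasets for out-of-domain cells")
    # Diagonal: train and test come from the same corpus (in-domain).
    in_domain = float(np.trace(scores) / n)
    # Off-diagonal: test corpus unseen during training (out-of-domain).
    off_diag = ~np.eye(n, dtype=bool)
    out_domain = float(scores[off_diag].mean())
    return in_domain, out_domain


# Illustrative numbers only, not results from the paper.
toy = np.array([
    [0.70, 0.40, 0.35],
    [0.45, 0.65, 0.30],
    [0.38, 0.42, 0.60],
])
print(split_in_out_domain(toy))
```

Reporting the two averages separately makes a generalization gap visible: a model can score well on the diagonal while dropping sharply on unseen corpora.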

Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
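The abstract does not say which form of logit adjustment is used. A minimal sketch of the common post-hoc variant, assumed here for illustration, subtracts a scaled log class prior from each logit so that frequent classes no longer dominate the prediction. The function name `logit_adjust` and the temperature `tau` are hypothetical. A training-time variant instead adds `tau * log(prior)` to the logits inside the cross-entropy loss, and the paper may use that one.

```python
import numpy as np


def logit_adjust(logits, class_counts, tau: float = 1.0) -> np.ndarray:
    """Post-hoc logit adjustment (assumed variant): logits - tau * log(prior).

    class_counts are per-class training-set frequencies; the prior is their
    normalized value. tau scales how strongly the prior is removed.
    """
    logits = np.asarray(logits, dtype=float)
    counts = np.asarray(class_counts, dtype=float)
    prior = counts / counts.sum()
    return logits - tau * np.log(prior)


# Toy example: class 0 is 9x more frequent in training than class 1.
logits = np.array([2.0, 1.8])
counts = np.array([900, 100])
print(np.argmax(logits), np.argmax(logit_adjust(logits, counts)))  # prints: 0 1
```

Because the rare class receives a larger boost (`-log(0.1)` vs `-log(0.9)`), a near-tie in raw logits flips toward it, which is the intended correction when class distributions differ across corpora.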

Code Implementations: 1 repo
