CL · SD · AS · Aug 17, 2023

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition

arXiv:2308.08713v1 · 6 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a gap in multilingual speech emotion recognition benchmarking. The advance is incremental, but it matters for improving emotion detection in diverse linguistic contexts.

The study evaluates transformer-based speech models for speech emotion recognition across multiple languages. It finds that using features from a single optimal layer, rather than from all layers, reduces the error rate by 32% on average, and it reports state-of-the-art results for German and Persian.

Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, there has been limited research on evaluating these models for speech emotion recognition (SER) across multiple languages and on examining their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into the inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets when compared to systems where features from all layers of speech models are used. We also achieve state-of-the-art results for German and Persian. Our probing results indicate that the middle layers of speech models capture the most important emotional information for speech emotion recognition.
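The single-layer comparison in the abstract can be illustrated with a small sketch. It assumes that the "optimal" layer is the one with the lowest validation error and that the reported 32% is a relative reduction in error rate; the abstract does not state either point explicitly. The function names and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def best_layer(val_error_by_layer: Sequence[float]) -> int:
    """Return the index of the layer with the lowest validation error."""
    if not val_error_by_layer:
        raise ValueError("need at least one layer")
    return min(range(len(val_error_by_layer)), key=lambda i: val_error_by_layer[i])


def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that is removed (0.25 means 25% fewer errors)."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical validation error rates, one per layer of some speech model.
# These values are made up for illustration only.
layer_errors = [0.40, 0.31, 0.24, 0.27, 0.35]

# Hypothetical error rate of a system that uses features from all layers.
all_layers_error = 0.32

idx = best_layer(layer_errors)
reduction = relative_error_reduction(all_layers_error, layer_errors[idx])
print(f"best layer: {idx}, relative error reduction: {reduction:.0%}")
```

In this made-up example the lowest error falls on a middle layer, which mirrors the probing result reported in the abstract, but the pattern here is built into the illustrative numbers rather than measured.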

Code Implementations: 1 repo