AS, CL, SD | Jun 9, 2023

Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech

Berkeley
arXiv:2306.05709v1 | 5 citations | h-index: 27
Originality: Incremental advance
AI Analysis

The paper addresses dataset imbalance, a practical problem for researchers and practitioners in speech processing. The advance is incremental because it builds on existing augmentation and representation learning methods.

The paper tackles the problem of imbalanced speech datasets in speech emotion recognition and emotional text-to-speech by proposing an Emotion Extractor that uses augmentation to extract robust emotional representations, achieving state-of-the-art results on three imbalanced datasets and improving expressive speech synthesis.

Effective speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks. However, emotional speech samples are more difficult and expensive to acquire compared with Neutral style speech, which causes one issue that most related works unfortunately neglect: imbalanced datasets. Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations. In this paper, we propose an Emotion Extractor to address this issue. We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets. Our empirical results show that (1) for the SER task, the proposed Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced datasets; (2) the produced representations from our Emotion Extractor benefit the TTS model, and enable it to synthesize more expressive speech.
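The abstract does not say which augmentation approaches are used, so the following is only a minimal sketch of one generic way to counter a dominant Neutral class: class-balanced sampling combined with mixup-style interpolation of feature vectors. It is not the authors' Emotion Extractor. The function names, the Beta(0.4, 0.4) mixing coefficient, and the toy feature vectors are all assumptions made for illustration.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (size of its class).

    Every class then carries the same total sampling weight,
    regardless of how many samples it has.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def mixup(features_a, features_b, lam):
    """Linearly interpolate two equal-length feature vectors."""
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(features_a, features_b)]


def balanced_mixup_batch(features, labels, batch_size, rng):
    """Draw class-balanced sample pairs and mix each pair.

    Returns a list of (mixed_features, (label_a, label_b, lam)) tuples.
    The Beta(0.4, 0.4) draw for lam is an assumed, illustrative choice.
    """
    weights = inverse_frequency_weights(labels)
    indices = list(range(len(labels)))
    batch = []
    for _ in range(batch_size):
        i, j = rng.choices(indices, weights=weights, k=2)
        lam = rng.betavariate(0.4, 0.4)
        mixed = mixup(features[i], features[j], lam)
        batch.append((mixed, (labels[i], labels[j], lam)))
    return batch


if __name__ == "__main__":
    # Toy imbalanced set: 8 Neutral samples, 2 Angry samples.
    toy_labels = ["neutral"] * 8 + ["angry"] * 2
    toy_features = [[float(i), float(i) * 2.0] for i in range(10)]
    for mixed, target in balanced_mixup_batch(
        toy_features, toy_labels, batch_size=3, rng=random.Random(0)
    ):
        print(mixed, target)
```

With inverse-frequency weights, the two Angry samples are drawn about as often in total as the eight Neutral ones, so the minority class is no longer crowded out of each batch. The soft targets (label_a, label_b, lam) would be consumed by a loss that interpolates between the two labels.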
