ASSD Oct 26, 2020

Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

arXiv:2010.13350v2, 72 citations
Originality: Incremental advance
AI Analysis

This work addresses the shortage of emotion-labeled data for emotional TTS, which makes emotional synthesis more accessible for applications such as virtual assistants. It is an incremental advance because it builds on existing TTS and SER methods.

The paper tackles emotional text-to-speech synthesis without requiring an emotion-labeled TTS dataset. It uses a cross-domain speech emotion recognition model to predict emotion labels for the TTS data, then trains an auxiliary SER task on those predicted labels jointly with the TTS model. The generated speech carries the specified emotional expressiveness with minimal impact on quality.

Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both SER and TTS datasets. Then, we use the trained SER model to predict emotion labels for the TTS dataset, build an auxiliary SER task from these predicted labels, and train it jointly with the TTS model. Experimental results show that our proposed method can generate speech with the specified emotional expressiveness while causing almost no degradation in speech quality.
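The two-stage idea in the abstract (predict emotion labels for the unlabeled TTS data, then add an auxiliary SER loss to the TTS objective) can be illustrated with a toy sketch. This is not the paper's implementation. The nearest-centroid classifier is a deliberately simple stand-in for the cross-domain SER model and is fitted on labeled data only, the feature vectors and the TTS loss value are made up, and the weight `alpha` and all function names are hypothetical.

```python
import math

def train_centroid_ser(features, labels):
    """Stand-in SER 'model': one mean feature vector per emotion label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_emotion(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(centroids[y], x))
    return min(centroids, key=dist)

def cross_entropy(probs, target):
    """Negative log-likelihood of the target label under a probability dict."""
    return -math.log(probs[target])

def joint_loss(tts_loss, ser_probs, pseudo_label, alpha=1.0):
    """TTS loss plus a weighted auxiliary SER loss on the predicted label.

    alpha is an assumed weighting; the abstract does not state one.
    """
    return tts_loss + alpha * cross_entropy(ser_probs, pseudo_label)

# Step 1: fit the stand-in SER model on a small labeled (SER) set.
ser_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
ser_labels = ["happy", "happy", "sad", "sad"]
centroids = train_centroid_ser(ser_features, ser_labels)

# Step 2: predict emotion labels for the unlabeled TTS utterances.
tts_features = [[0.85, 0.15], [0.15, 0.85]]
pseudo_labels = [predict_emotion(centroids, x) for x in tts_features]

# Step 3: combine a (made-up) TTS loss with the auxiliary SER loss for one
# utterance, using made-up SER output probabilities from the TTS side.
loss = joint_loss(
    tts_loss=0.5,
    ser_probs={"happy": 0.5, "sad": 0.5},
    pseudo_label=pseudo_labels[0],
    alpha=1.0,
)
```

In a real system both loss terms would come from neural networks and be minimized together by gradient descent; the sketch only shows how the predicted labels feed the auxiliary term.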

