SD | AI | AS | Mar 2, 2023

Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities

Berkeley
arXiv:2303.01508v2 | 13 citations | h-index: 27
AI Analysis

This work addresses the need for more expressive and controllable emotional speech synthesis in TTS applications, representing an incremental improvement over existing methods.

The paper tackles fine-grained emotional control in Text-To-Speech (TTS) models, which often produce emotionally neutral speech. It proposes a model that considers both inter- and intra-class emotion intensities so that the synthesized speech has recognizable intensity differences, and it reports better controllability, emotion expressiveness, and naturalness than two state-of-the-art models.

State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas very often one would want fine-grained emotional control of words or phonemes. Although this remains challenging, the first TTS models that can control the voice by manually assigning emotion intensity have recently been proposed. Unfortunately, because they neglect intra-class distance, the intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model exceeds two state-of-the-art controllable TTS models in controllability, emotion expressiveness, and naturalness.
