CL · Apr 3, 2021

Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability

arXiv:2104.01408v2 · 42 citations
Originality: Incremental advance
AI Analysis

This paper addresses unclear emotional expression in synthesized speech, a problem for applications such as human-computer interaction. The contribution appears incremental because it builds on existing ETTS methods.

The paper tackles poor emotion discriminability in emotional text-to-speech synthesis by proposing an interactive training paradigm with reinforcement learning. The resulting system is reported to outperform state-of-the-art baselines at rendering the intended emotion style.

Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable by its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve the emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms the state-of-the-art baselines by rendering speech with more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
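The idea of improving emotion discriminability by interacting with an SER model can be illustrated with a toy sketch. The abstract does not specify the reward or the learning algorithm, so everything below is an assumption made for illustration: a REINFORCE-style update, a reward of 1 when a fixed recognizer hears the intended emotion, a hypothetical four-label emotion set, and a stand-in `toy_ser` function in place of a real SER network. It is not the authors' implementation and synthesizes no audio.

```python
import math
import random

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_ser(style_id):
    """Stand-in for a frozen SER model (assumption): maps a discrete
    'speaking style' to the emotion index a listener model would predict."""
    return (style_id + 1) % len(EMOTIONS)


def train(steps=2000, lr=0.5, seed=0):
    """Learn, per target emotion, a distribution over styles using
    REINFORCE with the recognizer's verdict as the reward."""
    rng = random.Random(seed)
    n = len(EMOTIONS)
    # logits[target][style]: policy over styles given the target emotion
    logits = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        target = rng.randrange(n)
        probs = softmax(logits[target])
        style = rng.choices(range(n), weights=probs)[0]
        # Reward is 1 only when the recognizer hears the intended emotion.
        reward = 1.0 if toy_ser(style) == target else 0.0
        # Gradient of log pi(style) w.r.t. logits is onehot(style) - probs.
        for k in range(n):
            grad = (1.0 if k == style else 0.0) - probs[k]
            logits[target][k] += lr * reward * grad
    return logits


def recognition_accuracy(logits):
    """Expected rate at which toy_ser recovers the target emotion."""
    n = len(logits)
    total = 0.0
    for target in range(n):
        probs = softmax(logits[target])
        total += sum(p for s, p in enumerate(probs) if toy_ser(s) == target)
    return total / n
```

Calling `recognition_accuracy(train())` and comparing it with the accuracy of an all-zero policy shows the effect of the reward signal in this toy setting: the untrained policy is recognized at chance level, and training pushes probability mass toward the style the recognizer associates with each target emotion.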

