CL · SD · Nov 1, 2017

Uncovering Latent Style Factors for Expressive Speech Synthesis

arXiv:1711.00520v1 · 53 citations
Originality: Incremental advance
AI Analysis

This paper addresses expressive speech synthesis for users who need controllable prosody. The advance is incremental because it builds on an existing model, Tacotron.

The paper tackles the challenge of generating desirable prosody from text in speech synthesis. It introduces style tokens in Tacotron to extract independent prosodic styles from training data without annotations, which gives somewhat predictable and globally consistent control over prosodic style.

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
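The claim that each style token corresponds to a fixed style factor regardless of the text can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `make_token_bank`, `style_embedding`, and `condition_on_style` are hypothetical, the token values are random rather than learned, and concatenation is only one possible way to inject a style vector. The abstract does not specify how the tokens are wired into Tacotron.

```python
import random


def make_token_bank(num_tokens, dim, seed=0):
    """Hypothetical bank of style-token embeddings.

    Values are random here; in a trained model they would be learned.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_tokens)]


def style_embedding(token_bank, weights):
    """Weighted sum of token embeddings.

    A one-hot weight vector selects exactly one token.
    """
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank))
            for d in range(dim)]


def condition_on_style(text_encodings, style):
    """Append the same style vector to every text position (assumed scheme)."""
    return [list(frame) + list(style) for frame in text_encodings]


bank = make_token_bank(num_tokens=4, dim=3)
one_hot = [0.0, 1.0, 0.0, 0.0]           # pick the second token
style = style_embedding(bank, one_hot)

# Two stand-in "text encodings" of different lengths.
short_text = [[0.1, 0.2], [0.3, 0.4]]
long_text = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

short_cond = condition_on_style(short_text, style)
long_cond = condition_on_style(long_text, style)
```

In this sketch the style vector depends only on the chosen token weights, so both inputs receive an identical style component at every position. That is the sense in which control is global and independent of the text.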

