AS · CL · SD · Nov 4, 2020

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

arXiv:2011.02252v1 · 21 citations
AI Analysis

This work addresses the challenge of producing natural-sounding speech with appropriate prosody for text-to-speech systems, representing an incremental advancement in the field.

The paper addresses contextually appropriate prosody in neural text-to-speech. It introduces Kathaka, a model trained in two stages: the first learns a sentence-level prosodic distribution, and the second samples from that distribution using contextual information from the text. The authors report a statistically significant 13.2% relative improvement in naturalness over a strong baseline.

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
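
The two-stage idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the Gaussian latent, the mean pooling over frames, the linear projections, the least-squares fit, and all names and sizes (`encode_prosody`, `sample_latent`, `fit_text_sampler`, `N_MELS`, `LATENT_DIM`, `TEXT_DIM`) are assumptions made for illustration. Random vectors stand in for the BERT and graph-attention features that the paper derives from text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, LATENT_DIM, TEXT_DIM = 80, 16, 32  # hypothetical sizes

# Stage I (assumed form): a sentence-level encoder that maps a
# mel-spectrogram to the parameters of a Gaussian prosody latent.
# The weights are random here; a real model would learn them.
W_mu = rng.normal(scale=0.1, size=(N_MELS, LATENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(N_MELS, LATENT_DIM))

def encode_prosody(mel):
    """Map a (frames, N_MELS) mel-spectrogram to (mu, logvar)."""
    pooled = mel.mean(axis=0)  # collapse time: one vector per sentence
    return pooled @ W_mu, pooled @ W_logvar

def sample_latent(mu, logvar, rng):
    """Reparameterised draw z = mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Stage II (assumed form): predict the Stage I latent from text
# features alone, so no mel-spectrogram is needed at inference.
def fit_text_sampler(text_feats, target_mu):
    """Least-squares linear map from text features to latent means."""
    W, *_ = np.linalg.lstsq(text_feats, target_mu, rcond=None)
    return W

# Toy training set: random "mel-spectrograms" and random "text features".
mels = [rng.normal(size=(int(rng.integers(50, 120)), N_MELS)) for _ in range(64)]
text_feats = rng.normal(size=(64, TEXT_DIM))
target_mu = np.stack([encode_prosody(m)[0] for m in mels])
W_text = fit_text_sampler(text_feats, target_mu)

# Inference: a prosody latent obtained from text features only.
z_from_text = text_feats[0] @ W_text
```

The point of the sketch is the split of responsibilities: Stage I needs audio to define the latent space, while Stage II only needs text to pick a point in it. With random inputs the fitted map is meaningless; it shows the data flow only.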

