SD · CL · AS · Jan 3, 2019

Feature reinforcement with word embedding and parsing information in neural TTS

arXiv:1901.00707v2 · 15 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of synthesizing high-quality speech for unseen text in TTS systems, representing an incremental improvement through multi-level feature integration.

The paper aims to improve generalization in neural text-to-speech synthesis. It proposes a feature reinforcement method that feeds phoneme sequences, pre-trained word embeddings, and parser-derived grammatical structure to the model as input features. The authors report significantly improved robustness and near recording-quality speech for out-of-domain text.

In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework. The proposed method uses a multiple-input encoder to take three levels of text information as input features for the neural TTS system: the phoneme sequence, pre-trained word embeddings, and the grammatical structure of sentences from a parser. The added word- and sentence-level information can be viewed as a feature-based pre-training strategy, which clearly enhances the model's generalization ability. In our experiments on out-of-domain text, the proposed method not only improves system robustness significantly but also brings the synthesized speech to near recording quality.
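To make the idea of three input levels concrete, here is a minimal sketch of how phoneme-level, word-level, and sentence-structure features could be aligned at phoneme resolution. It is an illustration under stated assumptions, not the architecture from the paper. The abstract does not say how the multiple-input encoder combines its inputs, so plain concatenation is assumed here. The function name `build_encoder_input`, the toy vectors, and the one-hot "phrase tag" encoding of parser output are all hypothetical.

```python
# Illustrative sketch only: combines three levels of text features per phoneme.
# Concatenation is an assumption; the paper's encoder may merge inputs differently.

def build_encoder_input(phoneme_ids, word_index_per_phoneme,
                        phoneme_table, word_vectors, syntax_vectors):
    """Return one feature vector per phoneme: phoneme + word + syntax features."""
    if len(phoneme_ids) != len(word_index_per_phoneme):
        raise ValueError("each phoneme needs the index of the word it belongs to")
    features = []
    for phoneme_id, word_idx in zip(phoneme_ids, word_index_per_phoneme):
        # Word-level and syntax-level vectors are repeated for every phoneme
        # of the word, so all three levels share the phoneme time axis.
        features.append(phoneme_table[phoneme_id]
                        + word_vectors[word_idx]
                        + syntax_vectors[word_idx])
    return features


# Toy example: "the cat" as 2 words and 5 phonemes (all values are made up).
phoneme_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
word_vectors = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]   # stand-in for pre-trained embeddings
syntax_vectors = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical one-hot phrase tags
phoneme_ids = [0, 1, 2, 3, 4]
word_index_per_phoneme = [0, 0, 1, 1, 1]

features = build_encoder_input(phoneme_ids, word_index_per_phoneme,
                               phoneme_table, word_vectors, syntax_vectors)
print(len(features), len(features[0]))  # 5 7
```

Repeating the word-level and syntax-level vectors across the phonemes of each word keeps a single time axis for a sequence-to-sequence encoder. A real system would use learned embeddings and might instead run separate encoders per input level.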
