PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
This work addresses the challenge of generating high-quality, natural-sounding speech in neural TTS systems. It is an incremental contribution that builds on existing BERT models by taking two representations of the same text, phonemes and graphemes, as input.
The paper tackles the problem of improving prosody and pronunciation accuracy in neural text-to-speech (TTS) by introducing PnG BERT, an encoder that takes phoneme and grapheme inputs together with the word-level alignment between them. The encoder is pre-trained on a large text corpus in a self-supervised manner and then fine-tuned for TTS. Experimental results show that this model yields more natural prosody and more accurate pronunciation than a phoneme-only baseline with no pre-training. In subjective side-by-side evaluations, raters showed no statistically significant preference between the synthesized speech and ground truth recordings from professional speakers.
This paper introduces PnG BERT, a new encoder model for neural TTS. This model extends the original BERT model by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned on a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using PnG BERT and ground truth recordings from professional speakers.
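To make the input format more concrete, the following is a minimal Python sketch of how a phoneme sequence, a grapheme sequence, and their word-level alignment might be packed into a single BERT-style input. The abstract does not specify the exact layout, so everything here is an assumption for illustration: the function name `build_png_input`, the `[CLS]`/`[SEP]` layout, the choice of putting phonemes in segment 0 and graphemes in segment 1, and the use of a shared per-word index to express the alignment.

```python
from typing import List, Tuple

# One word as (graphemes, phonemes). Hypothetical structure for illustration.
Word = Tuple[List[str], List[str]]


def build_png_input(words: List[Word]) -> Tuple[List[str], List[int], List[int]]:
    """Pack phonemes and graphemes into one sequence with alignment ids.

    Returns (tokens, segment_ids, word_ids), all of the same length.
    Assumed layout: [CLS] phonemes [SEP] graphemes [SEP].
    Special tokens get word id 0; real words are numbered from 1.
    """
    tokens = ["[CLS]"]
    segment_ids = [0]
    word_ids = [0]

    # Segment 0 holds phonemes (tuple index 1), segment 1 holds graphemes
    # (tuple index 0). This ordering is an assumption, not from the abstract.
    for segment, field in ((0, 1), (1, 0)):
        for word_index, word in enumerate(words, start=1):
            for symbol in word[field]:
                tokens.append(symbol)
                segment_ids.append(segment)
                # The same word index in both segments encodes the
                # word-level alignment between phonemes and graphemes.
                word_ids.append(word_index)
        tokens.append("[SEP]")
        segment_ids.append(segment)
        word_ids.append(0)

    return tokens, segment_ids, word_ids


example = [
    (["c", "a", "t"], ["K", "AE", "T"]),
    (["s", "a", "t"], ["S", "AE", "T"]),
]
tokens, segment_ids, word_ids = build_png_input(example)
```

In this sketch, a phoneme and a grapheme belong to the same word exactly when they share a `word_ids` value, so a model could embed that index alongside the token and segment embeddings. How the actual model consumes the alignment is not stated in the abstract.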