CL SD AS

Jan 20, 2023

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

arXiv:2301.08810v16.337 citations

h-index: 45

Originality: Incremental advance

AI Analysis

This work aims to improve prosody in TTS systems so that synthesized speech sounds more natural. The advance appears incremental, since it builds on existing BERT and TTS frameworks.

The paper tackles the inefficiency of existing word-level or sup-phoneme-level language models in text-to-speech (TTS) by proposing a phoneme-level BERT with grapheme predictions. In subjective evaluations, it significantly improves mean opinion scores for naturalness on out-of-distribution texts over the state-of-the-art StyleTTS baseline.

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
