CL · SD · AS · May 31, 2023

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech

arXiv:2305.19709v1 · 27 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of building efficient multilingual TTS systems. Its contribution is incremental, since it adapts the existing BERT architecture and RoBERTa pre-training method to phoneme-level data.

The authors tackled the lack of multilingual phoneme representations for text-to-speech by introducing XPhoneBERT, a pre-trained model that improved naturalness and prosody in TTS systems and enabled high-quality speech with limited data.

We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. Our XPhoneBERT has the same model architecture as BERT-base, trained using the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly boosts the performance of a strong neural TTS model in terms of naturalness and prosody and also helps produce fairly high-quality speech with limited training data. We publicly release our pre-trained XPhoneBERT with the hope that it would facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT
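The abstract says the model is trained with the RoBERTa pre-training approach on phoneme-level sentences. As a rough illustration of what that objective looks like on phoneme tokens, here is a minimal sketch of masked-language-model input corruption with dynamic masking. The 15% selection rate and the 80/10/10 split are the standard BERT/RoBERTa recipe, not figures stated on this page, and the toy phoneme symbols, the `|` word-boundary marker, and the `dynamic_mask` helper are hypothetical stand-ins rather than XPhoneBERT's actual tokenizer or training code.

```python
import random

MASK = "<mask>"


def dynamic_mask(phonemes, vocab, rng, select_prob=0.15):
    """Corrupt one phoneme sequence for masked-LM training.

    Returns (masked_input, labels). labels[i] is the original phoneme at
    positions chosen for prediction and None everywhere else.
    """
    masked = list(phonemes)
    labels = [None] * len(phonemes)
    for i, ph in enumerate(phonemes):
        if rng.random() >= select_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = ph
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random phoneme
        # remaining 10%: keep the original phoneme unchanged
    return masked, labels


# Hypothetical toy phoneme sequence; "|" marks a word boundary here.
sentence = ["DH", "IH", "S", "|", "IH", "Z", "|", "AH", "|", "T", "EH", "S", "T"]
vocab = sorted(set(sentence))
rng = random.Random(0)

# Dynamic masking: a fresh mask is drawn every time the sentence is visited,
# instead of fixing one mask per sentence during preprocessing.
for epoch in range(2):
    masked, labels = dynamic_mask(sentence, vocab, rng)
    print(epoch, " ".join(masked))
```

The encoder is then trained to recover `labels` from `masked`. In a real setup this step would normally be handled by a library data collator rather than hand-written code.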

Code Implementations: 2 repos