Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling
This work addresses speech synthesis for applications that require expressive, high-quality voice generation. It is an incremental advance that contributes specific technical improvements to adversarially trained text-to-speech.
The paper develops an end-to-end text-to-speech system trained with generative adversarial training. It reports improved audio quality from raw phoneme-to-audio conversion with explicit prosody modeling, and it introduces a new method for character voice matching based on discrete style tokens.
We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch, and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings, and we introduce a new method for highly expressive character voice matching based on discrete style tokens.
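To make "explicit pitch and duration modeling" concrete, the following is a minimal sketch of one common way such conditioning is prepared: each phoneme is repeated for its duration in frames, so that the vocoder receives one (phoneme, pitch) pair per frame. This is an illustration under stated assumptions, not the implementation used in this work. The function name `expand_phonemes`, the use of integer frame counts, and the use of a single pitch value per phoneme are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def expand_phonemes(
    phonemes: Sequence[str],
    durations: Sequence[int],
    pitches: Sequence[float],
) -> List[Tuple[str, float]]:
    """Expand phoneme-level features to frame level (hypothetical helper).

    Assumes one integer duration (in frames) and one pitch value (in Hz)
    per phoneme; a real system may predict a pitch contour instead.
    """
    if not (len(phonemes) == len(durations) == len(pitches)):
        raise ValueError("phonemes, durations and pitches must have equal length")

    frames: List[Tuple[str, float]] = []
    for phoneme, n_frames, f0 in zip(phonemes, durations, pitches):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Repeat the phoneme and its pitch once per frame it occupies.
        frames.extend([(phoneme, f0)] * n_frames)
    return frames


if __name__ == "__main__":
    # Two phonemes lasting 2 frames and 1 frame respectively.
    print(expand_phonemes(["h", "ə"], [2, 1], [120.0, 110.0]))
```

In a sketch like this, the frame-level sequence would serve as the conditioning input to the generator, while the durations and pitch values come from separate predictors or from ground-truth alignments during training.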