LG | Oct 14, 2023

Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling

Tiberiu Boros, Stefan Daniel Dumitrescu, Ionut Mironica, Radu Chivereanu

arXiv:2310.09636v12.02 citations | h-index: 12 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses speech synthesis for applications that need expressive, high-quality voice generation. It is an incremental advance built on specific technical improvements.

The paper tackles text-to-speech synthesis with an end-to-end system that uses generative adversarial training. The system achieves improved audio quality through raw phoneme-to-audio conversion with explicit prosody modeling, and it adds a new method for character voice matching based on discrete style tokens.

We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our Vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings, and we introduce a new method for highly expressive character voice matching based on discrete style tokens.

