LG, Oct 14, 2023

Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling

arXiv:2310.09636v1, 2 citations, h-index: 12
Originality: Incremental advance
AI Analysis

This work targets speech synthesis for applications that need expressive, high-quality voice generation. It is an incremental advance built on specific technical improvements.

The paper tackles text-to-speech synthesis by developing an end-to-end system trained adversarially. The system converts raw phonemes directly to audio with explicit prosody modeling, which the analysis credits for improved audio quality, and it introduces a new method for character voice matching based on discrete style tokens.

We describe an end-to-end speech synthesis system that uses generative adversarial training. We train our Vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling. We experiment with several pre-trained models for contextualized and decontextualized word embeddings, and we introduce a new method for highly expressive character voice matching, based on discrete style tokens.
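To make "explicit phonetic, pitch and duration modeling" concrete, here is a minimal sketch of one common way explicit durations are applied in text-to-speech pipelines: each phoneme is repeated for its predicted number of frames, giving a frame-level sequence that a vocoder could be conditioned on. This is an illustration only, not the authors' implementation. The function name `expand_to_frames`, the ARPAbet-style phoneme labels, and the one-pitch-value-per-phoneme simplification are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def expand_to_frames(
    phonemes: Sequence[str],
    durations: Sequence[int],
    pitch_hz: Sequence[float],
) -> List[Tuple[str, float]]:
    """Repeat each (phoneme, pitch) pair for its duration in frames.

    Hypothetical helper: assumes one integer duration and one pitch value
    per phoneme, which is a simplification chosen for this sketch.
    """
    if not (len(phonemes) == len(durations) == len(pitch_hz)):
        raise ValueError("phonemes, durations and pitch_hz must have equal length")

    frames: List[Tuple[str, float]] = []
    for phoneme, n_frames, f0 in zip(phonemes, durations, pitch_hz):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One conditioning entry per output frame of this phoneme.
        frames.extend([(phoneme, f0)] * n_frames)
    return frames


# Example: "HH" lasts 2 frames (unvoiced, pitch 0.0), "AY" lasts 3 frames.
frames = expand_to_frames(["HH", "AY"], [2, 3], [0.0, 120.0])
```

In this sketch the total number of frames equals the sum of the durations, so changing a duration lengthens or shortens the corresponding sound without touching the phoneme sequence itself.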

Code Implementations: 1 repo
