AS | LG | Aug 13, 2024

PRESENT: Zero-Shot Text-to-Prosody Control

arXiv:2408.06827v1 | 2 citations | h-index: 25
Originality: Highly original
AI Analysis

This paper addresses zero-shot prosody control in speech synthesis. It offers researchers and practitioners a training-free approach that more than halves character error rates in zero-shot language transfer.

The paper tackles fine-grained prosody control in speech synthesis by introducing PRESENT, a method that modifies the inference process of pretrained models without retraining, achieving over 2x reduction in character error rates for zero-shot language transfer and enabling subphoneme-level control.
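To make "modifying the inference process" concrete, here is a minimal sketch of the general idea in a FastSpeech2-style pipeline: the model predicts a duration and a pitch value per phoneme, and those predictions can be overwritten before the sequence is expanded to frames. This is not the authors' code. The function names (`expand_to_frames`, `pitch_track`), the toy values, and the choice of linear interpolation for subphoneme contours are all assumptions made for illustration.

```python
import numpy as np


def expand_to_frames(phoneme_feats, durations):
    """Length regulation: repeat each phoneme's feature row for its frame count."""
    return np.repeat(phoneme_feats, durations, axis=0)


def pitch_track(durations, contours):
    """Build a frame-level pitch track from per-phoneme control points.

    A contour with one value gives a flat pitch over the phoneme. A contour
    with several values is linearly interpolated across the phoneme's frames,
    which is one possible form of subphoneme-level control (an assumption here).
    """
    frames = []
    for n_frames, contour in zip(durations, contours):
        contour = np.asarray(contour, dtype=float)
        if n_frames == 0:
            continue
        if len(contour) == 1:
            frames.append(np.full(n_frames, contour[0]))
        else:
            src = np.linspace(0.0, 1.0, num=len(contour))
            dst = np.linspace(0.0, 1.0, num=n_frames)
            frames.append(np.interp(dst, src, contour))
    return np.concatenate(frames)


# Stand-ins for what a pretrained model's predictors might output (made-up values).
phoneme_feats = np.arange(6, dtype=float).reshape(3, 2)  # 3 phonemes, 2-dim features
pred_durations = np.array([3, 4, 2])                     # frames per phoneme
pred_pitch = [[0.0], [0.2], [-0.1]]                      # one normalized value each

# Inference-time edit: lengthen the second phoneme and give it a rising contour.
durations = pred_durations.copy()
durations[1] = 6
contours = list(pred_pitch)
contours[1] = [-0.5, 0.5]

frames = expand_to_frames(phoneme_feats, durations)
pitch = pitch_track(durations, contours)
print(frames.shape, pitch.round(2))
```

No weights change in this sketch. Only the values fed to the later stages of the model are edited, which is why an approach of this kind needs no retraining or extra style embeddings.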

Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model exclusively trained on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7% and 5.9% for German, Hungarian and Spanish respectively, beating the previous state-of-the-art CER by over 2x for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve the prosody of questions, and use it to generate Mandarin, a tonal language where vowel pitch varies at subphoneme level. We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model. All our code and audio samples are available online.
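The results above are reported as character error rates. As a reminder of what that metric measures, the sketch below computes CER as the Levenshtein edit distance between a reference text and a hypothesis transcript, divided by the reference length. It uses the standard definition and is not taken from the paper: the abstract does not describe the evaluation pipeline (for example, which speech recognizer transcribes the synthesized audio, or how text is normalized), and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Invented example: one substituted character in a 12-character reference.
print(f"{cer('guten morgen', 'guten morgan'):.3f}")
```

Because the distance is normalized by the reference length only, CER can exceed 100% when the hypothesis contains many insertions.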

Code Implementations: 1 repo