Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation
This work addresses a limitation in how TTS systems generalize prosodic detail, which matters for the authenticity and interpretability of synthetic speech. The contribution is incremental in that it builds on existing evaluation methods.
The study assessed whether neural TTS systems model consonant-induced f0 perturbation and found that they reproduce it accurately for high-frequency words but generalize poorly to low-frequency items, suggesting reliance on lexical memorization rather than abstract prosodic encoding.
This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
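The frequency-stratified comparison described above can be illustrated with a minimal sketch. It assumes, as one plausible operationalization rather than the study's confirmed procedure, that the perturbation is measured as the difference in vowel-onset f0 after voiceless versus voiced onset consonants, expressed in semitones relative to the vowel-midpoint f0. The record fields (`source`, `freq_bin`, `voicing`, `onset_f0_hz`, `mid_f0_hz`) and both function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def onset_raise_st(onset_f0_hz, mid_f0_hz):
    """Vowel-onset f0 in semitones relative to the vowel-midpoint f0."""
    return 12.0 * math.log2(onset_f0_hz / mid_f0_hz)


def perturbation_by_stratum(tokens):
    """Return {(source, freq_bin): voiceless-minus-voiced onset f0 in semitones}.

    Each token is a dict with the hypothetical keys described in the lead-in;
    this is an assumed measure, not necessarily the one used in the study.
    """
    groups = defaultdict(list)
    for t in tokens:
        key = (t["source"], t["freq_bin"], t["voicing"])
        groups[key].append(onset_raise_st(t["onset_f0_hz"], t["mid_f0_hz"]))

    effects = {}
    for source, freq_bin in {(s, b) for s, b, _ in groups}:
        voiceless = groups.get((source, freq_bin, "voiceless"))
        voiced = groups.get((source, freq_bin, "voiced"))
        if voiceless and voiced:  # skip strata missing either onset class
            effects[(source, freq_bin)] = mean(voiceless) - mean(voiced)
    return effects


def synthetic_gap(effects, freq_bin):
    """Natural effect minus synthetic effect for one frequency stratum."""
    return effects[("natural", freq_bin)] - effects[("synthetic", freq_bin)]
```

Under these assumptions, a gap near zero for high-frequency words combined with a clearly positive gap for low-frequency words would correspond to the pattern the abstract reports: the effect is reproduced for frequent items but attenuated for rare ones.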