AS AI CL SD · Sep 23, 2025

No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Seungyoun Shin, Dongha Ahn, Jiwoo Kim, Sungwook Jeon

arXiv:2509.18531v1 · 2 citations · h-index: 2

Originality: Highly original

AI Analysis

This paper addresses the challenge of learning natural prosody in TTS for applications such as call centers, where no automatic reward for prosody is available, and offers a data-efficient solution.

The paper tackles prosody collapse in neural text-to-speech (TTS) systems: training with transcription-oriented rewards lowers error rates but yields monotone, unnatural speech. The authors' iterative Direct Preference Optimization (DPO) method, which uses only a few hundred human-labeled preference pairs per round, achieves the highest human preference (ELO) with competitive character error rate (CER) on the KoCC-TTS dataset, outperforming GRPO and commercial baselines.
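For readers unfamiliar with the CER metric mentioned above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between a reference text and a transcript, divided by the reference length. The function names are illustrative, and the assumption that the hypothesis comes from an ASR transcript of the synthesized audio is ours; the paper's exact evaluation pipeline is not described on this page.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) over characters."""
    # prev[j] holds the distance between the first i-1 reference characters
    # and the first j hypothesis characters.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the metric only compares characters, two utterances with identical transcripts score the same CER regardless of how natural their intonation sounds, which is why it cannot serve as a prosody reward.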

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev
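To make the preference objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' implementation: the function and parameter names are hypothetical, `beta=0.1` is an arbitrary illustrative value, and the inputs are assumed to be sequence log-probabilities of the preferred and rejected utterances under the model being trained and under a frozen reference. In an iterative scheme, "regularizing to the current model" would plausibly mean refreshing that reference with the latest model at the start of each round, which is our reading rather than a detail stated here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair."""
    # Implicit rewards: how much more likely each sample is under the
    # policy than under the frozen reference model.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(logit)) written as softplus(-logit) for stability.
    return softplus(-logit)
```

When the policy equals the reference, the loss is log 2; it falls as the policy assigns relatively more probability to the human-preferred sample, and the reference terms penalize drifting far from the starting model.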
