AS LG SD · Sep 24, 2025

Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

arXiv:2509.20485v1 · 2.31 citations · h-index: 25 · IEEE Open Journal of Signal Processing

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation metrics in speech synthesis, a need that matters to researchers and developers in the field. The contribution is incremental because it builds on existing token-based methods.

The authors tackled the problem of objective evaluation in speech synthesis by proposing TTScore, a reference-free framework that uses conditional prediction of discrete tokens to measure intelligibility and prosody, achieving stronger correlations with human judgments than existing metrics on benchmarks like SOMOS, VoiceMOS, and TTSArena.

Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
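The likelihood-based scoring described in the abstract can be illustrated with a minimal sketch. It assumes a text-conditioned predictor has already produced one vector of logits per token position; the function names, the use of log-likelihood, and the length normalization are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_likelihood(step_logits, tokens, normalize=True):
    """Score a discrete token sequence under per-step predictor outputs.

    step_logits: list of logit vectors, one per token position. In this
        sketch they stand in for the output of a hypothetical
        text-conditioned sequence-to-sequence predictor.
    tokens: token ids extracted from the synthesized utterance
        (content tokens or prosody tokens).
    normalize: if True, divide by sequence length (an assumption made
        here so that utterances of different lengths are comparable).
    """
    if len(step_logits) != len(tokens):
        raise ValueError("need exactly one logit vector per token")
    if not tokens:
        raise ValueError("token sequence must be non-empty")
    total = 0.0
    for logits, tok in zip(step_logits, tokens):
        total += log_softmax(logits)[tok]
    return total / len(tokens) if normalize else total


# Toy example: a predictor that strongly expects tokens 0, 1, 2.
expected = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
matching = sequence_log_likelihood(expected, [0, 1, 2])
mismatched = sequence_log_likelihood(expected, [2, 0, 1])
print(matching > mismatched)
```

In this toy setup, tokens that agree with what the predictor expects from the text receive a higher score than tokens that do not, which is the intuition behind using conditional likelihood as an alignment measure.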

