SD CL AS Jun 24, 2025

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

arXiv:2506.19441v1 6 citations h-index: 3
Originality: Incremental advance
AI Analysis

TTSDS2 gives TTS researchers and developers a robust evaluation framework, addressing the difficulty of comparing systems as they approach human-quality speech. The contribution is incremental, since it enhances an existing metric.

The authors tackle the challenge of evaluating high-quality text-to-speech (TTS) systems by introducing TTSDS2, an improved objective metric. Of the 16 metrics compared, it is the only one to reach a Spearman correlation above 0.50 for every domain and subjective score evaluated.

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
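
The validation criterion in the abstract, a Spearman correlation above 0.50 in every domain, can be illustrated with a minimal sketch. This is not the authors' code and it does not compute TTSDS2 itself. It only shows how an objective metric's per-system scores could be checked against subjective scores, domain by domain. The function names, the domain names, and all numbers below are hypothetical, and ties are handled with average ranks as an assumption. In practice a library routine such as `scipy.stats.spearmanr` would be the usual choice; the version here uses only the standard library.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def passes_everywhere(metric_scores, subjective_scores, threshold=0.50):
    """True if the correlation exceeds the threshold in every domain."""
    return all(
        spearman(metric_scores[d], subjective_scores[d]) > threshold
        for d in metric_scores
    )


# Hypothetical per-system scores for two made-up domains (not from the paper).
metric = {
    "domain_a": [0.91, 0.85, 0.78, 0.60],
    "domain_b": [0.70, 0.88, 0.65, 0.80],
}
mos = {
    "domain_a": [4.5, 4.1, 3.9, 3.0],
    "domain_b": [3.8, 4.4, 3.2, 4.0],
}

print(passes_everywhere(metric, mos))
```

A metric that correlates well in one domain but falls to or below the threshold in another would fail this check, which is the sense in which the abstract describes TTSDS2 as the only one of the 16 compared metrics to pass for every domain and subjective score.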
