CL, LG, SD, AS
Aug 6, 2025

The State Of TTS: A Case Study with Human Fooling Rates

Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra

arXiv:2508.04179v1
2 citations
h-index: 40
Has Code
INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the need for more realistic, human-centric evaluations in TTS, though it is incremental as it builds on existing subjective tests.

The study introduces Human Fooling Rate (HFR), a metric for evaluating text-to-speech (TTS) systems that measures how often machine-generated speech is mistaken for human speech. It finds that commercial models approach human-level fooling rates in zero-shot settings, while open-source systems still struggle with natural conversational speech.

While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; and (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
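
As a rough illustration of the metric described in the abstract, the sketch below computes a per-system fooling rate from binary listener judgments. It assumes each judgment is a simple "human or machine" decision and that HFR is the fraction of a system's samples judged human; the paper's exact protocol (rater pooling, sample selection, any confidence weighting) is not given here and may differ. The function name `human_fooling_rate`, the system labels, and the toy judgments are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def human_fooling_rate(judgments):
    """Return, per system, the fraction of judgments labelled 'human'.

    `judgments` is an iterable of (system_name, judged_human) pairs.
    Assumption: each judgment is a binary human/machine decision.
    """
    counts = defaultdict(lambda: [0, 0])  # system -> [judged_human, total]
    for system, judged_human in judgments:
        counts[system][0] += 1 if judged_human else 0
        counts[system][1] += 1
    return {system: fooled / total for system, (fooled, total) in counts.items()}


# Toy, made-up judgments for illustration only (not data from the paper).
judgments = [
    ("human_reference", True), ("human_reference", True),
    ("human_reference", False), ("human_reference", True),
    ("tts_a", True), ("tts_a", False),
    ("tts_a", False), ("tts_a", True),
]

rates = human_fooling_rate(judgments)
for system, rate in sorted(rates.items()):
    print(f"{system}: HFR = {rate:.2f}")
```

Scoring the human reference recordings the same way gives the ceiling the abstract refers to in point (ii): if real speech in a dataset is itself rarely judged human, matching it says little about a TTS system's realism.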
