CL SD AS | Jun 19, 2025

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

arXiv:2506.16381v1 | 13 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

InstructTTSEval addresses a critical gap for speech-synthesis researchers and developers by providing a standardized benchmark for assessing and optimizing instruction-following TTS models. The advance is incremental, since it builds on existing work in instruction-driven TTS.

The paper tackles the lack of benchmarks for evaluating how well text-to-speech systems follow complex natural-language instructions that control paralinguistic features. It introduces InstructTTSEval, with 6,000 test cases across three tasks, and uses Gemini as an automatic judge. The evaluation finds substantial room for improvement in existing systems.

In modern speech synthesis, paralinguistic information, such as a speaker's vocal timbre, emotional state, and dynamic prosody, plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or an inserted speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. The benchmark comprises three tasks (Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play), each with an English and a Chinese subset of 1k test cases (6k in total) paired with reference audio. We use Gemini as an automatic judge to assess the instruction-following abilities of TTS systems. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

