SD, Apr 21

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

arXiv:2604.16211, 94.51 citations, h-index: 9
AI Analysis

Provides a standardized evaluation framework for a previously under-evaluated aspect of speech synthesis, enabling fair cross-system comparison.

NVBench introduces a bilingual benchmark for evaluating speech synthesis with non-verbal vocalizations, revealing that controllability often decouples from quality and identifying persistent bottlenecks in low-SNR oral cues and long-duration affective NVVs.

Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited: existing protocols rarely assess jointly whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present the Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.
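One way to picture what "controllability decouples from quality" could mean in practice is to keep the two axes as separate per-system scores and check how well their rankings agree. The sketch below is illustrative only and is not the paper's protocol: the `SystemScores` class, the system names, and every number are made-up assumptions, and Spearman rank correlation is simply a convenient stand-in for whatever analysis the authors actually ran.

```python
from dataclasses import dataclass


@dataclass
class SystemScores:
    """Hypothetical per-system scores on two separate evaluation axes."""
    name: str
    quality: float          # general speech naturalness/quality (assumed scale)
    controllability: float  # NVV-specific controllability (assumed scale)


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy, made-up scores for four hypothetical systems (not results from the paper).
systems = [
    SystemScores("sys_a", quality=4.5, controllability=0.30),
    SystemScores("sys_b", quality=4.1, controllability=0.85),
    SystemScores("sys_c", quality=3.6, controllability=0.60),
    SystemScores("sys_d", quality=3.2, controllability=0.40),
]

rho = spearman(
    [s.quality for s in systems],
    [s.controllability for s in systems],
)
print(f"Spearman rho between quality and controllability: {rho:.2f}")
```

A rank correlation near zero (or negative) on real benchmark scores would be consistent with the two axes being decoupled, which is why reporting them separately rather than as a single aggregate can matter for cross-system comparison.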
