NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Qinke Ni, Huan Liao, Dekun Chen, Yuxiang Wang, Zhizheng Wu

arXiv:2603.15352

AI Analysis

NV-Bench provides a standardized evaluation framework for researchers and developers working on expressive text-to-speech, though the contribution is incremental, since it builds on existing work that integrates nonverbal vocalizations into TTS.

The authors tackled the lack of standardized evaluation for nonverbal vocalizations in text-to-speech systems by proposing NV-Bench, a benchmark with 1,651 multilingual utterances and a dual-dimensional protocol, showing strong correlation between objective metrics and human perception.

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, utilizing the proposed paralinguistic character error rate (PCER) to assess controllability, (2) Acoustic Fidelity, measuring the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.
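
The abstract names PCER but does not define it here. As a rough illustration only, the sketch below computes a character-level error rate in which each nonverbal tag counts as a single token, so a missing or wrong tag costs one edit. The bracketed tag format (such as "[laugh]"), the function names, and the normalization by reference length are all assumptions made for this example, not the authors' definition.

```python
import re

# Assumption: nonverbal events appear as bracketed labels, e.g. "[laugh]".
NV_TAG = re.compile(r"\[[^\[\]]+\]")


def tokenize(text):
    """Split text into characters, keeping each bracketed tag as one token.

    Whitespace is dropped so spacing differences are not counted as errors.
    """
    tokens = []
    pos = 0
    for match in NV_TAG.finditer(text):
        tokens.extend(ch for ch in text[pos:match.start()] if not ch.isspace())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tag_aware_error_rate(reference, hypothesis):
    """Edit distance over tag-aware tokens, divided by reference length."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, the reference would be the instruction text with its tags, and the hypothesis would be a tag-aware transcript of the synthesized audio. How that transcript is obtained is outside the scope of this sketch.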
