LGSD | Jul 17, 2025

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech

arXiv:2507.13155v114 citations | h-index: 6 | Has Code | 13th edition of the Speech Synthesis Workshop
Originality: Synthesis-oriented
AI Analysis

This work addresses a key bottleneck in expressive text-to-speech research by providing a public dataset, though the contribution is incremental because it builds on existing sources and methods.

The authors address the limited availability of open-source datasets for expressive speech synthesis by introducing NonverbalTTS, a 17-hour dataset annotated with nonverbal vocalizations and emotions. Open-source TTS models fine-tuned on it reach parity with closed-source systems in human evaluation and on automatic metrics.

Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.
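
The abstract mentions a fusion algorithm that merges transcriptions from multiple annotators but does not describe it here. As a rough illustration of what such a step could look like, the sketch below applies a simple majority vote over NV tags placed into a shared ASR word sequence. This is not the authors' algorithm: the function name `fuse_nv_tags`, the bracketed tag format (e.g. `[laughter]`), the index-based annotation layout, and the strict-majority threshold are all assumptions made for the example.

```python
from collections import Counter


def fuse_nv_tags(words, annotations, min_votes=None):
    """Insert NV tags chosen by majority vote into a shared word sequence.

    Hypothetical helper, not taken from the paper.

    words: list of words from a single ASR transcript.
    annotations: one dict per annotator, mapping an insertion index
        (0 = before the first word, len(words) = after the last word)
        to an NV tag string such as "[laughter]".
    min_votes: votes required to keep a tag; defaults to a strict majority.
    """
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1  # strict majority (assumption)

    # Count how many annotators placed each tag at each position.
    votes = Counter()
    for annotation in annotations:
        for index, tag in annotation.items():
            votes[(index, tag)] += 1

    # Keep only the (position, tag) pairs with enough agreement.
    accepted = {}
    for (index, tag), count in votes.items():
        if count >= min_votes:
            accepted.setdefault(index, []).append(tag)

    # Rebuild the transcript, emitting accepted tags before each word
    # and, for index == len(words), after the last word.
    fused = []
    for index in range(len(words) + 1):
        fused.extend(sorted(accepted.get(index, [])))
        if index < len(words):
            fused.append(words[index])
    return " ".join(fused)


words = ["well", "that", "was", "funny"]
annotations = [
    {4: "[laughter]"},
    {0: "[breath]", 4: "[laughter]"},
    {4: "[laughter]"},
]
print(fuse_nv_tags(words, annotations))
```

In this toy input all three annotators mark laughter after the last word, while only one marks a breath at the start, so only the laughter tag survives the vote. A real pipeline would also have to reconcile differing word sequences across annotators, which this sketch sidesteps by assuming a single shared transcript.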

