ASCL, Jul 7, 2024

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

arXiv:2407.05361v3, 245 citations, h-index: 12, Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of training data for speech generation models, a problem that chiefly affects researchers and developers in AI and speech technology. The contribution is incremental in that it builds on existing data collection efforts.

The authors tackle the scarcity of large, diverse, and spontaneous speech datasets for speech generation by introducing two things: Emilia, a dataset with over 101k hours of speech across six languages, and Emilia-Pipe, an open-source preprocessing pipeline. Together these enable more natural and spontaneous speech generation.

Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

Code Implementations: 1 repo