SD CL AS | Jan 27, 2025

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen

arXiv:2501.15907v225.831 citations | h-index: 12 | Has Code | IEEE Transactions on Audio, Speech, and Language Processing

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of generating realistic, spontaneous speech for applications such as conversational AI. The contribution is incremental in that it focuses on dataset creation rather than on a new model paradigm.

The authors address a limitation of speech generation models trained on formal audio-book datasets by introducing Emilia, a multilingual dataset of over 101k hours extracted from in-the-wild sources. Models trained on Emilia produce more spontaneous, human-like speech while maintaining intelligibility.

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.
