ASJul 7, 2024Code
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech GenerationHaorui He, Zengqiang Shang, Chaoren Wang et al.
Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.
SDJan 27, 2025Code
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech GenerationHaorui He, Zengqiang Shang, Chaoren Wang et al.
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.
LGJan 4, 2025Code
DiffGraph: Heterogeneous Graph Diffusion ModelZongwei Li, Lianghao Xia, Hua Hua et al.
Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: https://github.com/HKUDS/DiffGraph.
SDJan 29, 2022
The HCCL-DKU system for fake audio generation task of the 2022 ICASSP ADD ChallengeZiyi Chen, Hua Hua, Yuxiang Zhang et al.
The voice conversion task is to modify the speaker identity of continuous speech while preserving the linguistic content. Generally, the naturalness and similarity are two main metrics for evaluating the conversion quality, which has been improved significantly in recent years. This paper presents the HCCL-DKU entry for the fake audio generation task of the 2022 ICASSP ADD challenge. We propose a novel ppg-based voice conversion model that adopts a fully end-to-end structure. Experimental results show that the proposed method outperforms other conversion models, including Tacotron-based and Fastspeech-based models, on conversion quality and spoofing performance against anti-spoofing systems. In addition, we investigate several post-processing methods for better spoofing power. Finally, we achieve second place with a deception success rate of 0.916 in the ADD challenge.
AIJul 28, 2018
Towards Explainable Inference about Object Motion using Qualitative ReasoningXiaoyu Ge, Jochen Renz, Hua Hua
The capability of making explainable inferences regarding physical processes has long been desired. One fundamental physical process is object motion. Inferring what causes the motion of a group of objects can even be a challenging task for experts, e.g., in forensics science. Most of the work in the literature relies on physics simulation to draw such infer- ences. The simulation requires a precise model of the under- lying domain to work well and is essentially a black-box from which one can hardly obtain any useful explanation. By contrast, qualitative reasoning methods have the advan- tage in making transparent inferences with ambiguous infor- mation, which makes it suitable for this task. However, there has been no suitable qualitative theory proposed for object motion in three-dimensional space. In this paper, we take this challenge and develop a qualitative theory for the motion of rigid objects. Based on this theory, we develop a reasoning method to solve a very interesting problem: Assuming there are several objects that were initially at rest and now have started to move. We want to infer what action causes the movement of these objects.