Shulin He

SD
h-index23
3papers
47citations
Novelty42%
AI Score23

3 Papers

ASApr 23, 2024
FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye, Zeqian Ju, Haohe Liu et al.

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in https://flashspeech.github.io/.

SDMay 6, 2021
DBNet: A Dual-branch Network Architecture Processing on Spectrum and Waveform for Single-channel Speech Enhancement

Kanghao Zhang, Shulin He, Hao Li et al.

In real acoustic environment, speech enhancement is an arduous task to improve the quality and intelligibility of speech interfered by background noise and reverberation. Over the past years, deep learning has shown great potential on speech enhancement. In this paper, we propose a novel real-time framework called DBNet which is a dual-branch structure with alternate interconnection. Each branch incorporates an encoder-decoder architecture with skip connections. The two branches are responsible for spectrum and waveform modeling, respectively. A bridge layer is adopted to exchange information between the two branches. Systematic evaluation and comparison show that the proposed system substantially outperforms related algorithms under very challenging environments. And in INTERSPEECH 2021 Deep Noise Suppression (DNS) challenge, the proposed system ranks the top 8 in real-time track 1 in terms of the Mean Opinion Score (MOS) of the ITU-T P.835 framework.

SDOct 25, 2020
Speakerfilter-Pro: an improved target speaker extractor combines the time domain and frequency domain

Shulin He, Hao Li, Xueliang Zhang

This paper introduces an improved target speaker extractor, referred to as Speakerfilter-Pro, based on our previous Speakerfilter model. The Speakerfilter uses a bi-direction gated recurrent unit (BGRU) module to characterize the target speaker from anchor speech and use a convolutional recurrent network (CRN) module to separate the target speech from a noisy signal.Different from the Speakerfilter, the Speakerfilter-Pro sticks a WaveUNet module in the beginning and the ending, respectively. The WaveUNet has been proven to have a better ability to perform speech separation in the time domain. In order to extract the target speaker information better, the complex spectrum instead of the magnitude spectrum is utilized as the input feature for the CRN module. Experiments are conducted on the two-speaker dataset (WSJ0-mix2) which is widely used for speaker extraction. The systematic evaluation shows that the Speakerfilter-Pro outperforms the Speakerfilter and other baselines, and achieves a signal-to-distortion ratio (SDR) of 14.95 dB.