AS SD Dec 17, 2020

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

arXiv:2012.09547v2, 50 citations
AI Analysis

This work provides a way to train high-quality TTS models for speakers when only noisy speech data is available, a common situation for speech synthesis researchers and developers because clean recordings are costly to collect.

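This paper addresses the challenge of training text-to-speech (TTS) models on noisy speech data by introducing DenoiSpeech, a system that models noise at the frame level. DenoiSpeech synthesizes clean speech for a speaker whose training data is noisy, and on real-world data it outperforms the two previous approaches (denoising the training speech with an enhancement model, and conditioning on a single noise embedding) by 0.31 and 0.66 MOS, respectively.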

While neural-based text-to-speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which makes it challenging to train a TTS model for this speaker. Previous works usually address the challenge with one of two methods: 1) training the TTS model on speech denoised with an enhancement model; 2) taking a single noise embedding as input when training with noisy speech. However, they usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS, respectively.
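The contrast the abstract draws between a single noise embedding and frame-level noise modeling can be illustrated with a toy sketch. This is not the DenoiSpeech architecture: the function names, the shapes, the additive conditioning, the random projection `proj`, and the use of an all-zero condition to stand in for "clean" are all assumptions made here for illustration only.

```python
import numpy as np

def utterance_level_condition(hidden, noise_embedding):
    """Single-embedding conditioning: the same noise vector is added to every frame."""
    return hidden + noise_embedding[None, :]

def frame_level_condition(hidden, noise_frames, proj):
    """Frame-level conditioning: each frame receives its own projected noise feature."""
    return hidden + noise_frames @ proj

rng = np.random.default_rng(0)
T, D, M = 6, 4, 3  # frames, hidden size, noise-feature size (hypothetical values)

hidden = rng.normal(size=(T, D))       # stand-in for per-frame TTS hidden states
noise_embedding = rng.normal(size=D)   # one vector for the whole utterance
proj = rng.normal(size=(M, D))         # stand-in for a learned noise projection

# Time-varying noise: absent in the first three frames, present in the last three.
noise_frames = np.zeros((T, M))
noise_frames[3:] = rng.normal(size=(3, M))

utt = utterance_level_condition(hidden, noise_embedding)
frm = frame_level_condition(hidden, noise_frames, proj)

# Assumption: an all-zero condition represents "no noise" when synthesizing.
clean = frame_level_condition(hidden, np.zeros((T, M)), proj)
```

In this sketch the utterance-level variant shifts every frame by the same offset, so it cannot express noise that appears only in part of an utterance, while the frame-level variant leaves the noise-free frames untouched and changes only the noisy ones.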

