Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR
This work addresses the problem that ASR systems underperform on real speech even when trained on TTS output that humans judge as natural, and reports an incremental improvement in how effectively synthetic speech can be used for ASR training.
The study investigates whether oversmoothing in TTS models explains the poor real-speech performance of ASR systems trained on synthetic data. It compares DDPM and MSE-based TTS models as the number of training hours and the number of speakers are scaled. For a given model size, DDPM makes better use of more data and a more diverse set of speakers than MSE models, achieving the best reported ratio between real and synthetic speech WER to date (1.46), though a large gap remains.
Synthetically generated speech has rapidly approached human levels of naturalness. However, a paradox remains: ASR systems trained on TTS output that humans judge as natural continue to perform poorly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on how TTS-for-ASR behaves as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS when used for ASR model training. We test the scalability of the two approaches, varying both the number of hours and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.
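
The ratio reported above can be illustrated with a minimal sketch. It assumes the ratio is the WER of an ASR model trained on synthetic speech divided by the WER of an ASR model trained on real speech, with both evaluated on the same real test set. The abstract does not spell out this definition, so treat it as an assumption. The function names `word_error_rate` and `wer_ratio` are hypothetical, and the inputs in the usage lines are toy values, not results from this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_ratio(wer_synthetic_trained: float, wer_real_trained: float) -> float:
    """Assumed definition: synthetic-trained WER over real-trained WER.

    Under this assumption, 1.0 would mean synthetic training data is as
    useful as real data, and larger values mean a larger gap.
    """
    if wer_real_trained <= 0:
        raise ValueError("real-trained WER must be positive")
    return wer_synthetic_trained / wer_real_trained


# Toy usage with made-up transcripts and made-up WER values.
toy_wer = word_error_rate("the cat sat", "the bat sat")
toy_ratio = wer_ratio(0.3, 0.2)
```

Under this assumed definition, a ratio of 1.46 would mean the synthetic-trained model makes roughly 46% more word errors than the real-trained one on real speech.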