AS AI SD · Nov 17, 2022

Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

arXiv:2211.09383v2 · 18.664 citations · h-index: 43

Originality: Highly original

AI Analysis

This work addresses the challenge of generating speech that accurately mimics any target speaker's voice for applications in personalized TTS, representing a strong specific gain rather than a broad paradigm shift.

The paper tackles any-speaker adaptive text-to-speech synthesis, where existing methods mimic target speakers' styles with suboptimal accuracy. It presents Grad-StyleSpeech, a diffusion model-based framework that generates highly natural speech with extremely high similarity to a target speaker's voice from a few seconds of reference speech, and that significantly outperforms recent baselines on English benchmarks.

There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.
