SD, AI, AS | Nov 23, 2021

Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance

arXiv:2111.11755v4 | 136 citations
Originality: Highly original
AI Analysis

Guided-TTS enables text-to-speech synthesis for speakers who have only untranscribed recordings, addressing a practical bottleneck in speech synthesis applications.

The paper tackles the problem of generating high-quality speech from text without needing transcripts of the target speaker, achieving performance comparable to the state-of-the-art Grad-TTS model on the LJSpeech dataset.

We propose Guided-TTS, a high-quality text-to-speech (TTS) model that uses classifier guidance and does not require any transcript of the target speaker. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for classifier guidance. Our unconditional diffusion model learns to generate speech without any context from untranscribed speech data. For TTS synthesis, we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset. We present a norm-based scaling method that reduces the pronunciation errors of classifier guidance in Guided-TTS. We show that Guided-TTS achieves performance comparable to that of the state-of-the-art TTS model, Grad-TTS, without any transcript for LJSpeech. We further demonstrate that Guided-TTS performs well on diverse datasets, including a long-form untranscribed dataset.
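
The abstract does not spell out the norm-based scaling formula, so the following is only a minimal sketch of one plausible reading: the classifier gradient is rescaled so that its norm tracks the norm of the unconditional score before the two are added. The function names, the `scale` and `eps` parameters, and the flat-list representation of the score are illustrative assumptions, not the paper's implementation.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def norm_scaled_guidance(score, classifier_grad, scale=1.0, eps=1e-8):
    """Combine an unconditional score with a phoneme-classifier gradient.

    Assumption (not stated in the abstract): the classifier gradient is
    multiplied by ||score|| / ||classifier_grad|| so that the guidance
    term has a magnitude proportional to the score, then weighted by a
    hypothetical user-chosen `scale`.
    """
    ratio = l2_norm(score) / (l2_norm(classifier_grad) + eps)
    return [s + scale * ratio * g for s, g in zip(score, classifier_grad)]


# Toy usage: a tiny classifier gradient is boosted to the score's magnitude.
guided = norm_scaled_guidance([3.0, 4.0], [0.0, 0.1])
```

In this sketch the guidance term always has norm of about `scale * ||score||`, so a very small or very large classifier gradient cannot be drowned out by, or overwhelm, the unconditional score.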

