SD, CL, AS | Dec 4, 2021

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

arXiv:2112.02418v4 | 613 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of generating speech for diverse speakers and low-resource languages, offering incremental improvements in zero-shot TTS and voice conversion.

The paper tackles zero-shot multi-speaker text-to-speech and voice conversion by introducing YourTTS, a multilingual model based on VITS with novel modifications. It achieves state-of-the-art results in zero-shot multi-speaker TTS on VCTK and can be fine-tuned with less than 1 minute of speech to reach high voice similarity.

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important to allow synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
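
The abstract reports results in "voice similarity" without saying how it is scored. One common way to quantify it is the cosine similarity between fixed-size speaker embeddings of a reference recording and a synthesized one. The sketch below assumes that setup. The function name and the three-dimensional toy vectors are hypothetical, chosen only for illustration. Real embeddings would come from a speaker encoder, which is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: a real system would use
# vectors with hundreds of dimensions produced by a speaker encoder).
reference = [0.2, 0.9, -0.4]       # target speaker's reference audio
synthesized = [0.25, 0.85, -0.35]  # synthesized speech for the same speaker
other_speaker = [-0.7, 0.1, 0.6]   # an unrelated speaker

same_score = cosine_similarity(reference, synthesized)
other_score = cosine_similarity(reference, other_speaker)
print(f"same speaker: {same_score:.3f}, other speaker: {other_score:.3f}")
```

Under this assumption, a higher score for the synthesized sample than for an unrelated speaker suggests that the synthesized voice is closer to the target.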

Code Implementations (3 repos)
