Better speech synthesis through scaling
This work targets users who need high-quality, multi-voice TTS, but its contribution is incremental because it adapts existing methods from another domain.
The paper applies image generation techniques, namely autoregressive transformers and DDPMs, to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system whose model code and trained weights are open-sourced.
In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and denoising diffusion probabilistic models (DDPMs). These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.