SD CL AS

May 12, 2023

Better speech synthesis through scaling

arXiv:2305.07243v2

30.1126 citations

Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses speech synthesis for users needing high-quality, multi-voice TTS, but it is incremental as it adapts existing methods from another domain.

The paper tackles speech synthesis by applying image-generation techniques, namely autoregressive transformers and DDPMs, to build TorToise, an expressive, multi-voice text-to-speech system. The model code and trained weights are open-sourced.

In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and denoising diffusion probabilistic models (DDPMs). These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This way of improving performance need not be confined to images. This paper describes a way to apply advances from the image generation domain to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.
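To make the "step-wise probabilistic" framing concrete, here is a minimal sketch of the standard DDPM forward (noising) process on a toy vector. It is illustrative only and is not taken from the TorToise code: the function names, the linear schedule, and the values `beta_start=1e-4` and `beta_end=0.02` are assumptions chosen for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed, illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


# Toy usage: a 4-value "signal" noised lightly (t=0) and heavily (t=999).
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
clean = [0.5, -0.25, 0.1, 0.9]
slightly_noisy = noise_sample(clean, 0, alpha_bars, rng)
mostly_noise = noise_sample(clean, 999, alpha_bars, rng)
```

A generative model of this family is trained to reverse these steps, gradually turning noise back into a sample. The same idea carries over from pixels to audio representations, which is the transfer the abstract describes.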
