SD | CL | AS | May 12, 2023

Better speech synthesis through scaling

arXiv:2305.07243v2 | 124 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This work targets users who need high-quality, multi-voice text-to-speech (TTS). It is incremental in that it adapts existing methods from another domain, image generation.

The paper applies image generation techniques, namely autoregressive transformers and DDPMs, to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system whose model code and trained weights are open-sourced.

In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.

Code Implementations: 1 repo