Simple and Effective Unsupervised Speech Synthesis
The system enables speech synthesis for applications where labeled data is scarce or unavailable. The contribution is incremental in that it builds on existing unsupervised speech recognition and neural speech synthesis techniques.
The paper tackles speech synthesis without human-labeled data. It develops the first unsupervised system, which uses only unlabeled speech audio, unlabeled text, and a lexicon. In human evaluation, the synthesized speech is similar in naturalness and intelligibility to that of a supervised counterpart.
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.
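The abstract does not spell out how the lexicon is used. One plausible role, sketched below purely as an illustration, is to map unlabeled text to phoneme sequences that a recognizer or synthesizer can consume. The toy lexicon, the function name `phonemize`, and the `<unk>` placeholder are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy lexicon (word -> phoneme sequence). A real lexicon would
# cover the full vocabulary; these entries are for illustration only.
TOY_LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "is": ["IH", "Z"],
    "fun": ["F", "AH", "N"],
}


def phonemize(text: str, lexicon: Dict[str, List[str]], unk: str = "<unk>") -> List[str]:
    """Map raw text to a flat phoneme sequence by dictionary lookup.

    Words missing from the lexicon are replaced by a single `unk` symbol
    (an assumption of this sketch, not a detail from the paper).
    """
    phones: List[str] = []
    for token in text.lower().split():
        word = token.strip(".,!?;:")  # drop surrounding punctuation
        if not word:
            continue
        phones.extend(lexicon.get(word, [unk]))
    return phones


if __name__ == "__main__":
    print(phonemize("Speech is fun.", TOY_LEXICON))
```

In a pipeline of the kind the abstract describes, phoneme sequences like these could pair unlabeled text with pseudo-labels produced by an unsupervised recognizer. The actual training recipe is given in the body of the paper, not here.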