CL LG SD AS · Mar 5, 2021

Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

arXiv:2103.03541v23.020 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of scalable speech synthesis for low-resource languages, enabling synthesis without per-language resources, though it is incremental as it builds on existing multilingual and byte-based approaches.

The authors tackled the problem of scaling neural speech synthesis to many languages, especially low-resource ones, by developing a multilingual end-to-end framework that maps byte inputs to spectrograms, achieving strong results on over 40 languages and adapting to new languages with as little as 40 seconds of transcribed recording.

To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates the capability to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40s of transcribed recording, without the need for per-language resources such as a lexicon, extra corpus, auxiliary models, or linguistic expertise, thus ensuring scalability. At the same time, it retains satisfactory intelligibility and naturalness matching rich-resource models. Exhaustive comparative and ablation studies are performed to reveal the potential of the framework for low-resource languages. Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.
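As a rough illustration of why byte inputs allow arbitrary scripts, the sketch below turns text into a sequence of integer IDs, one per byte. It assumes UTF-8 encoding, which the abstract does not state, and the helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical, not taken from the authors' code. It covers only the input side; the model that maps these IDs to spectrograms is not shown.

```python
# Minimal sketch (assumption: UTF-8 encoding; helper names are hypothetical).
# Any script maps into the same fixed vocabulary of 256 byte values, so no
# per-language lexicon or grapheme-to-phoneme front end is required.

def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Inverse of text_to_byte_ids, useful for checking the round trip."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    for sample in ["hello", "你好", "привет"]:
        ids = text_to_byte_ids(sample)
        # Non-Latin scripts use several bytes per character, so the byte
        # sequence is longer than the character count.
        print(sample, len(sample), len(ids), ids)
```

One consequence of this choice is that input sequences for non-Latin scripts are longer than their character counts, since each character may occupy two or more bytes.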
