SD CL · Nov 10, 2020

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

arXiv:2011.04839v1 · 6.23 citations

Originality: Synthesis-oriented

AI Analysis

This work aims to improve the quality and efficiency of synthetic speech for multi-speaker applications, but it is incremental: it builds on and optimizes existing methods.

The study tackles zero-shot multi-speaker speech synthesis by evaluating pretraining strategies, neural vocoders, and acoustic configurations. It finds that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold improves the naturalness of synthetic speech and its similarity to unseen target speakers. It also finds that WaveRNN produces waveforms of comparable quality to WaveNet with faster inference, and that listeners can distinguish a 16kHz sampling rate from a 24kHz one.

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
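The "simple quality threshold" mentioned in the abstract can be pictured as a filter applied to the found audiobook data before pretraining. The following is a minimal sketch only: the abstract does not say which quality measure or cutoff was used, so the `Utterance` record, its `quality_score` field, the `filter_by_quality` helper, and the example threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """Hypothetical record for one audiobook utterance."""
    speaker_id: str
    path: str
    # Assumed per-utterance quality score (higher is better).
    # The actual metric is not specified in the abstract.
    quality_score: float


def filter_by_quality(utterances: List[Utterance], threshold: float) -> List[Utterance]:
    """Keep only utterances whose score is at or above the threshold."""
    return [u for u in utterances if u.quality_score >= threshold]


def speakers_retained(utterances: List[Utterance]) -> List[str]:
    """Return the sorted set of speaker IDs that still have data after filtering."""
    return sorted({u.speaker_id for u in utterances})


# Toy corpus with made-up scores, for illustration only.
corpus = [
    Utterance("spk_a", "a/001.wav", 31.5),
    Utterance("spk_a", "a/002.wav", 12.0),
    Utterance("spk_b", "b/001.wav", 8.4),
    Utterance("spk_c", "c/001.wav", 27.9),
]

kept = filter_by_quality(corpus, threshold=20.0)
```

With this toy data, `kept` holds the two utterances scoring at least 20.0, and `speakers_retained(kept)` shows which speakers remain. A threshold like this trades corpus size and speaker coverage against recording quality, which is the kind of choice the pretraining comparison in the abstract is concerned with.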
