SD · CL · Nov 10, 2020

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

arXiv:2011.04839v1 · 3 citations
AI Analysis

This work aims to improve synthetic speech quality and efficiency for multi-speaker applications, but it is incremental, building on existing methods with optimizations.

The study tackled zero-shot multi-speaker speech synthesis by evaluating pretraining strategies, neural vocoders, and acoustic configurations. It found that fine-tuning a multi-speaker model with quality-filtered audiobook data improved naturalness and similarity to unseen target speakers, that WaveRNN produced waveforms of comparable quality to WaveNet with faster inference, and that listeners could distinguish 16kHz from 24kHz sampling rates.

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
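As background for the sampling-rate comparison above, the short sketch below works out the highest frequency each rate can represent under the Nyquist limit (half the sampling rate). This is standard signal-processing arithmetic and not a result or method from the paper. The function name `nyquist_hz` is a hypothetical helper chosen for illustration.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest representable frequency for a given sampling rate (Nyquist limit)."""
    return sample_rate_hz / 2


# The two output sampling rates compared in the abstract.
rates = [16_000, 24_000]
bandwidths = {sr: nyquist_hz(sr) for sr in rates}

for sr, bw in bandwidths.items():
    print(f"{sr} Hz sampling -> content up to {bw:.0f} Hz")

# Extra high-frequency band available at 24 kHz relative to 16 kHz.
extra_band_hz = bandwidths[24_000] - bandwidths[16_000]
print(f"Additional band at 24 kHz: {extra_band_hz:.0f} Hz")
```

Audio at 24kHz can therefore carry content between 8 kHz and 12 kHz that 16kHz audio cannot. Whether that band explains the listener preference is not stated in the abstract.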
