AS SD · Nov 15, 2018

Towards achieving robust universal neural vocoding

Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

arXiv:1811.06292v2 · 25.358 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of creating a universal vocoder for speech synthesis that generalizes across speakers, languages, and recording conditions, representing an incremental improvement over existing methods.

The paper trains a WaveRNN-based vocoder on 74 speakers from 17 languages. Under studio-quality recording conditions it reaches 98% relative mean MUSHRA against natural speech, for speakers and styles seen during training as well as for out-of-domain ones. With degraded recordings, non-speech vocalizations, or singing, it still outperforms speaker-dependent vocoders, but at a lower average relative MUSHRA of 75%.

This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario, provided the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, regardless of whether they were seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).
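
The "relative MUSHRA" figures above can be read as a system's mean listener score expressed as a percentage of the mean score given to natural speech. The sketch below illustrates that arithmetic under this assumed definition; the function name `relative_mushra` and the ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_mushra(system_scores, reference_scores):
    """Return the system's mean MUSHRA score as a percentage of the reference mean.

    Assumed definition: 100 * mean(system) / mean(reference), where the
    reference is natural speech rated in the same listening test.
    """
    if not system_scores or not reference_scores:
        raise ValueError("both score lists must be non-empty")
    reference_mean = mean(reference_scores)
    if reference_mean == 0:
        raise ValueError("reference mean must be non-zero")
    return 100.0 * mean(system_scores) / reference_mean


# Hypothetical listener ratings on the 0-100 MUSHRA scale (illustrative only).
natural = [80, 90, 85, 85]   # natural recordings
vocoded = [78, 88, 84, 83]   # vocoder output for the same utterances

print(f"relative MUSHRA: {relative_mushra(vocoded, natural):.1f}%")
```

With these made-up ratings the vocoder averages 83.25 against 85 for natural speech, which is about 97.9% relative MUSHRA.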
