SD, AS | Oct 29, 2021

VRAIN-UPV MLLP's system for the Blizzard Challenge 2021

arXiv:2110.15792v1 | 6 citations
AI Analysis

This is an incremental improvement in speech synthesis for Spanish, addressing the challenge of building high-quality TTS systems with limited data.

The paper describes a Spanish text-to-speech system for the Blizzard Challenge 2021, using a two-stage neural pipeline with a non-autoregressive acoustic model and HiFi-GAN vocoder, which achieved a naturalness MOS of 3.61, second-best among 12 participants.

This paper presents the VRAIN-UPV MLLP speech synthesis system for the SH1 task of the Blizzard Challenge 2021. The SH1 task consisted of building a Spanish text-to-speech system trained on (but not limited to) the corpus released by the Blizzard Challenge 2021 organization. The corpus included 5 hours of studio-quality recordings from a native Spanish female speaker. In our case, this dataset alone was used to build a two-stage neural text-to-speech pipeline composed of a non-autoregressive acoustic model with explicit duration modeling and a HiFi-GAN neural vocoder. Our team is identified as J in the evaluation results. Our system obtained very good results in the subjective evaluation tests: only one of the other 11 participating systems achieved better naturalness than ours. Specifically, our system achieved a naturalness MOS of 3.61, compared to 4.21 for real samples.
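The abstract does not spell out how explicit duration modeling works, so the following is only a minimal sketch of the general idea as it is commonly used in non-autoregressive acoustic models: each phoneme-level vector is repeated for a predicted number of frames, so the frame-level sequence can be generated in parallel. It is not the authors' implementation. The function name `length_regulate`, the toy vectors, and the durations are all hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_vectors: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector for its number of frames.

    Hypothetical helper: in a real system the durations would come from
    a trained duration predictor, and the vectors from a text encoder.
    """
    if len(phoneme_vectors) != len(durations):
        raise ValueError("one duration is required per phoneme vector")

    frames: List[List[float]] = []
    for vector, n_frames in zip(phoneme_vectors, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per frame; a duration of 0 skips the phoneme.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy input: three phoneme vectors with assumed durations of 2, 0 and 3 frames.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]

frames = length_regulate(hidden, durations)
print(len(frames))  # 5
```

In a two-stage pipeline of the kind described, the expanded frame-level sequence would be decoded into an acoustic representation such as a mel spectrogram, which the vocoder then converts into a waveform.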

