AS | SD | Dec 3, 2019

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

arXiv:1912.01167v1 | 15 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for higher-quality synthesized speech and represents an incremental improvement.

The paper tackles the problem of over-smooth mel-spectrograms in speech synthesis by using a learning-based post-filter that combines Pix2PixHD and ResUnet for super-resolution reconstruction. It reports mean opinion scores of 3.71 (Griffin-Lim vocoder) and 4.01 (WaveNet vocoder), compared to baselines of 3.29 and 3.84.

In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are over-smooth and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post-filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.
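
As a quick check on the reported numbers, the sketch below computes the absolute and relative MOS gains for each vocoder. Only the four scores come from the abstract; the dictionary layout and the function name `mos_gain` are illustrative assumptions, not anything defined in the paper.

```python
# MOS values as reported in the abstract: (baseline, proposed) per vocoder.
# The structure and names below are hypothetical, chosen for illustration.
REPORTED_MOS = {
    "Griffin-Lim": (3.29, 3.71),
    "WaveNet": (3.84, 4.01),
}


def mos_gain(baseline, proposed):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


# Gains per vocoder, computed once so they can be inspected or printed.
GAINS = {name: mos_gain(*scores) for name, scores in REPORTED_MOS.items()}

if __name__ == "__main__":
    for name, (absolute, relative) in GAINS.items():
        print(f"{name}: +{absolute:.2f} MOS ({relative:.1f}% over baseline)")
```

On these figures the improvement is larger with Griffin-Lim (+0.42 MOS) than with WaveNet (+0.17 MOS).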

