AS · LG · SD · Jun 29, 2021

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

arXiv:2106.15649v1
Originality: Incremental advance
AI Analysis

This work targets prosody quality in neural text-to-speech. It appears incremental: it extends existing mel-spectrogram modelling by predicting spectrograms at several linguistically motivated scales.

The paper aims to synthesise speech with improved prosody. It proposes Multi-Scale Spectrogram (MSS) modelling, which first predicts coarser-scale mel-spectrograms that capture suprasegmental information and then uses them to predict finer-scale mel-spectrograms that capture fine-grained prosody. In subjective evaluations, Word-level MSS performs statistically significantly better than the baseline on two voices.

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism in which the system first predicts coarser-scale mel-spectrograms that capture the suprasegmental information in speech, and then uses these coarser-scale mel-spectrograms to predict finer-scale mel-spectrograms that capture fine-grained prosody. We present details for two specific versions of MSS, called Word-level MSS and Sentence-level MSS, where the scales in our system are motivated by linguistic units. Word-level MSS models word-, phoneme-, and frame-level spectrograms, while Sentence-level MSS additionally models a sentence-level spectrogram. Subjective evaluations show that Word-level MSS performs statistically significantly better than the baseline on two voices.
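The coarse-to-fine data flow described in the abstract can be illustrated with a small NumPy sketch. It covers only how word-, phoneme-, and frame-level spectrograms might relate to one another, not the neural predictors themselves. The abstract does not say how the coarser scales are computed, so average-pooling frames within each unit is an assumption here, and every name (`pool_by_segments`, `phone_durs`, `phones_per_word`) and all toy values are hypothetical.

```python
import numpy as np

def pool_by_segments(rows, lengths):
    """Average consecutive rows into one row per segment.

    ASSUMPTION: coarser-scale spectrograms are formed by mean-pooling;
    the source abstract does not specify the downsampling method.
    """
    lengths = np.asarray(lengths)
    if (lengths <= 0).any() or lengths.sum() != rows.shape[0]:
        raise ValueError("lengths must be positive and sum to the row count")
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    sums = np.add.reduceat(rows, starts, axis=0)  # per-segment sums
    return sums / lengths[:, None]                # per-segment means

# Toy frame-level "mel-spectrogram": 10 frames x 4 mel bins (illustrative values).
frame_mel = np.arange(40, dtype=float).reshape(10, 4)

# Hypothetical alignment: 4 phonemes with these frame counts, 2 phonemes per word.
phone_durs = np.array([2, 3, 1, 4])
phones_per_word = np.array([2, 2])
word_starts = np.concatenate(([0], np.cumsum(phones_per_word)[:-1]))
word_durs = np.add.reduceat(phone_durs, word_starts)  # frames per word

# Three scales, coarse to fine.
word_mel = pool_by_segments(frame_mel, word_durs)    # shape (2, 4)
phone_mel = pool_by_segments(frame_mel, phone_durs)  # shape (4, 4)

# Coarse-to-fine conditioning: each coarser row is repeated so that it lines up
# with the finer units it spans and could be fed to the finer-scale predictor.
word_cond_for_phones = np.repeat(word_mel, phones_per_word, axis=0)  # (4, 4)
phone_cond_for_frames = np.repeat(phone_mel, phone_durs, axis=0)     # (10, 4)
```

In a full system, each pooled array would be the training target of a separate predictor and the repeated arrays would be conditioning inputs to the next finer scale. A Sentence-level variant would add one more pooling step over all frames.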

