SD, LG | Oct 6, 2015

A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis

arXiv:1510.01443v1 | 3 citations
Originality: Incremental advance
AI Analysis

This work addresses speech quality issues for speech synthesis applications, representing an incremental improvement by focusing on phase spectrum modeling.

The paper tackles the problem of low-quality synthesized speech in statistical parametric speech synthesis by proposing a phase-embedded waveform representation framework, which outperforms a leading baseline system in objective evaluation metrics.

State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often ignored: a minimum-phase filter is used during synthesis instead, and speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that its performance is better than that of the widely used STRAIGHT vocoder. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system on various objective evaluation metrics.
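The role of phase in waveform reconstruction can be illustrated with a toy example. The sketch below is not the paper's framework. It takes a synthetic frame of two sinusoids (an assumption made for illustration), splits its discrete Fourier transform into magnitude and phase, and rebuilds the frame twice: once with the original phase and once with the phase set to zero. Zero phase is used here only as a simple stand-in for "phase discarded"; a conventional vocoder would impose a minimum-phase response instead. All function and variable names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


# Toy "speech frame": two sinusoids with arbitrary phase offsets (illustrative only).
n = 64
frame = [
    math.sin(2 * math.pi * 3 * t / n + 0.7) + 0.5 * math.sin(2 * math.pi * 7 * t / n + 2.1)
    for t in range(n)
]

# Split the spectrum into the two components discussed in the abstract.
spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]

# Reconstruction that keeps the original phase.
with_phase = idft([cmath.rect(m, p) for m, p in zip(magnitude, phase)])

# Reconstruction that keeps only the magnitude (phase forced to zero).
zero_phase = idft([complex(m, 0.0) for m in magnitude])

err_with_phase = rms_error(frame, with_phase)
err_zero_phase = rms_error(frame, zero_phase)

print(f"RMS error with original phase: {err_with_phase:.2e}")
print(f"RMS error with zero phase:     {err_zero_phase:.2e}")
```

Both reconstructions share the same magnitude spectrum, yet only the one that retains phase recovers the original waveform to within numerical precision. Waveform-level error is not a perceptual measure, so this only shows that magnitude alone does not determine the signal.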

