ASSD May 18, 2020

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

arXiv:2005.08654v2 (2 citations)
AI Analysis

This work addresses pitch controllability issues in non-autoregressive speech generation for applications such as text-to-speech, but it is incremental, as it builds on the existing PWG architecture.

The paper tackles the degradation of pitch accuracy in Parallel WaveGAN (PWG) vocoders when the input contains unseen pitches. It proposes a Quasi-Periodic Parallel WaveGAN (QPPWG) built on a pitch-dependent dilated convolution network, which yields higher pitch accuracy and comparable speech quality with a model 70% the size of vanilla PWG.

In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model whose generation speed is much faster than real time. When used as a vocoder to generate speech from acoustic features such as spectral and prosodic features, PWG produces high-fidelity speech. However, when the input acoustic features include unseen pitches, the pitch accuracy of PWG-generated speech degrades because PWG's fixed, generic network has no prior knowledge of speech periodicity. The proposed QPPWG adopts a pitch-dependent dilated convolution network (PDCNN) module, which introduces pitch information into PWG through a dynamically changing network architecture, to improve the pitch controllability and speech modeling capability of vanilla PWG. Both objective and subjective evaluations show that QPPWG-generated speech has higher pitch accuracy and comparable speech quality, even though the QPPWG model is only 70% of the size of vanilla PWG.
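The idea of a dilation that changes with pitch can be illustrated with a small NumPy sketch. The abstract does not state the dilation rule, so the sketch assumes one: the dilation at each sample is the sampling rate divided by (F0 times a "dense factor"), so that the taps of the convolution sit roughly a fixed fraction of a pitch period apart. The function names, the `dense_factor` parameter, the fallback for unvoiced samples, and the scalar three-tap kernel are all hypothetical choices made for illustration, not the paper's implementation.

```python
import numpy as np

def pitch_dependent_dilations(f0, sample_rate, dense_factor=4, base_dilation=1):
    """Per-sample dilation sizes that follow the pitch period.

    Assumed rule (not given in the abstract):
        scale_t    = sample_rate / (f0_t * dense_factor)
        dilation_t = max(1, round(scale_t * base_dilation))
    Unvoiced samples (f0 <= 0) fall back to the fixed base dilation.
    """
    f0 = np.asarray(f0, dtype=float)
    dilations = np.full(f0.shape, base_dilation, dtype=int)
    voiced = f0 > 0
    scale = sample_rate / (f0[voiced] * dense_factor)
    dilations[voiced] = np.maximum(1, np.rint(scale * base_dilation)).astype(int)
    return dilations

def pitch_dependent_dilated_conv(x, dilations, w_past, w_cur, w_future):
    """Single-channel, non-causal 3-tap convolution with a per-sample dilation.

    y[t] = w_past * x[t - d_t] + w_cur * x[t] + w_future * x[t + d_t],
    where taps that fall outside the signal are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(n)
    past = t - dilations
    future = t + dilations
    # Clip indices so the lookup is always valid, then zero the out-of-range taps.
    x_past = np.where(past >= 0, x[np.clip(past, 0, n - 1)], 0.0)
    x_future = np.where(future < n, x[np.clip(future, 0, n - 1)], 0.0)
    return w_past * x_past + w_cur * x + w_future * x_future

# Dilations for two voiced samples (200 Hz, 100 Hz) and one unvoiced sample at 16 kHz.
print(pitch_dependent_dilations([200.0, 100.0, 0.0], sample_rate=16000).tolist())
# -> [20, 40, 1]
```

With a fixed dilation, the receptive field covers a constant number of samples regardless of pitch. Under the assumed rule above, a lower F0 gives a larger dilation, so the taps keep tracking the pitch period, which is the intuition behind injecting pitch information through the network structure rather than only through the input features.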

Code Implementations: 1 repo
