AS, LG, SD | Sep 28, 2021

MSR-NV: Neural Vocoder Using Multiple Sampling Rates

arXiv:2109.13714v3
Originality: Incremental advance
AI Analysis

This work addresses the need for flexible speech synthesis across applications with different quality-speed trade-offs, though it is an incremental extension of existing methods.

Conventional neural vocoders must be trained separately for each sampling rate. The authors propose MSR-NV, which handles multiple sampling rates in a single model and achieves higher subjective quality than Parallel WaveGAN at 16, 24, and 48 kHz without increasing inference time.

The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to handle multiple sampling rates in a single NV, called the MSR-NV. By generating waveforms step-by-step starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. It can be regarded as an extension of the previously proposed NVs, and in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental evaluation results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing the inference time. We also show that MSR-NV can leverage speech with lower sampling rates to further improve the quality of the synthetic speech.
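The step-by-step generation described in the abstract can be illustrated with a minimal sketch. It is not the architecture from the paper. The function names (`resample_linear`, `placeholder_stage`, `generate_multi_rate`) are hypothetical, the per-stage generator is a noise-producing placeholder rather than a trained network, and linear interpolation stands in for whatever upsampling the actual model uses. Only the three sampling rates come from the abstract. The sketch shows the control flow: produce a waveform at the lowest rate, then repeatedly upsample it and add a stage-specific contribution, keeping the intermediate result at each rate.

```python
import numpy as np

# Target rates taken from the abstract; everything else here is illustrative.
RATES_HZ = (16_000, 24_000, 48_000)


def resample_linear(wave, src_rate, dst_rate):
    """Resample by linear interpolation (assumed stand-in for a learned upsampler)."""
    duration_s = len(wave) / src_rate
    n_out = int(round(duration_s * dst_rate))
    t_src = np.arange(len(wave)) / src_rate
    t_dst = np.arange(n_out) / dst_rate
    return np.interp(t_dst, t_src, wave)


def placeholder_stage(n_samples, rng):
    """Hypothetical stand-in for one trained generator stage.

    Returns low-amplitude noise; a real stage would be conditioned on
    acoustic features and would model the newly added frequency band.
    """
    return 0.01 * rng.standard_normal(n_samples)


def generate_multi_rate(duration_s, rates=RATES_HZ, seed=0):
    """Generate one waveform per rate, coarse to fine."""
    rng = np.random.default_rng(seed)
    outputs = {}
    wave, prev_rate = None, None
    for rate in rates:
        if wave is None:
            # Lowest rate: generate directly.
            n_samples = int(round(duration_s * rate))
            wave = placeholder_stage(n_samples, rng)
        else:
            # Higher rate: upsample the previous output, then add this stage's part.
            upsampled = resample_linear(wave, prev_rate, rate)
            wave = upsampled + placeholder_stage(len(upsampled), rng)
        outputs[rate] = wave
        prev_rate = rate
    return outputs


outputs = generate_multi_rate(duration_s=0.5)
for rate, wave in outputs.items():
    print(rate, len(wave))
```

Because every intermediate waveform is kept, a caller of this sketch can stop at whichever rate suits the application, which mirrors the quality-speed trade-off the abstract mentions: a 48 kHz output has three times as many samples per second as a 16 kHz one.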

