AS, LG, SD, ML
Mar 31, 2022

SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

arXiv:2203.16749v2
58 citations
AI Analysis

This work addresses speech quality enhancement for neural vocoders, representing an incremental improvement over existing diffusion-based methods.

The authors improve speech quality in neural vocoders by adapting the diffusion noise distribution so that it matches the spectral envelope of the conditioning features, which yields higher-fidelity speech in both analysis-synthesis and speech enhancement scenarios.

Neural vocoders using denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality, especially in the high-frequency bands. The filtering is performed in the time-frequency domain, which keeps the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
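
To make the idea of shaping noise with a time-varying filter in the time-frequency domain more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a per-frame magnitude envelope on linear frequency bins is already available (the abstract does not say how the envelope is derived from the log-mel spectrogram), and it applies that envelope as a simple zero-phase gain on the STFT of white Gaussian noise. The function names `stft`, `istft`, and `shape_noise`, and the parameters `n_fft` and `hop`, are hypothetical choices for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Weighted overlap-add inverse of stft(); edge samples are unreliable."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + n_fft)
        out[sl] += frames[i] * window
        norm[sl] += window ** 2
    # The floor only guards against division by zero at the very edges.
    return out / np.maximum(norm, 1e-8)

def shape_noise(envelope, n_fft=256, hop=64, rng=None):
    """Filter white Gaussian noise so that each frame follows `envelope`.

    envelope: assumed magnitude envelope, shape (n_frames, n_fft // 2 + 1).
    Returns a waveform of length n_fft + hop * (n_frames - 1).
    """
    assert n_fft % hop == 0, "this sketch assumes hop divides n_fft"
    rng = np.random.default_rng() if rng is None else rng
    # Pad the envelope with repeated edge frames so that the poorly
    # normalised overlap-add edges can be trimmed off afterwards.
    pad = n_fft // hop
    env = np.pad(envelope, ((pad, pad), (0, 0)), mode="edge")
    white = rng.standard_normal(n_fft + hop * (env.shape[0] - 1))
    # Time-varying filtering: one real-valued gain per time-frequency bin.
    shaped_spec = stft(white, n_fft, hop) * env
    shaped = istft(shaped_spec, n_fft, hop)
    return shaped[n_fft:-n_fft]
```

For example, passing an envelope that is close to 1 in the low bins and close to 0 in the high bins for every frame should produce low-pass noise. In a DDPM vocoder, noise shaped this way would stand in for the white Gaussian noise used by the diffusion process, though the training objective and sampling procedure are outside the scope of this sketch.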

