AS · CL · LG · SD · ML · Sep 21, 2020

DiffWave: A Versatile Diffusion Model for Audio Synthesis

arXiv:2009.09761v3 · 1946 citations
Originality: Highly original
AI Analysis

This paper addresses slow, low-quality audio synthesis in applications such as speech generation and music production, offering a versatile and efficient alternative.

The authors propose DiffWave, a diffusion model for waveform generation. As a neural vocoder it matches a strong WaveNet vocoder in speech quality (MOS: 4.44 vs. 4.43) while synthesizing orders of magnitude faster, and it outperforms autoregressive and GAN-based models in unconditional generation on both audio quality and sample diversity.

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
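The abstract's two mechanisms, training on a variant of the variational bound and synthesizing through a Markov chain with a fixed number of steps, can be illustrated with a minimal NumPy sketch in the style of a standard denoising diffusion model. This is not the authors' implementation. The linear schedule values, the noise-prediction loss, the choice of reverse-step variance, and every function name below (make_schedule, diffuse, training_loss, sample, dummy_eps_model) are assumptions made for illustration, and the placeholder network simply predicts zero noise, so the output is not audio.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule. The step count and endpoints are illustrative assumptions."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product over steps
    return betas, alphas, alpha_bars

def diffuse(x0, t, alpha_bars, rng):
    """Forward process: draw the noisy x_t directly from the clean waveform x0."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def training_loss(eps_model, x0, alpha_bars, rng):
    """Simplified objective (assumed here): squared error between true and predicted noise."""
    t = int(rng.integers(0, len(alpha_bars)))  # random diffusion step
    x_t, eps = diffuse(x0, t, alpha_bars, rng)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

def sample(eps_model, length, betas, alphas, alpha_bars, rng):
    """Reverse process: start from white noise and denoise for a fixed number of steps."""
    x = rng.standard_normal(length)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # One common variance choice for the reverse step (an assumption here).
            sigma = np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
            x = mean + sigma * rng.standard_normal(length)
        else:
            x = mean  # no noise is added at the final step
    return x

def dummy_eps_model(x_t, t):
    """Hypothetical stand-in for the trained network; always predicts zero noise."""
    return np.zeros_like(x_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule()
    waveform = sample(dummy_eps_model, 16000, betas, alphas, alpha_bars, rng)
    print(waveform.shape)
```

The loop in sample always runs num_steps iterations regardless of the waveform length, which is the sense in which synthesis uses a constant number of steps, in contrast to an autoregressive model that needs one step per audio sample. A conditional variant would pass extra inputs, such as a mel spectrogram, to the noise-prediction network.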

Code Implementations: 11 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
