SD LG AS · Oct 14, 2022

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

arXiv:2210.07508v2 · 20 citations · h-index: 27
Originality: Incremental advance
AI Analysis

This work addresses singing voice waveform generation (neural vocoding) for applications like music production. It is an incremental advance because it builds on existing diffusion model techniques.

The paper tackles the challenge of generating high-quality singing voices, which is difficult due to varied musical expressions, by proposing a hierarchical diffusion model that uses multiple diffusion models at different sampling rates to progressively generate waveforms. The method outperforms state-of-the-art neural vocoders in quality for multiple singers while maintaining similar computational costs.

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging due to the wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates; the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates on the basis of the data at the lower sampling rate and acoustic features. Experimental results show that the proposed method produces high-quality singing voices for multiple singers, outperforming state-of-the-art neural vocoders with a similar range of computational costs.
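The coarse-to-fine control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sampling rates, the noise schedule, the linear-interpolation `upsample`, and especially `toy_noise_predictor` (a hypothetical stand-in for a trained network) are all assumptions chosen so the example runs without a model. A single fundamental frequency `f0_hz` stands in for the acoustic features. Only the loop structure reflects the abstract: the lowest-rate stage generates first, and each higher-rate stage is conditioned on the output of the stage below it.

```python
import numpy as np

def upsample(x, factor):
    """Linear-interpolation upsampling (assumed stand-in for a real resampler)."""
    old_t = np.arange(len(x))
    new_t = np.arange(len(x) * factor) / factor
    return np.interp(new_t, old_t, x)

def toy_noise_predictor(x_t, alpha_bar_t, x0_guess):
    """Hypothetical stand-in for a trained network.

    Returns the noise that would be implied if x0_guess were the clean signal.
    A real model would predict this from x_t, the step, and the conditioning.
    """
    return (x_t - np.sqrt(alpha_bar_t) * x0_guess) / np.sqrt(1.0 - alpha_bar_t)

def reverse_diffusion(length, x0_guess, betas, rng):
    """Standard DDPM-style ancestral sampling loop using the toy predictor."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(length)  # start from Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, alpha_bars[t], x0_guess)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(length) if t > 0 else 0.0  # no noise at last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

def hierarchical_generate(f0_hz, duration_s, rates, betas, seed=0):
    """Run one diffusion stage per sampling rate, from lowest to highest."""
    rng = np.random.default_rng(seed)
    outputs, prev, prev_rate = {}, None, None
    for rate in rates:
        n = int(round(duration_s * rate))
        if prev is None:
            # Lowest rate: conditioned only on the (toy) acoustic feature, a pitch.
            t = np.arange(n) / rate
            guess = 0.5 * np.sin(2.0 * np.pi * f0_hz * t)
        else:
            # Higher rates: conditioned on the upsampled lower-rate output.
            guess = upsample(prev, rate // prev_rate)[:n]
        prev, prev_rate = reverse_diffusion(n, guess, betas, rng), rate
        outputs[rate] = prev
    return outputs

# Assumed example settings: three stages, 50 diffusion steps, 0.1 s of audio.
out = hierarchical_generate(220.0, 0.1, [6000, 12000, 24000],
                            np.linspace(1e-4, 0.05, 50))
print({rate: len(wave) for rate, wave in out.items()})
```

Because the toy predictor simply steers each stage toward its conditioning signal, the higher-rate outputs here are just interpolated copies of the lowest-rate one. In a trained system, each stage would instead add the high-frequency detail that the lower-rate signal cannot represent.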

