SDASOct 6, 2020

VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

arXiv:2010.02977v327 citations
Originality Incremental advance
AI Analysis

This addresses voice conversion challenges for speech processing applications, but is incremental as it builds on WaveGrad and score matching concepts.

The paper tackles the problem of non-parallel any-to-many voice conversion, enabling conversion from arbitrary speakers without parallel training data, and achieves this by using annealed Langevin dynamics with a score approximator network.

In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predict the gradient of the log density of the speech feature sequences of multiple speakers, and performs VC by using annealed Langevin dynamics to iteratively update an input feature sequence towards the nearest stationary point of the target distribution based on the trained score approximator network. Thanks to the nature of this concept, VoiceGrad enables any-to-many VC, a VC scenario in which the speaker of input speech can be arbitrary, and allows for non-parallel training, which requires no parallel utterances or transcriptions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes