AS | LG | SD | ML | Jul 15, 2019

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

arXiv:1907.07769v1 | 3 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses voice conversion for applications where little speaker-specific data is available. Its contribution is incremental: it adapts existing sequence-to-sequence methods from related fields (NMT, TTS and ASR).

The paper tackles voice conversion with limited parallel data using a hierarchical sequence-to-sequence model. The model is first trained as an autoencoder on a large single-speaker dataset and then adapted to smaller multispeaker voice conversion datasets. It represents audio as mel spectrograms and converts output frames back to audio with a WaveNet vocoder. The abstract reports no concrete performance numbers.

We present a voice conversion solution using recurrent sequence-to-sequence modeling for DNNs. Our solution takes advantage of recent advances in attention-based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting, where <source, target> audio pairs are available. Our seq2seq architecture uses a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention-based architecture from recent TTS work. Since there is a dearth of the large multispeaker voice conversion databases needed for training DNNs, we resort to training the network as an autoencoder on a large single-speaker dataset. The network is then adapted to the smaller multispeaker voice conversion datasets that are available. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
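The abstract does not say how the hierarchical encoder summarizes frames. One common way to build such a hierarchy, assumed here purely for illustration, is to concatenate adjacent frames between layers so that each level sees a shorter, wider sequence. The sketch below shows only that time-reduction step on toy data. The names `stack_frames` and `hierarchical_summarize`, the reduction factor of 2, and the 80-bin frame size are all hypothetical choices, and the recurrent layers that a real encoder would run between reductions are omitted.

```python
from typing import List

Frame = List[float]


def stack_frames(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one wider frame."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    # Zero-pad the tail so the sequence length divides evenly by `factor`.
    pad = (-len(frames)) % factor
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [x for frame in padded[i:i + factor] for x in frame]
        for i in range(0, len(padded), factor)
    ]


def hierarchical_summarize(
    frames: List[Frame], levels: int = 2, factor: int = 2
) -> List[Frame]:
    """Apply the time reduction `levels` times (assumed structure, not the paper's)."""
    out = frames
    for _ in range(levels):
        # A real encoder would run a recurrent layer over `out` before reducing.
        out = stack_frames(out, factor)
    return out


# Toy input: 10 frames of 80 "mel bins", each frame filled with its own index.
mel = [[float(t)] * 80 for t in range(10)]
summary = hierarchical_summarize(mel, levels=2)
print(len(mel), "frames ->", len(summary), "frames of width", len(summary[0]))
```

With two levels and a factor of 2, the attention mechanism on the decoder side would attend over roughly a quarter as many encoder positions, which is the usual motivation for this kind of reduction on long audio sequences.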

