AS LG SD ML · Jul 15, 2019

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius

arXiv:1907.07769v14.33 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses voice conversion for applications that require speaker adaptation with limited data. It is an incremental contribution that adapts existing sequence-to-sequence methods from related fields (NMT, TTS and ASR).

The paper tackles voice conversion with limited parallel data using a hierarchical sequence-to-sequence model that is first trained as an autoencoder on a large single-speaker dataset and then adapted to smaller multispeaker datasets. The system represents audio as mel spectrograms and uses a WaveNet vocoder to synthesize the output waveform. The abstract reports no concrete performance numbers.

We present a voice conversion solution based on recurrent sequence-to-sequence (seq2seq) modeling with deep neural networks (DNNs). Our solution takes advantage of recent advances in attention-based modeling in Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting, where <source, target> audio pairs are available. Our seq2seq architecture uses a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention-based architecture from recent TTS work. Because large multispeaker voice conversion databases suitable for training DNNs are scarce, we first train the network as an autoencoder on a large single-speaker dataset. The network is then adapted to the smaller multispeaker datasets that are available for voice conversion. In contrast with other voice conversion works that use $F_0$, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
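
The abstract does not say how the hierarchical encoder summarizes frames. One common reading, assumed here purely for illustration, is a pyramidal scheme in which each level concatenates adjacent frames and so halves the time resolution. The sketch below shows only that frame-stacking step on toy data; the function names `stack_adjacent_frames` and `hierarchical_summary` are hypothetical, and a real encoder would place a recurrent layer between levels.

```python
from typing import List

Frame = List[float]  # one mel frame: a list of mel-bin values


def stack_adjacent_frames(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate each run of `factor` consecutive frames into one wider frame."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    # Drop trailing frames that do not fill a complete group.
    usable = len(frames) - len(frames) % factor
    stacked: List[Frame] = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def hierarchical_summary(frames: List[Frame], levels: int = 2, factor: int = 2) -> List[Frame]:
    """Apply frame stacking `levels` times (assumed pyramidal hierarchy).

    In an actual encoder a recurrent layer would transform the frames
    between levels; it is omitted here to keep the sketch self-contained.
    """
    out = frames
    for _ in range(levels):
        out = stack_adjacent_frames(out, factor)
    return out


# Toy input: 8 frames with 3 mel bins each (values are placeholders, not real audio).
mel = [[float(t)] * 3 for t in range(8)]
summary = hierarchical_summary(mel, levels=2)
print(len(summary), len(summary[0]))
```

With two levels and a factor of 2, the toy sequence shrinks by a factor of 4 in time while each frame grows by the same factor in width, so the attention decoder has fewer encoder steps to attend over.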
