LGSDASMLApr 13, 2019

Unsupervised Singing Voice Conversion

arXiv:1904.06590v358 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of singing voice conversion for music production or entertainment, but it is incremental as it builds on existing unsupervised methods.

The paper tackles the problem of converting one singer's voice to another's without supervision, achieving natural-sounding results that are recognizable as the target singer.

We present a deep learning method for singing voice conversion. The proposed network is not conditioned on the text or on the notes, and it directly converts the audio of one singer to the voice of another. Training is performed without any form of supervision: no lyrics or any kind of phonetic features, no notes, and no matching samples between singers. The proposed network employs a single CNN encoder for all singers, a single WaveNet decoder, and a classifier that enforces the latent representation to be singer-agnostic. Each singer is represented by one embedding vector, which the decoder is conditioned on. In order to deal with relatively small datasets, we propose a new data augmentation scheme, as well as new training losses and protocols that are based on backtranslation. Our evaluation presents evidence that the conversion produces natural signing voices that are highly recognizable as the target singer.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes