AS | LG | SD | Apr 8, 2022

Karaoker: Alignment-free singing voice synthesis with speech training data

arXiv:2204.04127v2 | 3 citations | h-index: 20
Originality: Incremental advance
AI Analysis

The paper addresses data scarcity and alignment errors in singing voice synthesis, with applications such as music production and voice conversion. The contribution is incremental, since it builds on existing Tacotron-based architectures.

The paper tackles singing voice synthesis without requiring singing training data or time-alignment. It proposes Karaoker, a model trained on speech data that synthesizes singing voice and transfers style from source waveforms. The reported results are comparable to state-of-the-art methods in both objective and subjective evaluations.

Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
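
The abstract names a multi-task objective and a Wasserstein GAN scheme but gives no formulas, so the following is only a minimal sketch of how such terms are commonly combined. It assumes the standard Wasserstein formulation (critic loss E[D(fake)] - E[D(real)], generator loss -E[D(fake)]) and a plain weighted sum of task losses. The function names, task names, loss values, and weights are all hypothetical and are not taken from the paper.

```python
def _mean(values):
    """Arithmetic mean of a non-empty sequence of floats."""
    values = list(values)
    return sum(values) / len(values)


def wasserstein_critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss: E[D(fake)] - E[D(real)] (assumed form)."""
    return _mean(fake_scores) - _mean(real_scores)


def wasserstein_generator_loss(fake_scores):
    """Standard WGAN generator loss: -E[D(fake)] (assumed form)."""
    return -_mean(fake_scores)


def multitask_loss(task_losses, weights):
    """Weighted sum of per-task losses; both arguments are dicts keyed by task name."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in task_losses.items())


# Hypothetical per-batch loss values, for illustration only.
task_losses = {
    "mel_reconstruction": 0.50,        # usual text-to-speech objective
    "feature_reconstruction": 0.20,    # e.g. pitch, intensity, formants
    "classification": 0.10,
    "speaker_identification": 0.30,
    "adversarial": wasserstein_generator_loss([0.25, 0.75]),
}

# Hypothetical weights; the paper's actual weighting is not stated in the abstract.
weights = {
    "mel_reconstruction": 1.0,
    "feature_reconstruction": 1.0,
    "classification": 1.0,
    "speaker_identification": 1.0,
    "adversarial": 0.1,
}

total_loss = multitask_loss(task_losses, weights)
critic_loss = wasserstein_critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0])
```

In a real training loop the critic and the acoustic model would be updated in alternation, and the scores would come from a critic network rather than fixed lists.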

