AS LG SD | Apr 8, 2022

Karaoker: Alignment-free singing voice synthesis with speech training data

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

arXiv:2204.04127v22.33 citations | h-index: 20

Originality: Incremental advance

AI Analysis

The work addresses data scarcity and alignment errors in singing voice synthesis, with applications such as music production and voice conversion. It is an incremental advance because it builds on existing Tacotron-based architectures.

The paper tackles singing voice synthesis without requiring singing training data or time-alignment. It proposes Karaoker, a model trained on speech data that synthesizes singing voice and transfers style from source waveforms, and reports results comparable to state-of-the-art methods in objective and subjective evaluations.

Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
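
The abstract combines a multitask objective with a Wasserstein GAN scheme but gives no formulas or loss weights. As a rough illustration only, the sketch below shows the standard Wasserstein critic and generator losses and one plausible way to add them to a weighted sum of task losses. The function names, the task names, the weights, and the `adv_weight` parameter are all assumptions made for this example and are not taken from the paper.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Standard Wasserstein critic loss.

    It decreases as the critic scores real samples higher than generated ones.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_adv_loss(fake_scores):
    """Standard Wasserstein generator loss: push critic scores on fakes up."""
    return -fmean(fake_scores)


def total_generator_loss(task_losses, weights, fake_scores, adv_weight=1.0):
    """Weighted sum of task losses plus the adversarial term.

    task_losses and weights are dicts keyed by hypothetical task names;
    the weighting scheme is an assumption, not the paper's.
    """
    weighted = sum(weights[name] * loss for name, loss in task_losses.items())
    return weighted + adv_weight * generator_adv_loss(fake_scores)


if __name__ == "__main__":
    # Toy numbers, chosen only to exercise the functions.
    losses = {"tts": 2.0, "feature_reconstruction": 1.0, "speaker_id": 0.5}
    weights = {"tts": 1.0, "feature_reconstruction": 0.5, "speaker_id": 0.5}
    print(critic_loss([1.0, 3.0], [0.0, 1.0]))
    print(total_generator_loss(losses, weights, [0.0, 1.0]))
```

In a real training loop the scores would come from a critic network evaluated on batches of real and generated acoustic features, and the critic would also need a Lipschitz constraint (weight clipping or a gradient penalty), which this sketch omits.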
