AS LG SD · Oct 8, 2021

KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms

arXiv:2110.04005v1 · 6 citations
Originality: Incremental advance
AI Analysis

The paper addresses a less-studied problem in singing voice synthesis, with applications such as karaoke and music generation, but the advance is incremental: it builds on existing VQ-VAE and language model techniques.

The paper tackles score-free singing voice synthesis by proposing KaraSinger, a model that combines a VQ-VAE with a language model to generate singing from lyrics without a pre-defined musical score. In a listening test, it is rated highly for intelligibility, musicality, and overall quality.

In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention for learning a robust alignment between the input phoneme sequence and the output discrete code. We keep the architecture of both the VQ-VAE and LM light-weight for fast training and inference speed. We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers. The result of a listening test shows that KaraSinger achieves high scores in intelligibility, musicality, and the overall quality.
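As a rough illustration of the compression step the abstract describes, here is a minimal sketch of vector quantization: each frame vector is replaced by the index of its nearest codebook entry, giving a sequence of discrete codes. This is not the authors' implementation. The function names `quantize` and `dequantize`, the 2-D toy vectors, and the three-entry codebook are all assumptions made for illustration. In the actual model the vectors would come from a learned encoder over Mel-spectrograms, and the CTC loss and the language model are not shown here.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook entry."""
    codes = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, entry))
            for entry in codebook
        ]
        # The discrete code is the index of the closest entry.
        codes.append(min(range(len(codebook)), key=distances.__getitem__))
    return codes


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete code."""
    return [list(codebook[k]) for k in codes]


# Toy data (hypothetical): a 3-entry codebook and 4 "encoder output" frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2, 1]
```

In a setup like the one described, a sequence such as `codes` is what the language model would learn to predict from the lyrics, and a decoder would turn the looked-up codebook vectors back into a Mel-spectrogram.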

