SD LG AS, Feb 22, 2021

Anyone GAN Sing

arXiv:2102.11058v1
Originality: Incremental advance
AI Analysis

This work addresses audio synthesis for singing voices, but it is incremental as it builds on existing GAN methods like WGANSing.

The paper tackles singing voice synthesis by proposing a ConvLSTM-based GAN optimized with the Wasserstein loss. The model is evaluated using the Mel-Cepstral Distance metric and a subjective listening test with 18 participants.

The problem of audio synthesis has been increasingly solved using deep neural networks. With the introduction of Generative Adversarial Networks (GANs), another efficient and effective path has opened up to solve this problem. In this paper, we present a method to synthesize the singing voice of a person using a Convolutional Long Short-Term Memory (ConvLSTM) based GAN optimized using the Wasserstein loss function. Our work is inspired by WGANSing by Chandna et al. Our model takes consecutive frame-wise linguistic and frequency features, along with singer identity, as input and outputs vocoder features. We train the model on a dataset of 48 English songs sung and spoken by 12 non-professional singers. For inference, sequential blocks are concatenated using an overlap-add procedure. We test the model using the Mel-Cepstral Distance metric and a subjective listening test with 18 participants.
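The two signal-processing steps named in the abstract, overlap-add concatenation of output blocks and the Mel-Cepstral Distance (MCD), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the triangular cross-fade window, the hop size, and the choice to skip the 0th (energy) cepstral coefficient are all assumptions made for illustration. The MCD follows the commonly used form, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over time-aligned frames.

```python
import math


def overlap_add(blocks, hop):
    """Join equal-length blocks of feature frames, cross-fading the overlaps.

    blocks: list of blocks; each block is a list of frames (lists of floats).
    hop: frames between the starts of consecutive blocks (assumed parameter).
    """
    block_len = len(blocks[0])
    if not 0 < hop <= block_len:
        raise ValueError("hop must be in 1..block_len so blocks leave no gap")
    dim = len(blocks[0][0])
    total = hop * (len(blocks) - 1) + block_len
    acc = [[0.0] * dim for _ in range(total)]
    weight = [0.0] * total
    # Triangular window (an assumption); strictly positive at every position.
    window = [min(i + 1, block_len - i) for i in range(block_len)]
    for b, block in enumerate(blocks):
        start = b * hop
        for i, frame in enumerate(block):
            w = window[i]
            weight[start + i] += w
            for d in range(dim):
                acc[start + i][d] += w * frame[d]
    # Normalize by the summed window weights so constant input stays constant.
    return [[v / weight[t] for v in acc[t]] for t in range(total)]


def mel_cepstral_distance(ref, syn):
    """Mean MCD in dB over time-aligned frames, skipping coefficient 0."""
    if len(ref) != len(syn) or not ref:
        raise ValueError("inputs must be non-empty and have equal frame counts")
    k = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        sq = sum((a - b) ** 2 for a, b in zip(r[1:], s[1:]))
        total += k * math.sqrt(2.0 * sq)
    return total / len(ref)
```

With two 4-frame blocks and a hop of 2, `overlap_add` returns 6 frames, and the 2 overlapping frames are a weighted average of both blocks. A lower MCD indicates synthesized cepstra closer to the reference; identical inputs give exactly 0.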

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
