SD · AS · Aug 6, 2019

Adversarially Trained End-to-end Korean Singing Voice Synthesis System

arXiv:1908.01919v1 · 78 citations
AI Analysis

This work addresses singing voice synthesis for Korean. It is an incremental advance whose improvements are specific to that domain.

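The paper addresses the synthesis of Korean singing voice from lyrics and a symbolic melody. It proposes an end-to-end system built on three techniques: phonetic enhancement masking, local conditioning of text and pitch in the super-resolution network, and conditional adversarial training. Quantitative and qualitative evaluations are reported to confirm more accurate phonetic control and more realistic voice generation.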

In this paper, we propose an end-to-end Korean singing voice synthesis system that takes lyrics and a symbolic melody as input and uses the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear-spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied to generate implicit formant masks solely from the input text, which enables more accurate phonetic control of the singing voice. In addition, we show that the two other proposed methods (local conditioning of text and pitch, and conditional adversarial training) are crucial for realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
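The two-module data flow described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the single linear maps, the bin counts, the embedding sizes, the sigmoid mask, and the choice to apply the mask by elementwise multiplication are all assumptions made for illustration. All function and variable names are hypothetical, and conditional adversarial training is not shown. The sketch only demonstrates two structural points from the abstract: the mask is computed from the text alone, and the super-resolution stage receives text and pitch as per-frame (local) conditioning.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
N_MEL, N_LINEAR = 80, 513      # mel bins and linear-spectrogram bins
T = 20                         # number of frames in the toy example
D_TEXT, D_PITCH = 16, 8        # frame-aligned text and pitch embedding sizes


def mel_synthesis(text_emb, pitch_emb, w_mel, w_mask):
    """Toy stand-in for the mel-synthesis network.

    The mask is computed from the text embedding only, mirroring the
    abstract's "solely from the input text". Applying it by elementwise
    multiplication is an assumption of this sketch.
    """
    joint = np.concatenate([text_emb, pitch_emb], axis=1)   # (T, D_TEXT + D_PITCH)
    raw_mel = joint @ w_mel                                  # (T, N_MEL)
    mask = 1.0 / (1.0 + np.exp(-(text_emb @ w_mask)))        # (T, N_MEL), values in (0, 1)
    return raw_mel * mask, mask


def super_resolution(mel, text_emb, pitch_emb, w_sr):
    """Toy stand-in for the super-resolution network.

    Text and pitch are concatenated to every mel frame (local conditioning).
    Only the frequency axis is expanded here; any time-axis upsampling
    is left out of this sketch.
    """
    cond = np.concatenate([mel, text_emb, pitch_emb], axis=1)  # (T, N_MEL + D_TEXT + D_PITCH)
    return cond @ w_sr                                          # (T, N_LINEAR)


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(T, D_TEXT))
pitch_emb = rng.normal(size=(T, D_PITCH))
w_mel = rng.normal(size=(D_TEXT + D_PITCH, N_MEL))
w_mask = rng.normal(size=(D_TEXT, N_MEL))
w_sr = rng.normal(size=(N_MEL + D_TEXT + D_PITCH, N_LINEAR))

mel, mask = mel_synthesis(text_emb, pitch_emb, w_mel, w_mask)
linear_spec = super_resolution(mel, text_emb, pitch_emb, w_sr)
```

In this sketch, changing the pitch input alters the mel-spectrogram but leaves the mask untouched, which is the property the abstract attributes to phonetic enhancement masking.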
