SD, CL, AS, ML | Mar 21, 2022

WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses

arXiv:2203.10750v5 | 21 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses singing voice synthesis for Chinese music applications. It is an incremental advance built on targeted enhancements to the duration model, acoustic model, vocoder, and training data.

The paper tackles singing voice synthesis by developing WeSinger, a multi-singer Chinese system that uses data augmentation and auxiliary losses, achieving state-of-the-art performance on the Opencpop corpus with improved accuracy and naturalness.

In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several specific modules and techniques: 1) a deep bi-directional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) a Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) a 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; 4) a novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. To our knowledge, WeSinger is the first SVS system to adopt 24 kHz LPCNet and multi-singer pre-training simultaneously. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves state-of-the-art performance on the recent public Chinese singing corpus Opencpop (https://wenet.org.cn/opencpop/). Some synthesized singing samples are available online (https://zzw922cn.github.io/wesinger/).
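
The abstract names a multi-scale rhythm loss for the duration model but does not define it. As a minimal sketch of one plausible reading, the Python function below penalizes duration errors at three assumed scales: per phoneme, per note, and per sentence. The function name `multi_scale_rhythm_loss`, the squared-error form, the choice of scales, and the `weights` parameter are all illustrative assumptions, not the paper's formulation.

```python
from typing import Sequence, Tuple


def multi_scale_rhythm_loss(
    pred: Sequence[float],
    target: Sequence[float],
    note_ids: Sequence[int],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Hypothetical multi-scale duration loss (illustrative only).

    pred, target: per-phoneme durations, e.g. in frames.
    note_ids: index of the musical note each phoneme belongs to.
    weights: assumed weights for the phoneme, note, and sentence terms.
    """
    if not pred or not (len(pred) == len(target) == len(note_ids)):
        raise ValueError("inputs must be non-empty and of equal length")

    # Phoneme scale: mean squared error over individual phoneme durations.
    phone_term = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

    # Note scale: sum durations within each note, then compare the sums.
    pred_note, target_note = {}, {}
    for p, t, n in zip(pred, target, note_ids):
        pred_note[n] = pred_note.get(n, 0.0) + p
        target_note[n] = target_note.get(n, 0.0) + t
    note_term = sum(
        (pred_note[n] - target_note[n]) ** 2 for n in pred_note
    ) / len(pred_note)

    # Sentence scale: squared error of the total duration.
    sentence_term = (sum(pred) - sum(target)) ** 2

    w_phone, w_note, w_sentence = weights
    return w_phone * phone_term + w_note * note_term + w_sentence * sentence_term


# Two phonemes share note 0; their errors cancel at the note and sentence
# scales, so only the phoneme-scale term contributes.
example = multi_scale_rhythm_loss([3.0, 5.0, 4.0], [4.0, 4.0, 4.0], [0, 0, 1])
print(example)
```

In this sketch, the coarser terms tolerate small timing shifts inside a note while still penalizing drift in note length and total sentence length, which is one reason a singing duration model might combine several scales.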

