SD, LG, AS
Oct 22, 2019

Sequence-to-sequence Singing Synthesis Using the Feed-forward Transformer

arXiv:1910.09989v2
57 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient and robust singing synthesis for applications in music production and entertainment, representing an incremental improvement over existing methods.

The paper tackles the problem of singing synthesis without pre-aligned training data by proposing a sequence-to-sequence model using a feed-forward Transformer, which achieves faster inference and avoids exposure bias compared to an autoregressive baseline.

We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decoder based on a feed-forward variant of the Transformer model, a series of self-attention and convolutional layers refines the result of the initial alignment to reach the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the importance of the accuracy of the duration model.
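The initial alignment step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a duration model has already produced raw per-phoneme durations for one note, rescales them so they fill exactly the number of frames the score allots to that note, and then repeats each phoneme label once per frame. The function names `fit_durations_to_note` and `expand_to_frames`, the proportional rescaling, and the frame counts are all hypothetical choices made for illustration. In the paper, the feed-forward decoder would then refine such a frame-level sequence into acoustic features.

```python
def fit_durations_to_note(raw_durations, note_frames):
    """Rescale raw phoneme durations so they sum exactly to note_frames.

    Assumption: proportional rescaling with cumulative rounding. The paper
    only says the alignment is derived from a simple duration model and
    constrained by the musical score.
    """
    total = sum(raw_durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")

    # Round cumulative boundaries rather than individual durations, so
    # rounding errors cannot accumulate across phonemes.
    boundaries = []
    running = 0.0
    for d in raw_durations:
        running += d
        boundaries.append(round(running * note_frames / total))
    boundaries[-1] = note_frames  # the last phoneme ends with the note

    frames = []
    prev = 0
    for b in boundaries:
        frames.append(b - prev)
        prev = b
    return frames


def expand_to_frames(phonemes, frames):
    """Repeat each phoneme label once per frame assigned to it."""
    return [p for p, n in zip(phonemes, frames) for _ in range(n)]


# Hypothetical example: three phonemes sung on a note lasting 20 frames.
frames = fit_durations_to_note([2.0, 6.0, 2.0], 20)
aligned = expand_to_frames(["s", "a", "n"], frames)
```

Rounding cumulative boundaries keeps the total fixed at the note length, which matches the idea that the score tightly constrains timing. A very short phoneme can receive zero frames under this scheme, so a real system would likely enforce a minimum duration.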

