AS CL LG SD · Dec 12, 2019

Singing Synthesis: with a little help from my attention

arXiv:1912.05881v2 · 17 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating natural-sounding singing voices for music production and entertainment. It is an incremental advance that adapts text-to-speech techniques to singing synthesis.

The paper tackles singing synthesis by proposing UTACO, an attention-based sequence-to-sequence model with a dilated causal convolution vocoder, which improves naturalness over state-of-the-art neural models without requiring explicit modeling of voice features like F0 patterns and durations.

We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis field and improves naturalness over the state of the art. The system requires considerably less explicit modelling of voice features such as F0 patterns, vibratos, and note and phoneme durations than previous models in the literature. Despite this, it shows a strong improvement in naturalness with respect to previous neural singing synthesis models. The model does not require any durations or pitch patterns as inputs, and learns to insert vibrato autonomously according to the musical context. However, we observe that, by completely dispensing with any explicit duration modelling, it becomes harder to obtain the fine control of timing needed to exactly match the tempo of a song.
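
The abstract names dilated causal convolutions as the basis of the vocoder without describing them. As a rough illustration only, the sketch below shows the generic operation in NumPy: each output sample depends only on the current and earlier input samples, spaced `dilation` steps apart. This is not the UTACO vocoder; the function names `dilated_causal_conv1d` and `receptive_field`, the single-channel setup, and the filter values are all assumptions made for the example.

```python
import numpy as np


def dilated_causal_conv1d(x, weights, dilation):
    """Single-channel dilated causal convolution (illustrative sketch).

    Computes y[t] = sum_k weights[k] * x[t - k * dilation],
    treating samples before the start of the signal as zero.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Left-pad so that every output sample only sees the past.
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        start = pad - k * dilation  # aligns padded[start + t] with x[t - k * dilation]
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output of a stacked network."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    # With weights [1, 1] and dilation 2, y[t] = x[t] + x[t - 2].
    print(dilated_causal_conv1d(signal, [1.0, 1.0], dilation=2))
    # Doubling the dilation per layer grows the receptive field exponentially.
    print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Stacking such layers with increasing dilation is what lets this family of models cover long stretches of audio with few layers, which is why the receptive-field helper is included alongside the convolution itself.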

