ASAI, Apr 14, 2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

arXiv:2404.09313v3, 19 citations, h-index: 29, ACL
AI Analysis

This work addresses the lack of integrated song synthesis for music creators, though the contribution is incremental because it builds on existing singing voice synthesis and music generation techniques.

The paper tackles the problem of generating complete songs with both vocals and accompaniment from text, proposing a two-stage method called Melodist that achieves comparable quality and style consistency on a new Chinese song dataset.

A song is a combination of singing voice and accompaniment. However, existing works treat singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. To alleviate data scarcity, we build a Chinese song dataset mined from a music website. Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
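The abstract does not spell out the tri-tower contrastive objective, so the following is only a minimal sketch of one common way such a loss can be formed: a symmetric InfoNCE term for each pair of towers, averaged. The assumption that the three towers encode text, vocals, and accompaniment, the function names, and the temperature value are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.1):
    # Symmetric InfoNCE between two batches of paired embeddings.
    # Row i of `a` and row i of `b` are treated as the positive pair;
    # every other row in the batch acts as a negative.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # shape: (batch, batch)

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def tri_tower_loss(text_emb, vocal_emb, accomp_emb, temperature=0.1):
    # Hypothetical tri-tower objective: average of the three pairwise losses.
    # Which modalities the towers encode is an assumption, not from the paper.
    return (info_nce(text_emb, vocal_emb, temperature)
            + info_nce(text_emb, accomp_emb, temperature)
            + info_nce(vocal_emb, accomp_emb, temperature)) / 3.0
```

Under these assumptions, the loss is small when the three embeddings of the same song agree and grows when they are mismatched, which is the property a contrastively pretrained text encoder would rely on when conditioning V2A synthesis.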

