SD AI Jun 5

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

arXiv:2606.07015 38.4
Originality Highly original
AI Analysis

This work addresses the isolation between song generation and singing voice conversion (SVC) by providing a unified framework that enables cross-task timbre control and vocal-accompaniment synergy, benefiting music production and AI research.

UniSinger unifies song generation and singing voice conversion (SVC) in a single end-to-end framework, enabling zero-shot speaker cloning and accompaniment co-generation. It achieves state-of-the-art performance on both tasks, with complementary benefits for intelligent music production.

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying song generation with speaker cloning and SVC with accompaniment co-generation. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and realizes complementary benefits between them, offering new possibilities for intelligent music production.
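
The abstract gives no implementation details for the curriculum, so the following is only a minimal sketch of what stage-wise, task-specific modality masking could look like. The stage order, the per-task hidden sets, and the names `STAGES`, `TASK_HIDDEN`, `visible_modalities`, and `apply_mask` are all hypothetical placeholders, not taken from the paper.

```python
from typing import Dict, List

# The three factors named in the abstract.
MODALITIES = ("content", "timbre", "accompaniment")

# ASSUMPTION: a three-stage curriculum that introduces one modality at a time.
# The actual stage order and count are not given in the abstract.
STAGES = (
    ("content",),
    ("content", "timbre"),
    ("content", "timbre", "accompaniment"),
)

# ASSUMPTION: illustrative per-task sets of modalities that stay masked.
# These are placeholders chosen only to show the mechanism.
TASK_HIDDEN = {
    "song_generation": {"accompaniment"},
    "svc": set(),
}


def visible_modalities(task: str, stage: int) -> Dict[str, bool]:
    """Return which modalities are unmasked for a task at a curriculum stage."""
    if task not in TASK_HIDDEN:
        raise ValueError(f"unknown task: {task}")
    # Clamp the stage index so late training steps reuse the final stage.
    stage = max(0, min(stage, len(STAGES) - 1))
    introduced = STAGES[stage]
    hidden = TASK_HIDDEN[task]
    return {m: (m in introduced and m not in hidden) for m in MODALITIES}


def apply_mask(
    features: Dict[str, List[float]], mask: Dict[str, bool]
) -> Dict[str, List[float]]:
    """Zero out every feature vector whose modality is masked."""
    return {
        name: (list(vec) if mask[name] else [0.0] * len(vec))
        for name, vec in features.items()
    }


if __name__ == "__main__":
    feats = {"content": [1.0, 2.0], "timbre": [3.0], "accompaniment": [4.0]}
    for stage in range(len(STAGES)):
        print(stage, apply_mask(feats, visible_modalities("svc", stage)))
```

In this sketch a masked modality is replaced by zeros; a real model might instead use a learned mask token or drop the stream entirely, which the abstract does not specify.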
