SD | AI | Nov 26, 2025

Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

arXiv:2511.21342v1 | 1 citation | h-index: 5 | WASPAA
Originality: Incremental advance
AI Analysis

The paper supports music analysis and practice by providing a flexible, user-controllable method for separating singing vocals. The advance is incremental, as it builds on prior generative systems.

The paper tackles singing voice separation from real music recordings by training a diffusion model to generate solo vocals conditioned on the mixture, achieving competitive objective scores against non-generative baselines when trained with supplementary data.

Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
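The quality-efficiency trade-off mentioned in the abstract can be illustrated with a toy sampler. The sketch below is not the paper's model or sampling algorithm: it assumes a plain Euler integrator over a noise schedule, and it replaces the trained network with a closed-form denoiser for a made-up Gaussian "vocal" prior centred on half the mixture. The names `toy_denoiser`, `sample_vocals`, `exact_solution`, `PRIOR_STD`, and the 0.5 factor are all hypothetical. Because the toy prior has a known exact solution, the effect of the step count on accuracy can be measured directly.

```python
import math
import random

PRIOR_STD = 0.5  # assumed spread of the toy "vocal" prior (hypothetical)


def toy_denoiser(x, sigma, mixture):
    """Stand-in for a trained network D(x, sigma | mixture).

    Assumes, purely for illustration, that vocals given the mixture are
    Gaussian with mean 0.5 * mixture and std PRIOR_STD; the optimal
    denoiser for that prior has this closed form.
    """
    s2, n2 = PRIOR_STD ** 2, sigma ** 2
    return [(s2 * xi + n2 * 0.5 * mi) / (s2 + n2) for xi, mi in zip(x, mixture)]


def sample_vocals(mixture, num_steps, sigma_max=10.0, seed=0):
    """Euler sampler: integrate from sigma_max down to 0 in num_steps steps."""
    rng = random.Random(seed)
    x = [sigma_max * rng.gauss(0.0, 1.0) for _ in mixture]  # start from noise
    sigmas = [sigma_max * (1 - i / num_steps) for i in range(num_steps + 1)]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoiser(x, s_cur, mixture)
        # Euler step along dx/dsigma = (x - D) / sigma
        x = [xi + (s_next - s_cur) * (xi - di) / s_cur
             for xi, di in zip(x, denoised)]
    return x


def exact_solution(mixture, sigma_max=10.0, seed=0):
    """Closed-form endpoint of the same ODE, valid only for the toy prior."""
    rng = random.Random(seed)  # same seed, so the same starting noise
    scale = PRIOR_STD / math.sqrt(PRIOR_STD ** 2 + sigma_max ** 2)
    return [0.5 * m + (sigma_max * rng.gauss(0.0, 1.0) - 0.5 * m) * scale
            for m in mixture]


def max_error(mixture, num_steps):
    """Largest per-sample gap between the Euler output and the exact endpoint."""
    ref = exact_solution(mixture)
    out = sample_vocals(mixture, num_steps)
    return max(abs(a - b) for a, b in zip(out, ref))


mixture = [math.sin(0.3 * n) for n in range(64)]  # toy "mixture" signal
coarse = max_error(mixture, num_steps=5)    # fast, larger discretisation error
fine = max_error(mixture, num_steps=200)    # slower, closer to the exact endpoint
```

Here `num_steps` plays the role of a user-configurable parameter: each step costs one denoiser call, so fewer steps run faster but land further from the exact endpoint. In a real system the denoiser would be a neural network conditioned on the mixture, and no closed-form reference would be available.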
