Improving Music Source Separation with Diffusion and Consistency Refinement
For practitioners of music source separation, this work offers a plug-and-play refinement that boosts performance of existing separators with minimal inference cost.
The paper proposes a diffusion-based refinement stage that improves the output quality of music source separators, then applies consistency distillation to cut inference to a single step without losing quality (and to surpass the diffusion model with two or more steps), achieving state-of-the-art results on Slakh2100 and MUSDB18.
In this work, we propose an approach to music source separation that uses a generative diffusion model as a last-stage refinement on top of a deterministic separator, progressively enhancing the separated sources through iterative denoising. While the diffusion refinement yields measurable quality gains, it requires multiple denoising steps at inference, which increases computational cost. To speed up inference, we apply consistency distillation, reducing it to a single step while maintaining quality; with two or more steps, the distilled model even surpasses the diffusion-based approach. Crucially, our method is architecture-agnostic: we obtain state-of-the-art results when applying it to both a custom U-Net-based separator on Slakh2100 and the leading BS-RoFormer model on MUSDB18, showing that the refinement generalizes across backbone architectures. Sound examples are available at: https://consistency-separation.github.io/.
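
To make the single-step versus multi-step distinction concrete, the following is a minimal sketch of generic multistep consistency sampling applied on top of a separator estimate. It is not the paper's implementation: `toy_consistency_fn` is a hypothetical stand-in for a trained, distilled network, and the noise levels in `sigmas`, the value of `sigma_data`, and the choice to start from a noised copy of the separator output are all assumptions made for illustration.

```python
import random


def toy_consistency_fn(x_noisy, sigma, separator_estimate, sigma_data=0.5):
    """Hypothetical stand-in for a trained consistency model f(x, sigma | cond).

    It blends the noisy input with the separator estimate. The skip weight
    equals 1 at sigma = 0, so the boundary condition f(x, 0) = x holds.
    """
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    return [
        c_skip * xn + (1.0 - c_skip) * s
        for xn, s in zip(x_noisy, separator_estimate)
    ]


def refine(separator_estimate, sigmas, seed=0, consistency_fn=toy_consistency_fn):
    """Multistep consistency sampling over decreasing noise levels `sigmas`.

    One entry in `sigmas` gives single-step refinement; each extra entry
    re-noises the current clean estimate and denoises it again.
    """
    rng = random.Random(seed)
    # Assumption: start from the separator output perturbed at the highest noise level.
    x = [s + sigmas[0] * rng.gauss(0.0, 1.0) for s in separator_estimate]
    x0 = x
    for i, sigma in enumerate(sigmas):
        x0 = consistency_fn(x, sigma, separator_estimate)  # jump to a clean estimate
        if i + 1 < len(sigmas):
            # Re-noise at the next, lower level before the following step.
            x = [v + sigmas[i + 1] * rng.gauss(0.0, 1.0) for v in x0]
    return x0


# Toy "waveform" standing in for one separated source from a deterministic separator.
estimate = [0.0, 0.5, -0.25, 1.0]
one_step = refine(estimate, sigmas=[1.0])        # single network evaluation
two_step = refine(estimate, sigmas=[1.0, 0.3])   # two network evaluations
```

The cost of refinement scales with the length of `sigmas`, one model call per entry, which is why distilling to one or two steps matters relative to an iterative diffusion sampler. The toy function carries no learned prior, so this sketch illustrates only the control flow, not any quality gain.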