Stylus: Repurposing Stable Diffusion for Training-Free Music Style Transfer on Mel-Spectrograms
Stylus addresses personalized music creation with a training-free, efficient method, though the contribution is incremental because it builds on existing diffusion models.
The paper tackles music style transfer by repurposing Stable Diffusion for training-free mel-spectrogram manipulation, achieving 34.1% higher content preservation and 25.7% better perceptual quality compared to state-of-the-art baselines.
Music style transfer enables personalized music creation by blending the structure of a source with the stylistic attributes of a reference. Existing text-conditioned and diffusion-based approaches show promise but often require paired datasets, extensive training, or detailed annotations. We present Stylus, a training-free framework that repurposes a pre-trained Stable Diffusion model for music style transfer in the mel-spectrogram domain. Stylus manipulates self-attention by injecting style key-value features while preserving source queries to maintain musical structure. To improve fidelity, we introduce a phase-preserving reconstruction strategy that avoids artifacts from Griffin-Lim reconstruction, and we adopt classifier-free-guidance-inspired control for adjustable stylization and multi-style blending. In extensive evaluations, Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality without any additional training.
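The attention manipulation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes single-head attention over flattened spectrogram tokens and an extrapolation-style guidance formula, which the abstract does not specify. The function names (`attention`, `stylized_attention`, `blend_styles`) and the `scale` and `weights` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention for one head.
    # q: (n_query, d), k: (n_key, d), v: (n_key, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def stylized_attention(q_src, k_src, v_src, k_sty, v_sty, scale=1.0):
    # Source queries are kept; keys and values come from the style reference.
    # ASSUMPTION: a classifier-free-guidance-like mix of the two outputs.
    # scale=0 returns the plain source output, scale=1 the fully injected one.
    content = attention(q_src, k_src, v_src)
    styled = attention(q_src, k_sty, v_sty)
    return content + scale * (styled - content)


def blend_styles(q_src, k_src, v_src, styles, weights):
    # ASSUMPTION: multi-style blending as a weighted sum of per-style offsets.
    # styles: list of (k_sty, v_sty) pairs; weights: one float per style.
    content = attention(q_src, k_src, v_src)
    out = content.copy()
    for (k_sty, v_sty), w in zip(styles, weights):
        out += w * (attention(q_src, k_sty, v_sty) - content)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
    k_a, v_a = rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
    k_b, v_b = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    half = stylized_attention(q, k, v, k_a, v_a, scale=0.5)
    mixed = blend_styles(q, k, v, [(k_a, v_a), (k_b, v_b)], [0.6, 0.4])
    print(half.shape, mixed.shape)
```

Because the queries always come from the source, the output keeps one row per source token, regardless of how many tokens the style reference contributes. That is the sense in which this kind of injection can preserve the source layout while borrowing style features.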