SDSep 14, 2024
Prevailing Research Areas for Music AI in the Era of Foundation ModelsMegan Wei, Mateusz Modrzejewski, Aswin Sivaraman et al. · mit
Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
SDOct 27, 2024Code
Symbotunes: unified hub for symbolic music generative modelsPaweł Skierś, Maksymilian Łazarski, Michał Kopeć et al.
Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.
SDFeb 12
TADA! Tuning Audio Diffusion Models through Activation SteeringŁukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski et al.
Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
10.4LGMay 6
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion ModelsAdam Szokalski, Mateusz Modrzejewski
This paper proposes latent visualization by optimization (LVO), a mechanistic interpretability technique that extends feature visualization by optimization - originally developed for convolutional neural networks - to latent diffusion models. LVO employs sparse autoencoders (SAEs) to disentangle polysemantic layer representations into monosemantic features. Key contributions include latent-space optimization, time-step activity analysis, schedule-matched noise injection, prior initialization through feature steering, and suitable regularization strategies. We demonstrate the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset, showing that SAE features produce clear visualizations of recognizable concepts - including diagonal compositions, human figures, roses, cables, and waterfall foam - that correlate with dataset examples, while the baseline without disentanglement produces less coherent results. We further show that regularization techniques from pixel-space feature visualization transfer to the latent domain, though they require different configurations for the raw-layer and SAE variants. Compared to dataset examples and steering, LVO provides complementary insights by directly revealing what activates a feature rather than its downstream effects.
SDJan 20
ConceptCaps -- a Distilled Concept Dataset for Interpretability in Music ModelsBruno Sienkiewicz, Łukasz Neumann, Mateusz Modrzejewski
Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
SDDec 10, 2024
Frechet Music Distance: A Metric For Generative Symbolic Music EvaluationJan Retkowski, Jakub Stępniak, Mateusz Modrzejewski
In this paper we introduce the Frechet Music Distance (FMD), a novel evaluation metric for generative symbolic music models, inspired by the Frechet Inception Distance (FID) in computer vision and Frechet Audio Distance (FAD) in generative audio. FMD calculates the distance between distributions of reference and generated symbolic music embeddings, capturing abstract musical features. We validate FMD across several datasets and models. Results indicate that FMD effectively differentiates model quality, providing a domain-specific metric for evaluating symbolic music generation, and establishing a reproducible standard for future research in symbolic music modeling.
SDJul 7, 2025
Evaluating Fake Music Detection Performance Under Audio AugmentationsTomasz Sroka, Tomasz Wężowicz, Dominik Sidorczuk et al.
With the rapid advancement of generative audio models, distinguishing between human-composed and generated music is becoming increasingly challenging. As a response, models for detecting fake music have been proposed. In this work, we explore the robustness of such systems under audio augmentations. To evaluate model generalization, we constructed a dataset consisting of both real and synthetic music generated using several systems. We then apply a range of audio transformations and analyze how they affect classification accuracy. We test the performance of a recent state-of-the-art musical deepfake detection model in the presence of audio augmentations. The performance of the model decreases significantly even with the introduction of light augmentations.
SDOct 27, 2024
MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic musicMichał Wiszenko, Kacper Stefański, Piotr Malesa et al.
Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.
SDOct 25, 2024
CloserMusicDB: A Modern Multipurpose Dataset of High Quality MusicAleksandra Piekarzewicz, Tomasz Sroka, Aleksander Tym et al.
In this paper, we introduce CloserMusicDB, a collection of full length studio quality tracks annotated by a team of human experts. We describe the selected qualities of our dataset, along with three example tasks possible to perform using this dataset: hook detection, contextual tagging and artist identification. We conduct baseline experiments and provide initial benchmarks for these tasks.