SDAug 5, 2024
Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility EstimationAlain Riou, Stefan Lattner, Gaëtan Hadjeres et al.
This paper explores the automated process of determining stem compatibility by identifying audio recordings of single instruments that blend well with a given musical context. To tackle this challenge, we present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset using a self-supervised learning approach. Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems from the embeddings of a given context, typically a mix of several instruments. Training a model in this manner allows its use in estimating stem compatibility - retrieving, aligning, or generating a stem to match a given mix - or for downstream tasks such as genre or key estimation, as the training paradigm requires the model to learn information related to timbre, harmony, and rhythm. We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix and through a subjective user study. We also show that the learned embeddings capture temporal alignment information and, finally, evaluate the representations learned by our model on several downstream tasks, highlighting that they effectively capture meaningful musical features.
SDFeb 16
S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline QuantizationZineb Lahrichi, Gaëtan Hadjeres, Gaël Richard et al.
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
SDApr 15, 2021Code
Spectrogram Inpainting for Interactive Generation of Instrument SoundsThéis Bazin, Gaëtan Hadjeres, Philippe Esling et al.
Modern approaches to sound synthesis using deep neural networks are hard to control, especially when fine-grained conditioning information is not available, hindering their adoption by musicians. In this paper, we cast the generation of individual instrumental notes as an inpainting-based task, introducing novel and unique ways to iteratively shape sounds. To this end, we propose a two-step approach: first, we adapt the VQ-VAE-2 image generation architecture to spectrograms in order to convert real-valued spectrograms into compact discrete codemaps, we then implement token-masked Transformers for the inpainting-based generation of these codemaps. We apply the proposed architecture on the NSynth dataset on masked resampling tasks. Most crucially, we open-source an interactive web interface to transform sounds by inpainting, for artists and practitioners alike, opening up to new, creative uses.
SDAug 2, 2025
PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant ObjectiveAlain Riou, Bernardo Torres, Ben Hayes et al.
In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performances while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
SDMay 14, 2024
Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation LearningAlain Riou, Stefan Lattner, Gaëtan Hadjeres et al.
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
SDNov 29, 2024
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive ArchitecturesAlain Riou, Antonin Gagneré, Gaëtan Hadjeres et al.
In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we discover that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performances of our model using the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
SDJul 13, 2021
The Piano Inpainting ApplicationGaëtan Hadjeres, Léopold Crestel
Autoregressive models are now capable of generating high-quality minute-long expressive MIDI piano performances. Even though this progress suggests new tools to assist music composition, we observe that generative algorithms are still not widely used by artists due to the limited control they offer, prohibitive inference times or the lack of integration within musicians' workflows. In this work, we present the Piano Inpainting Application (PIA), a generative model focused on inpainting piano performances, as we believe that this elementary operation (restoring missing parts of a piano performance) encourages human-machine interaction and opens up new ways to approach music composition. Our approach relies on an encoder-decoder Linear Transformer architecture trained on a novel representation for MIDI piano performances termed Structured MIDI Encoding. By uncovering an interesting synergy between Linear Transformers and our inpainting task, we are able to efficiently inpaint contiguous regions of a piano performance, which makes our model suitable for interactive and responsive A.I.-assisted composition. Finally, we introduce our freely-available Ableton Live PIA plugin, which allows musicians to smoothly generate or modify any MIDI clip using PIA within a widely-used professional Digital Audio Workstation.
SDJun 14, 2021
CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound SynthesisSimon Rouard, Gaëtan Hadjeres
In this paper, we propose a novel score-base generative model for unconditional raw audio synthesis. Our proposal builds upon the latest developments on diffusion process modeling with stochastic differential equations, which already demonstrated promising results on image generation. We motivate novel heuristics for the choice of the diffusion processes better suited for audio generation, and consider the use of a conditional U-Net to approximate the score function. While previous approaches on diffusion models on audio were mainly designed as speech vocoders in medium resolution, our method termed CRASH (Controllable Raw Audio Synthesis with High-resolution) allows us to generate short percussive sounds in 44.1kHz in a controllable way. Through extensive experiments, we showcase on a drum sound generation task the numerous sampling schemes offered by our method (unconditional generation, deterministic generation, inpainting, interpolation, variations, class-conditional sampling) and propose the class-mixing sampling, a novel way to generate "hybrid" sounds. Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities with lighter and easier-to-train models.
SDJun 23, 2020
Incorporating Music Knowledge in Continual Dataset Augmentation for Music GenerationAlisa Liu, Alexander Fang, Gaëtan Hadjeres et al.
Deep learning has rapidly become the state-of-the-art approach for music generation. However, training a deep model typically requires a large training set, which is often not available for specific musical styles. In this paper, we present augmentative generation (Aug-Gen), a method of dataset augmentation for any music generation system trained on a resource-constrained domain. The key intuition of this method is that the training data for a generative system can be augmented by examples the system produces during the course of training, provided these examples are of sufficiently high quality and variety. We apply Aug-Gen to Transformer-based chorale generation in the style of J.S. Bach, and show that this allows for longer training and results in better generative output.
ASApr 21, 2020
Vector Quantized Contrastive Predictive Coding for Template-based Music GenerationGaëtan Hadjeres, Léopold Crestel
In this work, we propose a flexible method for generating variations of discrete sequences in which tokens can be grouped into basic units, like sentences in a text or bars in music. More precisely, given a template sequence, we aim at producing novel sequences sharing perceptible similarities with the original template without relying on any annotation; so our problem of generating variations is intimately linked to the problem of learning relevant high-level representations without supervision. Our contribution is two-fold: First, we propose a self-supervised encoding technique, named Vector Quantized Contrastive Predictive Coding which allows to learn a meaningful assignment of the basic units over a discrete set of codes, together with mechanisms allowing to control the information content of these learnt discrete representations. Secondly, we show how these compressed representations can be used to generate variations of a template sequence by using an appropriate attention pattern in the Transformer architecture. We illustrate our approach on the corpus of J.S. Bach chorales where we discuss the musical meaning of the learnt discrete codes and show that our proposed method allows to generate coherent and high-quality variations of a given template.
LGFeb 19, 2020
Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distancesGaëtan Hadjeres, Frank Nielsen
Distances between probability distributions that take into account the geometry of their sample space,like the Wasserstein or the Maximum Mean Discrepancy (MMD) distances have received a lot of attention in machine learning as they can, for instance, be used to compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that allows one to consider a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions. These distances, derived from the concave Rao's quadratic entropy, enjoy nice theoretical properties and possess interpretable hyperparameters which can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances and we illustrate its efficiency on a broad range of machine learning tasks such as density estimation, generative modeling and mixture simplification.
ITSep 19, 2019
A note on the quasiconvex Jensen divergences and the quasiconvex Bregman divergences derived thereofFrank Nielsen, Gaëtan Hadjeres
We first introduce the class of strictly quasiconvex and strictly quasiconcave Jensen divergences which are oriented (asymmetric) distances, and study some of their properties. We then define the strictly quasiconvex Bregman divergences as the limit case of scaled and skewed quasiconvex Jensen divergences, and report a simple closed-form formula which shows that these divergences are only pseudo-divergences at countably many inflection points of the generators. To remedy this problem, we propose the $δ$-averaged quasiconvex Bregman divergences which integrate the pseudo-divergences over a small neighborhood in order obtain a proper divergence. The formula of $δ$-averaged quasiconvex Bregman divergences extend even to non-differentiable strictly quasiconvex generators. These quasiconvex Bregman divergences between distinct elements have the property to always have one orientation finite while the other orientation is infinite. We show that these quasiconvex Bregman divergences can also be interpreted as limit cases of generalized skewed Jensen divergences with respect to comparative convexity by using power means. Finally, we illustrate how these quasiconvex Bregman divergences naturally appear as equivalent divergences for the Kullback-Leibler divergences between probability densities belonging to a same parametric family of distributions with nested supports.
HCJul 23, 2019
NONOTO: A Model-agnostic Web Interface for Interactive Music Composition by InpaintingThéis Bazin, Gaëtan Hadjeres
Inpainting-based generative modeling allows for stimulating human-machine interactions by letting users perform stylistically coherent local editions to an object using a statistical model. We present NONOTO, a new interface for interactive music generation based on inpainting models. It is aimed both at researchers, by offering a simple and flexible API allowing them to connect their own models with the interface, and at musicians by providing industry-standard features such as audio playback, real-time MIDI output and straightforward synchronization with DAWs using Ableton Link.
SDJul 4, 2019
Neural Drum Machine : An Interactive System for Real-time Synthesis of Drum SoundsCyran Aouameur, Philippe Esling, Gaëtan Hadjeres
In this work, we introduce a system for real-time generation of drum sounds. This system is composed of two parts: a generative model for drum sounds together with a Max4Live plugin providing intuitive controls on the generative process. The generative model consists of a Conditional Wasserstein autoencoder (CWAE), which learns to generate Mel-scaled magnitude spectrograms of short percussion samples, coupled with a Multi-Head Convolutional Neural Network (MCNN) which estimates the corresponding audio signal from the magnitude spectrogram. The design of this model makes it lightweight, so that it allows one to perform real-time generation of novel drum sounds on an average CPU, removing the need for the users to possess dedicated hardware in order to use this system. We then present our Max4Live interface designed to interact with this generative model. With this setup, the system can be easily integrated into a studio-production environment and enhance the creative process. Finally, we discuss the advantages of our system and how the interaction of music producers with such tools could change the way drum tracks are composed.
LGJul 2, 2019
Learning to Traverse Latent Spaces for Musical Score InpaintingAshis Pati, Alexander Lerch, Gaëtan Hadjeres
Music Inpainting is the task of filling in missing or lost information in a piece of music. We investigate this task from an interactive music creation perspective. To this end, a novel deep learning-based approach for musical score inpainting is proposed. The designed model takes both past and future musical context into account and is capable of suggesting ways to connect them in a musically meaningful manner. To achieve this, we leverage the representational power of the latent space of a Variational Auto-Encoder and train a Recurrent Neural Network which learns to traverse this latent space conditioned on the past and future musical contexts. Consequently, the designed model is capable of generating several measures of music to connect two musical excerpts. The capabilities and performance of the model are showcased by comparison with competitive baselines using several objective and subjective evaluation methods. The results show that the model generates meaningful inpaintings and can be used in interactive music creation applications. Overall, the method demonstrates the merit of learning complex trajectories in the latent spaces of deep generative models.
ITMar 14, 2019
On power chi expansions of $f$-divergencesFrank Nielsen, Gaëtan Hadjeres
We consider both finite and infinite power chi expansions of $f$-divergences derived from Taylor's expansions of smooth generators, and elaborate on cases where these expansions yield closed-form formula, bounded approximations, or analytic divergence series expressions of $f$-divergences.
LGJan 11, 2019
Variation Network: Learning High-level Attributes for Controlled Input ManipulationGaëtan Hadjeres, Frank Nielsen
This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also be easily considered at the same time, which makes this model applicable to a wide variety of tasks. Further, VarNet has a sound information-theoretic interpretation which grants us with interpretable means to control how these high-level attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and meaningful.
LGMar 20, 2018
Monte Carlo Information Geometry: The dually flat caseFrank Nielsen, Gaëtan Hadjeres
Exponential families and mixture families are parametric probability models that can be geometrically studied as smooth statistical manifolds with respect to any statistical divergence like the Kullback-Leibler (KL) divergence or the Hellinger divergence. When equipping a statistical manifold with the KL divergence, the induced manifold structure is dually flat, and the KL divergence between distributions amounts to an equivalent Bregman divergence on their corresponding parameters. In practice, the corresponding Bregman generators of mixture/exponential families require to perform definite integral calculus that can either be too time-consuming (for exponentially large discrete support case) or even do not admit closed-form formula (for continuous support case). In these cases, the dually flat construction remains theoretical and cannot be used by information-geometric algorithms. To bypass this problem, we consider performing stochastic Monte Carlo (MC) estimation of those integral-based mixture/exponential family Bregman generators. We show that, under natural assumptions, these MC generators are almost surely Bregman generators. We define a series of dually flat information geometries, termed Monte Carlo Information Geometries, that increasingly-finely approximate the untractable geometry. The advantage of this MCIG is that it allows a practical use of the Bregman algorithmic toolbox on a wide range of probability distribution families. We demonstrate our approach with a clustering task on a mixture family manifold.
AISep 19, 2017
Interactive Music Generation with Positional Constraints using Anticipation-RNNsGaëtan Hadjeres, Frank Nielsen
Recurrent Neural Networks (RNNS) are now widely used on sequence generation tasks due to their ability to learn long-range dependencies and to generate sequences of arbitrary length. However, their left-to-right generation procedure only allows a limited control from a potential user which makes them unsuitable for interactive and creative usages such as interactive music generation. This paper introduces a novel architecture called Anticipation-RNN which possesses the assets of the RNN-based generative models while allowing to enforce user-defined positional constraints. We demonstrate its efficiency on the task of generating melodies satisfying positional constraints in the style of the soprano parts of the J.S. Bach chorale harmonizations. Sampling using the Anticipation-RNN is of the same order of complexity than sampling from the traditional RNN model. This fast and interactive generation of musical sequences opens ways to devise real-time systems that could be used for creative purposes.
SDSep 5, 2017
Deep Learning Techniques for Music Generation -- A SurveyJean-Pierre Briot, Gaëtan Hadjeres, François-David Pachet
This paper is a survey and an analysis of different ways of using deep learning (deep artificial neural networks) to generate musical content. We propose a methodology based on five dimensions for our analysis: Objective - What musical content is to be generated? Examples are: melody, polyphony, accompaniment or counterpoint. - For what destination and for what use? To be performed by a human(s) (in the case of a musical score), or by a machine (in the case of an audio file). Representation - What are the concepts to be manipulated? Examples are: waveform, spectrogram, note, chord, meter and beat. - What format is to be used? Examples are: MIDI, piano roll or text. - How will the representation be encoded? Examples are: scalar, one-hot or many-hot. Architecture - What type(s) of deep neural network is (are) to be used? Examples are: feedforward network, recurrent network, autoencoder or generative adversarial networks. Challenge - What are the limitations and open challenges? Examples are: variability, interactivity and creativity. Strategy - How do we model and control the process of generation? Examples are: single-step feedforward, iterative feedforward, sampling or input manipulation. For each dimension, we conduct a comparative analysis of various models and techniques and we propose some tentative multidimensional typology. This typology is bottom-up, based on the analysis of many existing deep-learning based systems for music generation selected from the relevant literature. These systems are described and are used to exemplify the various choices of objective, representation, architecture, challenge and strategy. The last section includes some discussion and some prospects.
IRSep 3, 2017
Deep rank-based transposition-invariant distances on musical sequencesGaëtan Hadjeres, Frank Nielsen
Distances on symbolic musical sequences are needed for a variety of applications, from music retrieval to automatic music generation. These musical sequences belong to a given corpus (or style) and it is obvious that a good distance on musical sequences should take this information into account; being able to define a distance ex nihilo which could be applicable to all music styles seems implausible. A distance could also be invariant under some transformations, such as transpositions, so that it can be used as a distance between musical motives rather than musical sequences. However, to our knowledge, none of the approaches to devise musical distances seem to address these issues. This paper introduces a method to build transposition-invariant distances on symbolic musical sequences which are learned from data. It is a hybrid distance which combines learned feature representations of musical sequences with a handcrafted rank distance. This distance depends less on the musical encoding of the data than previous methods and gives perceptually good results. We demonstrate its efficiency on the dataset of chorale melodies by J.S. Bach.
LGJul 14, 2017
GLSR-VAE: Geodesic Latent Space Regularization for Variational AutoEncoder ArchitecturesGaëtan Hadjeres, Frank Nielsen, François Pachet
VAEs (Variational AutoEncoders) have proved to be powerful in the context of density modeling and have been used in a variety of contexts for creative purposes. In many settings, the data we model possesses continuous attributes that we would like to take into account at generation time. We propose in this paper GLSR-VAE, a Geodesic Latent Space Regularization for the Variational AutoEncoder architecture and its generalizations which allows a fine control on the embedding of the data into the latent space. When augmenting the VAE loss with this regularization, changes in the learned latent space reflects changes of the attributes of the data. This deeper understanding of the VAE latent space structure offers the possibility to modulate the attributes of the generated data in a continuous way. We demonstrate its efficiency on a monophonic music generation task where we manage to generate variations of discrete sequences in an intended and playful way.
AIDec 3, 2016
DeepBach: a Steerable Model for Bach Chorales GenerationGaëtan Hadjeres, François Pachet, Frank Nielsen
This paper introduces DeepBach, a graphical model aimed at modeling polyphonic music and specifically hymn-like pieces. We claim that, after being trained on the chorale harmonizations by Johann Sebastian Bach, our model is capable of generating highly convincing chorales in the style of Bach. DeepBach's strength comes from the use of pseudo-Gibbs sampling coupled with an adapted representation of musical data. This is in contrast with many automatic music composition approaches which tend to compose music sequentially. Our model is also steerable in the sense that a user can constrain the generation by imposing positional constraints such as notes, rhythms or cadences in the generated score. We also provide a plugin on top of the MuseScore music editor making the interaction with DeepBach easy to use.
AISep 16, 2016
Style Imitation and Chord Invention in Polyphonic Music with Exponential FamiliesGaëtan Hadjeres, Jason Sakellariou, François Pachet
Modeling polyphonic music is a particularly challenging task because of the intricate interplay between melody and harmony. A good model should satisfy three requirements: statistical accuracy (capturing faithfully the statistics of correlations at various ranges, horizontally and vertically), flexibility (coping with arbitrary user constraints), and generalization capacity (inventing new material, while staying in the style of the training corpus). Models proposed so far fail on at least one of these requirements. We propose a statistical model of polyphonic music, based on the maximum entropy principle. This model is able to learn and reproduce pairwise statistics between neighboring note events in a given corpus. The model is also able to invent new chords and to harmonize unknown melodies. We evaluate the invention capacity of the model by assessing the amount of cited, re-discovered, and invented chords on a corpus of Bach chorales. We discuss how the model enables the user to specify and enforce user-defined constraints, which makes it useful for style-based, interactive music generation.