SDNov 1, 2022
Modelling black-box audio effects with time-varying feature modulationMarco Comunità, Christian J. Steinmetz, Huy Phan et al.
Deep learning approaches for black-box modelling of audio effects have shown promise, however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. To address this, we propose the integration of time-varying feature-wise linear modulation into existing temporal convolutional backbones, an approach that enables learnable adaptation of the intermediate activations. We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics. We provide sound examples, source code, and pretrained models to faciliate reproducibility.
SDOct 23, 2023
SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley SynthesisMarco Comunità, Riccardo F. Gramaccioni, Emilian Postolache et al.
Sound design involves creatively selecting, recording, and editing sound effects for various media like cinema, video games, and virtual/augmented reality. One of the most time-consuming steps when designing sound is synchronizing audio with video. In some cases, environmental recordings from video shoots are available, which can aid in the process. However, in video games and animations, no reference audio exists, requiring manual annotation of event timings from the video. We propose a system to extract repetitive actions onsets from a video, which are then used - in conjunction with audio or textual embeddings - to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to faciliate reproducibility
ASApr 11, 2024Code
Differentiable All-pole Filters for Time-varying Audio SystemsChin-Yun Yu, Christopher Mitcheltree, Alistair Carson et al.
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin at https://diffapf.github.io/web/.
SDOct 7, 2025Code
Modulation Discovery with Differentiable Digital Signal ProcessingChristopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that allow users to modulate the output with ease. However, determining the modulation signals used to create a sound is difficult, and existing sound-matching / parameter estimation systems are often uninterpretable black boxes or predict high-dimensional framewise parameter values without considering the shape, structure, and routing of the underlying modulation curves. We propose a neural sound-matching approach that leverages modulation extraction, constrained control signal parameterizations, and differentiable digital signal processing (DDSP) to discover the modulations present in a sound. We demonstrate the effectiveness of our approach on highly modulated synthetic and real audio samples, its applicability to different DDSP synth architectures, and investigate the trade-off it incurs between interpretability and sound-matching accuracy. We make our code and audio samples available and provide the trained DDSP synths in a VST plugin.
SDMay 22, 2023Code
Modulation Extraction for LFO-driven Audio EffectsChristopher Mitcheltree, Christian J. Steinmetz, Marco Comunità et al.
Low frequency oscillator (LFO) driven audio effects such as phaser, flanger, and chorus, modify an input signal using time-varying filters and delays, resulting in characteristic sweeping or widening effects. It has been shown that these effects can be modeled using neural networks when conditioned with the ground truth LFO signal. However, in most cases, the LFO signal is not accessible and measurement from the audio signal is nontrivial, hindering the modeling process. To address this, we propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations. Since our system imposes no restrictions on the LFO signal shape, we demonstrate its ability to extract quasiperiodic, combined, and distorted modulation signals that are relevant to effect modeling. Furthermore, we show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects using only dry and wet audio pairs, overcoming the need to access the audio effect or internal LFO signal. We make our code available and provide the trained audio effect models in a real-time VST plugin.
SDDec 19, 2024
FolAI: Synchronized Foley Sound Generation with Semantic and Temporal AlignmentRiccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache et al.
Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal scaffold for the audio. In the second stage, a diffusion-based generative model produces sound effects conditioned both on this temporal envelope and on high-level semantic embeddings, provided by the user, that define the desired auditory content (e.g., material or action type). This modular design enables precise control over both timing and timbre, streamlining repetitive tasks while preserving creative flexibility in professional Foley workflows. Results on diverse visual contexts, such as footstep generation and action-specific sonorization, demonstrate that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings. Supplementary materials are accessible on our dedicated demo page at https://ispamm.github.io/FolAI.
SDNov 25, 2025
Differentiable Attenuation Filters for Feedback Delay NetworksIlias Ibnyahya, Joshua D. Reiss
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling fine control over frequency-dependent reverberation decay. Unlike traditional graphic equalizer designs, which require numerous filters per delay line, we propose a scalable solution where the number of filters can be adjusted. The frequency, gain, and quality factor (Q) parameters are shared parameters across delay lines and only the gain is adjusted based on delay length. This design not only reduces the number of optimization parameters, but also remains fully differentiable and compatible with gradient-based learning frameworks. Leveraging principles of analog filter design, our method allows for efficient and accurate filter fitting using supervised learning. Our method delivers a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
SDMar 8, 2025
Bimodal Connection Attention Fusion for Speech Emotion RecognitionJiachen Luo, Huy Phan, Lin Wang et al.
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
ASDec 6, 2021
Steerable discovery of neural audio effectsChristian J. Steinmetz, Joshua D. Reiss
Applications of deep learning for audio effects often focus on modeling analog effects or learning to control effects to emulate a trained audio engineer. However, deep learning approaches also have the potential to expand creativity through neural audio effects that enable new sound transformations. While recent work demonstrated that neural networks with random weights produce compelling audio effects, control of these effects is limited and unintuitive. To address this, we introduce a method for the steerable discovery of neural audio effects. This method enables the design of effects using example recordings provided by the user. We demonstrate how this method produces an effect similar to the target effect, along with interesting inaccuracies, while also providing perceptually relevant controls.
SDOct 18, 2021
Neural Synthesis of Footsteps Sound Effects with Generative Adversarial NetworksMarco Comunità, Huy Phan, Joshua D. Reiss
Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results for the task at hand.
SPOct 7, 2021
Direct design of biquad filter cascades with deep learning by sampling random polynomialsJoseph T. Colonel, Christian J. Steinmetz, Marcus Michelen et al.
Designing infinite impulse response filters to match an arbitrary magnitude response requires specialized techniques. Methods like modified Yule-Walker are relatively efficient, but may not be sufficiently accurate in matching high order responses. On the other hand, iterative optimization techniques often enable superior performance, but come at the cost of longer run-times and are sensitive to initial conditions, requiring manual tuning. In this work, we address some of these limitations by learning a direct mapping from the target magnitude response to the filter coefficient space with a neural network trained on millions of random filters. We demonstrate our approach enables both fast and accurate estimation of filter coefficients given a desired response. We investigate training with different families of random filters, and find training with a variety of filter families enables better generalization when estimating real-world filters, using head-related transfer functions and guitar cabinets as case studies. We compare our method against existing methods including modified Yule-Walker and gradient descent and show our approach is, on average, both faster and more accurate.
ASOct 4, 2021
WaveBeat: End-to-end beat and downbeat tracking in the time domainChristian J. Steinmetz, Joshua D. Reiss
Deep learning approaches for beat and downbeat tracking have brought advancements. However, these approaches continue to rely on hand-crafted, subsampled spectral features as input, restricting the information available to the model. In this work, we propose WaveBeat, an end-to-end approach for joint beat and downbeat tracking operating directly on waveforms. This method forgoes engineered spectral features, and instead, produces beat and downbeat predictions directly from the waveform, the first of its kind for this task. Our model utilizes temporal convolutional networks (TCNs) operating on waveforms that achieve a very large receptive field ($\geq$ 30 s) at audio sample rates in a memory efficient manner by employing rapidly growing dilation factors with fewer layers. With a straightforward data augmentation strategy, our method outperforms previous state-of-the-art methods on some datasets, while producing comparable results on others, demonstrating the potential for time domain approaches.
ASFeb 11, 2021
Efficient neural networks for real-time modeling of analog dynamic range compressionChristian J. Steinmetz, Joshua D. Reiss
Deep learning approaches have demonstrated success in modeling analog audio effects. Nevertheless, challenges remain in modeling more complex effects that involve time-varying nonlinear elements, such as dynamic range compressors. Existing neural network approaches for modeling compression either ignore the device parameters, do not attain sufficient accuracy, or otherwise require large noncausal models prohibiting real-time operation. In this work, we propose a modification to temporal convolutional networks (TCNs) enabling greater efficiency without sacrificing performance. By utilizing very sparse convolutional kernels through rapidly growing dilations, our model attains a significant receptive field using fewer layers, reducing computation. Through a detailed evaluation we demonstrate our efficient and causal approach achieves state-of-the-art performance in modeling the analog LA-2A, is capable of real-time operation on CPU, and only requires 10 minutes of training data.
SDDec 6, 2020
Guitar Effects Recognition and Parameter Estimation with Convolutional Neural NetworksMarco Comunità, Dan Stowell, Joshua D. Reiss
Despite the popularity of guitar effects, there is very little existing research on classification and parameter estimation of specific plugins or effect units from guitar recordings. In this paper, convolutional neural networks were used for classification and parameter estimation for 13 overdrive, distortion and fuzz guitar effects. A novel dataset of processed electric guitar samples was assembled, with four sub-datasets consisting of monophonic or polyphonic samples and discrete or continuous settings values, for a total of about 250 hours of processed samples. Results were compared for networks trained and tested on the same or on a different sub-dataset. We found that discrete datasets could lead to equally high performance as continuous ones, whilst being easier to design, analyse and modify. Classification accuracy was above 80\%, with confusion matrices reflecting similarities in the effects timbre and circuits design. With parameter values between 0.0 and 1.0, the mean absolute error is in most cases below 0.05, while the root mean square error is below 0.1 in all cases but one.
ASOct 8, 2020
Randomized Overdrive Neural NetworksChristian J. Steinmetz, Joshua D. Reiss
By processing audio signals in the time-domain with randomly weighted temporal convolutional networks (TCNs), we uncover a wide range of novel, yet controllable overdrive effects. We discover that architectural aspects, such as the depth of the network, the kernel size, the number of channels, the activation function, as well as the weight initialization, all have a clear impact on the sonic character of the resultant effect, without the need for training. In practice, these effects range from conventional overdrive and distortion, to more extreme effects, as the receptive field grows, similar to a fusion of distortion, equalization, delay, and reverb. To enable use by musicians and producers, we provide a real-time plugin implementation. This allows users to dynamically design networks, listening to the results in real-time. We provide a demonstration and code at https://csteinmetz1.github.io/ronn.
ASOct 22, 2019
Modeling plate and spring reverberation using a DSP-informed deep neural networkMarco A. Martínez Ramírez, Emmanouil Benetos, Joshua D. Reiss
Plate and spring reverberators are electromechanical systems first used and researched as means to substitute real room reverberation. Nowadays they are often used in music production for aesthetic reasons due to their particular sonic characteristics. The modeling of these audio processors and their perceptual qualities is difficult since they use mechanical elements together with analog electronics resulting in an extremely complex response. Based on digital reverberators that use sparse FIR filters, we propose a signal processing-informed deep learning architecture for the modeling of artificial reverberators. We explore the capabilities of deep neural networks to learn such highly nonlinear electromechanical responses and we perform modeling of plate and spring reverberators. In order to measure the performance of the model, we conduct a perceptual evaluation experiment and we also analyze how the given task is accomplished and what the model is actually learning.
ASMay 15, 2019
A general-purpose deep learning approach to model time-varying audio effectsMarco A. Martínez Ramírez, Emmanouil Benetos, Joshua D. Reiss
Audio processors whose parameters are modified periodically over time are often referred as time-varying or modulation based audio effects. Most existing methods for modeling these type of effect units are often optimized to a very specific circuit and cannot be efficiently generalized to other time-varying effects. Based on convolutional and recurrent neural networks, we propose a deep learning architecture for generic black-box modeling of audio processors with long-term memory. We explore the capabilities of deep neural networks to learn such long temporal dependencies and we show the network modeling various linear and nonlinear, time-varying and time-invariant audio effects. In order to measure the performance of the model, we propose an objective metric based on the psychoacoustics of modulation frequency perception. We also analyze what the model is actually learning and how the given task is accomplished.
MLJan 31, 2019
End-to-End Probabilistic Inference for Nonstationary Audio AnalysisWilliam J. Wilkinson, Michael Riis Andersen, Joshua D. Reiss et al.
A typical audio signal processing pipeline includes multiple disjoint analysis stages, including calculation of a time-frequency representation followed by spectrogram-based feature analysis. We show how time-frequency analysis and nonnegative matrix factorisation can be jointly formulated as a spectral mixture Gaussian process model with nonstationary priors over the amplitude variance parameters. Further, we formulate this nonlinear model's state space representation, making it amenable to infinite-horizon Gaussian process regression with approximate inference via expectation propagation, which scales linearly in the number of time steps and quadratically in the state dimensionality. By doing so, we are able to process audio signals with hundreds of thousands of data points. We demonstrate, on various tasks with empirical data, how this inference scheme outperforms more standard techniques that rely on extended Kalman filtering.
SPNov 6, 2018
Unifying Probabilistic Models for Time-Frequency AnalysisWilliam J. Wilkinson, Michael Riis Andersen, Joshua D. Reiss et al.
In audio signal processing, probabilistic time-frequency models have many benefits over their non-probabilistic counterparts. They adapt to the incoming signal, quantify uncertainty, and measure correlation between the signal's amplitude and phase information, making time domain resynthesis straightforward. However, these models are still not widely used since they come at a high computational cost, and because they are formulated in such a way that it can be difficult to interpret all the modelling assumptions. By showing their equivalence to Spectral Mixture Gaussian processes, we illuminate the underlying model assumptions and provide a general framework for constructing more complex models that better approximate real-world signals. Our interpretation makes it intuitive to inspect, compare, and alter the models since all prior knowledge is encoded in the Gaussian process kernel functions. We utilise a state space representation to perform efficient inference via Kalman smoothing, and we demonstrate how our interpretation allows for efficient parameter learning in the frequency domain.
ASOct 15, 2018
Modeling of nonlinear audio effects with end-to-end deep neural networksMarco A. Martínez Ramirez, Joshua D. Reiss
In the context of music production, distortion effects are mainly used for aesthetic reasons and are usually applied to electric musical instruments. Most existing methods for nonlinear modeling are often either simplified or optimized to a very specific circuit. In this work, we investigate deep learning architectures for audio processing and we aim to find a general purpose end-to-end deep neural network to perform modeling of nonlinear audio effects. We show the network modeling various nonlinearities and we discuss the generalization capabilities among different instruments.
IVMar 29, 2018
An empirical approach to the relationship between emotion and music production qualityDavid Ronan, Joshua D. Reiss, Hatice Gunes
In music production, the role of the mix engineer is to take recorded music and convey the expressed emotions as professionally sounding as possible. We investigated the relationship between music production quality and musically induced and perceived emotions. A listening test was performed where 10 critical listeners and 10 non-critical listeners evaluated 10 songs. There were two mixes of each song, the low quality mix and the high quality mix. Each participant's subjective experience was measured directly through questionnaire and indirectly by examining peripheral physiological changes, change in facial expressions and the number of head nods and shakes they made as they listened to each mix. We showed that music production quality had more of an emotional impact on critical listeners. Also, critical listeners had significantly different emotional responses to non-critical listeners for the high quality mixes and to a lesser extent the low quality mixes. The findings suggest that having a high level of skill in mix engineering only seems to matter in an emotional context to a subset of music listeners.
ASMar 27, 2018
Automatic Minimisation of Masking in Multitrack Audio using SubgroupsDavid Ronan, Zheng Ma, Paul Mc Namara et al.
The iterative process of masking minimisation when mixing multitrack audio is a challenging optimisation problem, in part due to the complexity and non-linearity of auditory perception. In this article, we first propose a multitrack masking metric inspired by the MPEG psychoacoustic model. We investigate different audio processing techniques to manipulate the frequency and dynamic characteristics of the signal in order to reduce masking based on the proposed metric. We also investigate whether or not automatically mixing using subgrouping is beneficial or not to perceived quality and clarity of a mix. Evaluation results suggest that our proposed masking metric when used in an automatic mixing framework can be used to reduce inter-channel auditory masking as well as improve the perceived quality and perceived clarity of a mix. Furthermore, our results suggest that using subgrouping in an automatic mixing framework can be used to improve the perceived quality and perceived clarity of a mix.
LGFeb 2, 2018
A Generative Model for Natural Sounds Based on Latent Force ModellingWilliam J. Wilkinson, Joshua D. Reiss, Dan Stowell
Recent advances in analysis of subband amplitude envelopes of natural sounds have resulted in convincing synthesis, showing subband amplitudes to be a crucial component of perception. Probabilistic latent variable analysis is particularly revealing, but existing approaches don't incorporate prior knowledge about the physical behaviour of amplitude envelopes, such as exponential decay and feedback. We use latent force modelling, a probabilistic learning paradigm that incorporates physical knowledge into Gaussian process regression, to model correlation across spectral subband envelopes. We augment the standard latent force model approach by explicitly modelling correlations over multiple time steps. Incorporating this prior knowledge strengthens the interpretation of the latent functions as the source that generated the signal. We examine this interpretation via an experiment which shows that sounds generated by sampling from our probabilistic model are perceived to be more realistic than those generated by similar models based on nonnegative matrix factorisation, even in cases where our model is outperformed from a reconstruction error perspective.