AS · Aug 14, 2024
Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models
Jean-Marie Lemercier, Eloi Moliner, Simon Welker et al.
This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. The algorithm is rooted in Bayesian posterior sampling: it combines a likelihood model enforcing fidelity to the reverberant measurement, and an anechoic speech prior implemented by an unconditional diffusion model. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. Room acoustics estimation and speech dereverberation are jointly carried out, as the filter parameters are iteratively estimated and the speech utterance refined along the reverse diffusion trajectory. In a blind scenario where the RIR is unknown, BUDDy successfully performs speech dereverberation in various acoustic scenarios, significantly outperforming other blind unsupervised baselines. Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions. This paper extends our previous work by offering new experimental results and insights into the algorithm's versatility. We demonstrate the robustness of our proposed method to new acoustic and speaker conditions, as well as its adaptability to high-resolution singing voice dereverberation, using both instrumental metrics and subjective listening evaluation. We study BUDDy's performance for RIR estimation and observe it surpasses a state-of-the-art supervised DNN-based estimator on mismatched acoustic conditions. Finally, we investigate the sensitivity of informed dereverberation methods to RIR estimation errors, thereby motivating the joint acoustic estimation and dereverberation design. Audio examples and code can be found online.
AS · Oct 20, 2023
HRTF Interpolation using a Spherical Neural Process Meta-Learner
Etienne Thuillier, Craig Jin, Vesa Välimäki
Several individualization methods have recently been proposed to estimate a subject's Head-Related Transfer Function (HRTF) using convenient input modalities such as anthropometric measurements or pinnae photographs. There exists a need for adaptively correcting the estimation error committed by such methods using a few data point samples from the subject's HRTF, acquired using acoustic measurements or perceptual feedback. To this end, we introduce a Convolutional Conditional Neural Process meta-learner specialized in HRTF error interpolation. In particular, the model includes a Spherical Convolutional Neural Network component to accommodate the spherical geometry of HRTF data. It also exploits potential symmetries between the HRTF's left and right channels about the median axis. In this work, we evaluate the proposed model's performance purely on time-aligned spectrum interpolation grounds under a simplified setup where a generic population-mean HRTF, rather than an individualized one, forms the initial estimate prior to corrections. The trained model achieves up to 3 dB relative error reduction compared to state-of-the-art interpolation methods despite being trained using only 85 subjects. This improvement translates to nearly a halving of the data point count required to achieve comparable accuracy, in particular from 50 to 28 points to reach an average of -20 dB relative error per interpolated feature. Moreover, we show that the trained model provides well-calibrated uncertainty estimates. Accordingly, such estimates can inform the sequential decision problem of acquiring as few correcting HRTF data points as needed to meet a desired level of HRTF individualization accuracy.
AS · May 4, 2022
Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations
Jan Wilczek, Alec Wright, Vesa Välimäki et al.
Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) while using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after the training has completed, which results in increased accuracy. Using a more sophisticated numerical solver increases the accuracy further at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.
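To make the idea of an ODE-based clipper model concrete, here is a minimal sketch that integrates the textbook circuit equation of a first-order diode clipper, dv/dt = (u - v)/(RC) - (2·Is/C)·sinh(v/Vt). It is not the paper's method: the paper learns the right-hand side with a neural network, whereas this sketch uses the known physical equation. The component values, the function name `simulate_diode_clipper`, and the choice of a backward Euler solver with Newton iterations are all assumptions made for illustration.

```python
import math
import numpy as np

def simulate_diode_clipper(u, fs, R=2.2e3, C=10e-9, Is=2.52e-9, Vt=45.3e-3,
                           newton_iters=20):
    """Integrate dv/dt = (u - v)/(R*C) - (2*Is/C)*sinh(v/Vt) with backward Euler.

    All component values are illustrative assumptions, not taken from the paper.
    The step size h = 1/fs is a solver parameter, so the same ODE can be
    integrated at any sampling rate.
    """
    h = 1.0 / fs
    v = 0.0
    out = np.empty(len(u))
    for n, un in enumerate(u):
        v_prev = v
        # Newton iterations on g(v) = v - v_prev - h * f(v, un) = 0
        for _ in range(newton_iters):
            f = (un - v) / (R * C) - (2.0 * Is / C) * math.sinh(v / Vt)
            g = v - v_prev - h * f
            dg = 1.0 + h * (1.0 / (R * C)
                            + (2.0 * Is / (C * Vt)) * math.cosh(v / Vt))
            v -= g / dg
        out[n] = v
    return out
```

Calling the function with a different `fs` changes only the integration step, which mirrors, under these assumptions, why an ODE model is not tied to a single sampling rate. An implicit solver is used here because the physical equation is stiff when the diodes conduct.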
AS · Jun 2, 2023
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach
Eloi Moliner, Filip Elvander, Vesa Välimäki
Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: (http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
AS · Feb 15, 2024
Diffusion Models for Audio Restoration
Jean-Marie Lemercier, Julius Richter, Simon Welker et al.
With the development of audio playback devices and fast data transmission, the demand for high sound quality is rising for both entertainment and communications. In this quest for better sound quality, challenges emerge from distortions and interferences originating at the recording side or caused by an imperfect transmission pipeline. To address this problem, audio restoration methods aim to recover clean sound signals from the corrupted input data. We present here audio restoration algorithms based on diffusion models, with a focus on speech enhancement and music restoration tasks. Traditional approaches, often grounded in handcrafted rules and statistical heuristics, have shaped our understanding of audio signals. In the past decades, there has been a notable shift towards data-driven methods that exploit the modeling capabilities of DNNs. Deep generative models, and among them diffusion models, have emerged as powerful techniques for learning complex data distributions. However, relying solely on DNN-based learning approaches carries the risk of reducing interpretability, particularly when employing end-to-end models. Nonetheless, data-driven approaches allow more flexibility in comparison to statistical model-based frameworks, whose performance depends on distributional and statistical assumptions that can be difficult to guarantee. Here, we aim to show that diffusion models can combine the best of both worlds and offer the opportunity to design audio restoration algorithms with a good degree of interpretability and a remarkable performance in terms of sound quality. We explain the diffusion formalism and its application to the conditional generation of clean audio signals. We believe that diffusion models open an exciting field of research with the potential to spawn new audio restoration algorithms that are natural-sounding and remain robust in difficult acoustic situations.
AS · May 7, 2024
BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models
Eloi Moliner, Jean-Marie Lemercier, Simon Welker et al.
In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined along the reverse diffusion trajectory. A measurement consistency criterion enforces the fidelity of the generated speech to the reverberant measurement, while an unconditional diffusion model implements a strong prior for clean speech generation. Without any knowledge of the room impulse response or any coupled reverberant-anechoic data, we can successfully perform dereverberation in various acoustic scenarios. Our method significantly outperforms previous blind unsupervised baselines, and we demonstrate its increased robustness to unseen acoustic conditions in comparison to blind supervised methods. Audio samples and code are available online.
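The subband exponential decay mentioned above can be illustrated with a small sketch. It builds a magnitude envelope A[b, n] = w_b · exp(-α_b · t_n) for each subband b, where the decay rate is derived from a reverberation time via α = 3·ln(10)/T60 (the amplitude falls by 60 dB after T60 seconds). The function name, the per-band weights, and the T60-based parameterization are assumptions for illustration; the abstract does not specify the exact parameterization used in BUDDy.

```python
import numpy as np

def subband_decay_envelope(t60_per_band, weights, n_frames, frame_rate):
    """Return a (bands, frames) matrix A[b, n] = w_b * exp(-alpha_b * n / frame_rate).

    Hypothetical helper: t60_per_band is in seconds, frame_rate in frames per second.
    """
    t60 = np.asarray(t60_per_band, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Amplitude drops by a factor of 10**-3 (60 dB) after t60 seconds.
    alpha = 3.0 * np.log(10.0) / t60
    t = np.arange(n_frames) / frame_rate
    return w[:, None] * np.exp(-alpha[:, None] * t[None, :])
```

In a joint estimation scheme such as the one described, the per-band weights and decay rates would be the quantities updated at each step, while the envelope shape itself stays fixed by the exponential form.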
AS · Apr 7, 2025
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches
Eloi Moliner, Michal Švento, Alec Wright et al.
Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.
AS · Jan 30, 2025
Resampling Filter Design for Multirate Neural Audio Effect Processing
Alistair Carson, Vesa Välimäki, Alec Wright et al.
Neural networks have become ubiquitous in audio effects modelling, especially for guitar amplifiers and distortion pedals. One limitation of such models is that the sample rate of the training data is implicitly encoded in the model weights and therefore not readily adjustable at inference. Recent work explored modifications to recurrent neural network architecture to approximate a sample rate independent system, enabling audio processing at a rate that differs from the original training rate. This method works well for integer oversampling and can reduce aliasing caused by nonlinear activation functions. For small fractional changes in sample rate, fractional delay filters can be used to approximate sample rate independence, but in some cases this method fails entirely. Here, we explore the use of real-time signal resampling at the input and output of the neural network as an alternative solution. We investigate several resampling filter designs and show that a two-stage design consisting of a half-band IIR filter cascaded with a Kaiser window FIR filter can give results similar to or better than those of the previously proposed model adjustment method, with many fewer filtering operations per sample and less than one millisecond of latency at typical audio rates. Furthermore, we investigate interpolation and decimation filters for the task of integer oversampling and show that cascaded half-band IIR and FIR designs can be used in conjunction with the model adjustment method to reduce aliasing in a range of distortion effect models.
AS · Sep 19, 2025
Similarity-Guided Diffusion for Long-Gap Music Inpainting
Sean Turland, Eloi Moliner, Vesa Välimäki
Music inpainting aims to reconstruct missing segments of a corrupted recording. While diffusion-based generative models improve reconstruction for medium-length gaps, they often struggle to preserve musical plausibility over multi-second gaps. We introduce Similarity-Guided Diffusion Posterior Sampling (SimDPS), a hybrid method that combines diffusion-based inference with similarity search. Candidate segments are first retrieved from a corpus based on contextual similarity, then incorporated into a modified likelihood that guides the diffusion process toward contextually consistent reconstructions. Subjective evaluation on piano music inpainting with 2-s gaps shows that the proposed SimDPS method enhances perceptual plausibility compared to unguided diffusion and frequently outperforms similarity search alone when moderately similar candidates are available. These results demonstrate the potential of a hybrid similarity approach for diffusion-based audio enhancement with long gaps.
AS · May 26, 2023
Neural modeling of magnetic tape recorders
Otto Mikkonen, Alec Wright, Eloi Moliner et al.
The sound of magnetic recording media, such as open-reel and cassette tape recorders, is still sought after by today's sound practitioners due to the imperfections embedded in the physics of the magnetic recording process. This paper proposes a method for digitally emulating this character using neural networks. The signal chain of the proposed system consists of three main components: the hysteretic nonlinearity and filtering jointly produced by the magnetic recording process as well as the record and playback amplifiers, the fluctuating delay originating from the tape transport, and the combined additive noise component from various electromagnetic origins. In our approach, the hysteretic nonlinear block is modeled using a recurrent neural network, while the delay trajectories and the noise component are generated using separate diffusion models, which employ U-net deep convolutional neural networks. According to the conducted objective evaluation, the proposed architecture faithfully captures the character of the magnetic tape recorder. The results of this study can be used to construct virtual replicas of vintage sound recording devices with applications in music production and audio antiquing tasks.
AS · Feb 17, 2022
A Two-Stage U-Net for High-Fidelity Denoising of Historical Recordings
Eloi Moliner, Vesa Välimäki
Enhancing the sound quality of historical music recordings is a long-standing problem. This paper presents a novel denoising method based on a fully-convolutional deep neural network. A two-stage U-Net model architecture is designed to model and suppress the degradations with high fidelity. The method processes the time-frequency representation of audio, and is trained using realistic noisy data to jointly remove hiss, clicks, thumps, and other common additive disturbances from old analog discs. The proposed model outperforms previous methods in both objective and subjective metrics. The results of a formal blind listening test show that real gramophone recordings denoised with this method have significantly better quality than the baseline methods. This study shows the importance of realistic training data and the power of deep learning in audio restoration.
AS · Oct 8, 2021
A Method for Capturing and Reproducing Directional Reverberation in Six Degrees of Freedom
Benoit Alary, Vesa Välimäki
The reproduction of acoustics is an important aspect of the preservation of cultural heritage. A common approach is to capture an impulse response in a hall and auralize it by convolving an input signal with the measured reverberant response. For immersive applications, it is typical to acquire spatial impulse responses using a spherical microphone array to capture the reverberant sound field. While this allows a listener to freely rotate their head at the captured location during reproduction, careful considerations must be made to allow a full six-degrees-of-freedom auralization. Furthermore, the computational cost of convolution with a high-order Ambisonics impulse response remains prohibitively high for current real-time applications, where most of the resources are dedicated to rendering graphics. For this reason, simplifications are often made in the reproduction of reverberation, such as using a uniform decay around the listener. However, recent work has highlighted the importance of directional characteristics in the late reverberant sound field, and more efficient reproduction methods have been developed. In this article, we propose a framework that extracts directional decay properties from a set of captured spatial impulse responses to characterize a directional feedback delay network. For this purpose, a data set was acquired in the main auditorium of the Finnish National Opera and Ballet in Helsinki from multiple source-listener positions, in order to analyze the anisotropic characteristics of this auditorium and illustrate the proposed reproduction framework.
AS · Nov 20, 2019
Perceptual Loss Function for Neural Modelling of Audio Systems
Alec Wright, Vesa Välimäki
This work investigates alternate pre-emphasis filters used as part of the loss function during neural network training for nonlinear audio processing. In our previous work, the error-to-signal ratio loss function was used during network training, with a first-order highpass pre-emphasis filter applied to both the target signal and neural network output. This work considers more perceptually relevant pre-emphasis filters, which include lowpass filtering at high frequencies. We conducted listening tests to determine whether they offer an improvement to the quality of a neural network model of a guitar tube amplifier. Listening test results indicate that the use of an A-weighting pre-emphasis filter offers the best improvement among the tested filters. The proposed perceptual loss function improves the sound quality of neural network models in audio processing without affecting the computational cost.
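As a rough illustration of the baseline loss described above, the following sketch computes an error-to-signal ratio (error energy divided by target energy) after applying a first-order highpass pre-emphasis filter to both signals. The filter coefficient of 0.95, the small `eps` stabilizer, and the function names are assumptions for illustration; the perceptually motivated filters studied in the paper (such as A-weighting) are not reproduced here.

```python
import numpy as np

def pre_emphasis(x, coeff=0.95):
    """First-order highpass y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= coeff * x[:-1]
    return y

def esr_loss(target, output, coeff=0.95, eps=1e-12):
    """Error-to-signal ratio computed on pre-emphasized target and output."""
    t = pre_emphasis(target, coeff)
    o = pre_emphasis(output, coeff)
    # eps (an assumption) avoids division by zero for a silent target.
    return float(np.sum((t - o) ** 2) / (np.sum(t ** 2) + eps))
```

A perfect model output gives a loss of zero, while an all-zero output gives a loss close to one, so the value can be read as the fraction of the (pre-emphasized) target energy left unexplained.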
AS · Nov 1, 2018
Deep Learning for Tube Amplifier Emulation
Eero-Pekka Damskägg, Lauri Juvela, Etienne Thuillier et al.
Analog audio effects and synthesizers often owe their distinct sound to circuit nonlinearities. Faithfully modeling such a significant aspect of the original sound in virtual analog software can prove challenging. The current work proposes a generic data-driven approach to virtual analog modeling and applies it to the Fender Bassman 56F-A vacuum-tube amplifier. Specifically, a feedforward variant of the WaveNet deep neural network is trained to carry out a regression on audio waveform samples from input to output of a SPICE model of the tube amplifier. The output signals are pre-emphasized to assist the model in learning the high-frequency content. The results of a listening test suggest that the proposed model accurately emulates the reference device. In particular, the model responds to user control changes and faithfully reproduces the range of sonic characteristics found across the configurations of the original device.