SDAug 22, 2024
Modeling Time-Variant Responses of Optical Compressors with Selective State Space ModelsRiccardo Simionato, Stefano Fasciani
This paper presents a method for modeling optical dynamic range compressors using deep neural networks with Selective State Space models. The proposed approach surpasses previous methods based on recurrent layers by employing a Selective State Space block to encode the input audio. It features a refined technique integrating Feature-wise Linear Modulation and Gated Linear Units to adjust the network dynamically, conditioning the compression's attack and release phases according to external parameters. The proposed architecture is well-suited for low-latency and real-time applications, crucial in live audio processing. The method has been validated on the analog optical compressors TubeTech CL 1B and Teletronix LA-2A, which possess distinct characteristics. Evaluation is performed using quantitative metrics and subjective listening tests, comparing the proposed method with other state-of-the-art models. Results show that our black-box modeling methods outperform all others, achieving accurate emulation of the compression process for both seen and unseen settings during training. We further show a correlation between this accuracy and the sampling density of the control parameters in the dataset and identify settings with fast attack and slow release as the most challenging to emulate.
SDSep 10, 2024
Sines, Transient, Noise Neural Modeling of Piano NotesRiccardo Simionato, Stefano Fasciani
This paper introduces a novel method for emulating piano sounds. We propose to exploit the sines, transient, and noise decomposition to design a differentiable spectral modeling synthesizer replicating piano notes. Three sub-modules learn these components from piano recordings and generate the corresponding harmonic, transient, and noise signals. Splitting the emulation into three independently trainable models reduces the modeling tasks' complexity. The quasi-harmonic content is produced using a differentiable sinusoidal model guided by physics-derived formulas, whose parameters are automatically estimated from audio recordings. The noise sub-module uses a learnable time-varying filter, and the transients are generated using a deep convolutional network. From singular notes, we emulate the coupling between different keys in trichords with a convolutional-based network. Results show the model matches the partial distribution of the target while predicting the energy in the higher part of the spectrum presents more challenges. The energy distribution in the spectra of the transient and noise components is accurate overall. While the model is more computationally and memory efficient, perceptual tests reveal limitations in accurately modeling the attack phase of notes. Despite this, it generally achieves perceptual accuracy in emulating single notes and trichords.
SDMay 7, 2024
Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects ModelingRiccardo Simionato, Stefano Fasciani
Artificial neural networks are a promising technique for virtual analog modeling, having shown particular success in emulating distortion circuits. Despite their potential, enhancements are needed to enable effect parameters to influence the network's response and to achieve a low-latency output. While hybrid solutions, which incorporate both analytical and black-box techniques, offer certain advantages, black-box approaches, such as neural networks, can be preferable in contexts where rapid deployment, simplicity, or adaptability are required, and where understanding the internal mechanisms of the system is less critical. In this article, we explore the application of recent machine learning advancements for virtual analog modeling. We compare State-Space models and Linear Recurrent Units against the more common LSTM networks, with a variety of audio effects. We evaluate the performance and limitations of these models using multiple metrics, providing insights for future research and development. Our metrics aim to assess the models' ability to accurately replicate the signal's energy and frequency contents, with a particular focus on transients. The Feature-wise Linear Modulation method is employed to incorporate effect parameters that influence the network's response, enabling dynamic adaptability based on specified conditions. Experimental results suggest that LSTM networks offer an advantage in emulating distortions and equalizers, although performance differences are sometimes subtle yet statistically significant. On the other hand, encoder-decoder configurations of Long Short-Term Memory networks and State-Space models excel in modeling saturation and compression, effectively managing the dynamic aspects inherent in these effects. However, no models effectively emulate the low-pass filter, and Linear Recurrent Units show inconsistent performance across various audio effects.
ASApr 27, 2021
DASEE A Synthetic Database of Domestic Acoustic Scenes and Events in Dementia Patients EnvironmentAbigail Copiaco, Christian Ritz, Stefano Fasciani et al.
Access to informative databases is a crucial part of notable research developments. In the field of domestic audio classification, there have been significant advances in recent years. Although several audio databases exist, these can be limited in terms of the amount of information they provide, such as the exact location of the sound sources, and the associated noise levels. In this work, we detail our approach on generating an unbiased synthetic domestic audio database, consisting of sound scenes and events, emulated in both quiet and noisy environments. Data is carefully curated such that it reflects issues commonly faced in a dementia patients environment, and recreate scenarios that could occur in real-world settings. Similarly, the room impulse response generated is based on a typical one-bedroom apartment at Hebrew SeniorLife Facility. As a result, we present an 11-class database containing excerpts of clean and noisy signals at 5-seconds duration each, uniformly sampled at 16 kHz. Using our baseline model using Continues Wavelet Transform Scalograms and AlexNet, this yielded a weighted F1-score of 86.24 percent.