SD, AI · May 7, 2024

Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects Modeling

arXiv:2405.04124v6 · 4 citations · h-index: 3 · EURASIP Journal on Audio, Speech, and Music Processing
Originality: Synthesis-oriented
AI Analysis

This incremental study helps audio engineers by comparing neural network architectures for virtual analog modeling, but it does not introduce new methods.

The researchers compared State-Space models, Linear Recurrent Units, and LSTM networks for virtual analog audio effects modeling. LSTMs performed best for distortions and equalizers, while encoder-decoder LSTMs and State-Space models excelled at saturation and compression. No model effectively emulated the low-pass filter.

Artificial neural networks are a promising technique for virtual analog modeling, having shown particular success in emulating distortion circuits. Despite their potential, enhancements are needed to enable effect parameters to influence the network's response and to achieve a low-latency output. While hybrid solutions, which incorporate both analytical and black-box techniques, offer certain advantages, black-box approaches, such as neural networks, can be preferable in contexts where rapid deployment, simplicity, or adaptability are required, and where understanding the internal mechanisms of the system is less critical. In this article, we explore the application of recent machine learning advancements for virtual analog modeling. We compare State-Space models and Linear Recurrent Units against the more common LSTM networks, with a variety of audio effects. We evaluate the performance and limitations of these models using multiple metrics, providing insights for future research and development. Our metrics aim to assess the models' ability to accurately replicate the signal's energy and frequency contents, with a particular focus on transients. The Feature-wise Linear Modulation method is employed to incorporate effect parameters that influence the network's response, enabling dynamic adaptability based on specified conditions. Experimental results suggest that LSTM networks offer an advantage in emulating distortions and equalizers, although performance differences are sometimes subtle yet statistically significant. On the other hand, encoder-decoder configurations of Long Short-Term Memory networks and State-Space models excel in modeling saturation and compression, effectively managing the dynamic aspects inherent in these effects. However, no models effectively emulate the low-pass filter, and Linear Recurrent Units show inconsistent performance across various audio effects.
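The Feature-wise Linear Modulation (FiLM) conditioning mentioned in the abstract can be illustrated with a minimal sketch. FiLM in its generic form scales and shifts each hidden feature by values computed from the conditioning input, here the effect parameters. The abstract does not specify the networks' layer sizes or where the modulation is applied, so the function names, the single affine layer used to produce the scale and shift, and the example values below are all assumptions for illustration only, not the authors' implementation.

```python
def linear(weights, bias, vec):
    """Affine map; `weights` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, params, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM step for one frame of hidden features.

    features: hidden features from the audio network (length F)
    params:   effect parameters, e.g. normalized knob positions (length P)
    The weight/bias arguments are hypothetical learned values.
    """
    gamma = linear(w_gamma, b_gamma, params)  # per-feature scale
    beta = linear(w_beta, b_beta, params)     # per-feature shift
    # Modulate each feature independently: gamma * h + beta
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# One feature, one effect parameter: gamma = 3.0 * 1.0, beta = 1.0 * 1.0 + 0.5
modulated = film([2.0], [1.0], [[3.0]], [0.0], [[1.0]], [0.5])
```

With zero weights, a scale bias of 1, and a shift bias of 0, the modulation reduces to the identity, so the effect parameters only change the network's response to the extent that the learned weights make the scale and shift depend on them.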
