Time-Varying Audio Effect Modeling by End-to-End Adversarial Training
This work addresses the challenge of black-box modeling for time-varying audio effects. The contribution is incremental, as it builds on existing deep learning approaches for audio effect modeling.
The paper tackles the problem of modeling time-varying audio effects without control signals by introducing a GAN framework with a two-stage training strategy and a new objective metric. Experiments on a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box context.
Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, removing the need for modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new objective metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box context.
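To make the idea of a chirp-train test signal concrete, the following is a minimal sketch of how such a signal could be generated. It covers only the probe signal, not the metric itself, which the abstract does not specify. The function names (`linear_chirp`, `chirp_train`) and every parameter value (chirp length, gap length, sweep range, sample rate) are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def linear_chirp(duration_s, f0_hz, f1_hz, sample_rate):
    """Return one linear sine sweep from f0_hz to f1_hz as a list of floats."""
    n = int(round(duration_s * sample_rate))
    k = (f1_hz - f0_hz) / duration_s  # sweep rate in Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate
        # The phase is the integral of the instantaneous frequency f0 + k*t.
        phase = 2.0 * math.pi * (f0_hz * t + 0.5 * k * t * t)
        samples.append(math.sin(phase))
    return samples


def chirp_train(num_chirps, chirp_s, gap_s, f0_hz, f1_hz, sample_rate):
    """Concatenate identical short chirps separated by silent gaps.

    Hypothetical helper: the paper's actual chirp-train design is not
    described in the abstract, so all values here are illustrative.
    """
    chirp = linear_chirp(chirp_s, f0_hz, f1_hz, sample_rate)
    gap = [0.0] * int(round(gap_s * sample_rate))
    signal = []
    for _ in range(num_chirps):
        signal.extend(chirp)
        signal.extend(gap)
    return signal


# Assumed example values: four 50 ms sweeps with 10 ms gaps at 48 kHz.
train = chirp_train(
    num_chirps=4,
    chirp_s=0.05,
    gap_s=0.01,
    f0_hz=20.0,
    f1_hz=20000.0,
    sample_rate=48000,
)
```

Each short sweep probes the effect's frequency response at roughly one instant. Passing the whole train through a phaser and comparing the per-chirp responses is therefore one plausible way to observe how the notches move over time.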