Conditioning Trick for Training Stable GANs
This work addresses instability in GAN training for audio synthesis, offering incremental improvements in fidelity and variety for environmental and voice sounds.
The paper tackles GAN training instability by proposing a conditioning trick that forces the generator to match the departure from normality of real samples in the spectral domain, applied to audio spectrogram generation. Experimental results on the UrbanSound8k, ESC-50, and Mozilla Common Voice datasets show that the method outperforms baselines in inception score, Fréchet inception distance, and signal-to-noise ratio.
In this paper, we propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training. We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of the Schur decomposition. This binding makes the generator amenable to truncation and does not limit exploring all the possible modes. We slightly modify the BigGAN architecture by incorporating a residual network for synthesizing 2D representations of audio signals, which enables reconstructing high-quality sounds with some preserved phase information. Additionally, the proposed conditional training scenario makes a trade-off between fidelity and variety for the generated spectrograms. Experimental results on the UrbanSound8k and ESC-50 environmental sound datasets and the Mozilla Common Voice dataset show that the proposed GAN configuration with the conditioning trick remarkably outperforms baseline architectures according to three objective metrics: inception score, Fréchet inception distance, and signal-to-noise ratio.
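As a rough illustration of the quantity involved, the sketch below computes the departure from normality of a square matrix from its complex Schur form. It assumes Henrici's standard definition (the Frobenius norm of the strictly upper-triangular part of the Schur factor), which the abstract does not spell out. The function names, the absolute-difference penalty `dfn_gap`, and the random 8x8 matrices standing in for square spectrogram patches are all hypothetical and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import schur


def departure_from_normality(a: np.ndarray) -> float:
    """Henrici's departure from normality of a square matrix.

    With the complex Schur form A = Q T Q^H and T = D + N
    (D diagonal, N strictly upper triangular), the value is ||N||_F.
    """
    t, _ = schur(a, output="complex")
    n = np.triu(t, k=1)  # strictly upper-triangular part of T
    return float(np.linalg.norm(n, "fro"))


def dfn_gap(real: np.ndarray, fake: np.ndarray) -> float:
    """Hypothetical penalty (an assumption, not the paper's loss):
    absolute difference between the two departures from normality."""
    return abs(departure_from_normality(real) - departure_from_normality(fake))


rng = np.random.default_rng(0)
real = rng.standard_normal((8, 8))  # stand-in for a real spectrogram patch
fake = rng.standard_normal((8, 8))  # stand-in for a generated patch
symmetric = real + real.T           # symmetric matrices are normal

print("departure (real):     ", departure_from_normality(real))
print("departure (symmetric):", departure_from_normality(symmetric))
print("gap real vs. fake:    ", dfn_gap(real, fake))
```

A normal matrix, such as the symmetric one above, has a departure of zero up to rounding, so the quantity measures how far a matrix is from being unitarily diagonalizable. In a training loop, a term like `dfn_gap` would presumably be added to the generator objective, but how the paper weights or formulates it is not stated here.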