SD · LG · AS · Oct 12, 2020

Conditioning Trick for Training Stable GANs

arXiv:2010.05844v1
Originality: Incremental advance
AI Analysis

This paper addresses instability in GAN training for audio synthesis, offering incremental improvements in fidelity and variety for environmental and voice sounds.

The paper tackles GAN training instability by proposing a conditioning trick that forces the generator to match the departure from normality of real samples in the spectral domain, applied to audio spectrogram generation. Experimental results on the UrbanSound8k, ESC-50, and Mozilla Common Voice datasets show the method outperforms baselines in inception score, Fréchet inception distance, and signal-to-noise ratio.

In this paper we propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training. We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of the Schur decomposition. This binding makes the generator amenable to truncation and does not limit exploring all the possible modes. We slightly modify the BigGAN architecture by incorporating a residual network for synthesizing 2D representations of audio signals, which enables reconstructing high-quality sounds with some preserved phase information. Additionally, the proposed conditional training scenario makes a trade-off between fidelity and variety for the generated spectrograms. The experimental results on the UrbanSound8k and ESC-50 environmental sound datasets and the Mozilla Common Voice dataset show that the proposed GAN configuration with the conditioning trick remarkably outperforms baseline architectures according to three objective metrics: inception score, Fréchet inception distance, and signal-to-noise ratio.
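
To make the central quantity concrete, here is a minimal sketch of a departure-from-normality measure computed from a Schur decomposition. It uses the standard Frobenius-norm definition (the norm of the strictly upper triangular part of the Schur form), which may differ from the exact function used in the paper. The function names `departure_from_normality` and `dfn_gap`, the restriction to square matrices, and the absolute-difference penalty are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import schur


def departure_from_normality(a):
    """Frobenius-norm departure from normality of a square matrix."""
    a = np.asarray(a, dtype=complex)
    if a.ndim != 2 or a.shape[0] != a.shape[1]:
        raise ValueError("expected a square matrix")
    # Complex Schur form: a = z @ t @ z.conj().T, with t upper triangular
    # and the eigenvalues of a on the diagonal of t.
    t, _ = schur(a, output="complex")
    # The strictly upper triangular part of t is zero exactly when a is normal.
    return float(np.linalg.norm(np.triu(t, k=1), "fro"))


def dfn_gap(real, fake):
    """Hypothetical penalty: absolute gap between the two departures."""
    return abs(departure_from_normality(real) - departure_from_normality(fake))


# Toy stand-ins for a real and a generated 2D representation (assumed square).
real = np.array([[1.0, 2.0], [0.0, 3.0]])  # non-normal matrix
fake = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric, hence normal

print(departure_from_normality(real))
print(departure_from_normality(fake))
print(dfn_gap(real, fake))
```

In a training loop, a term like `dfn_gap` would presumably be added to the generator objective so that generated spectrograms drift toward the departure from normality of real ones. How the paper weights or batches such a term is not stated in the abstract.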

