Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data
This work addresses data scarcity in audio generation by using perceptual metrics as training losses. The contribution is incremental: it applies an existing idea in a new setting.
The paper tackles the problem of training generative models without natural data by using perceptual metrics as loss functions. Models trained on uniform noise with perceptual losses reconstruct spectrograms and re-synthesized audio better than models trained with a Euclidean loss, which indicates better generalization to unseen natural signals.
Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise, in lieu of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalisation to unseen natural signals when using perceptual metrics.
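The contrast between the two kinds of loss can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which perceptual metrics are used, so `ssim_like_loss` is a simplified, patch-wise structural-similarity stand-in, and the function names, batch shape, and constants are hypothetical. No autoencoder is trained here; the snippet only shows how uniform-noise inputs would be drawn and how the two losses score the same degenerate "reconstruction".

```python
import numpy as np


def uniform_noise_batch(rng, batch_size=8, height=64, width=64):
    """Draw i.i.d. uniform noise in [0, 1) as stand-in training 'spectrograms'."""
    return rng.uniform(0.0, 1.0, size=(batch_size, height, width))


def euclidean_loss(x, y):
    """Baseline loss: mean squared error over all elements."""
    return float(np.mean((x - y) ** 2))


def _to_patches(a, patch):
    """Split (B, H, W) into non-overlapping patches of shape (B, N, patch*patch)."""
    b, h, w = a.shape
    a = a[:, : h - h % patch, : w - w % patch]
    a = a.reshape(b, h // patch, patch, w // patch, patch)
    return a.transpose(0, 1, 3, 2, 4).reshape(b, -1, patch * patch)


def ssim_like_loss(x, y, patch=8, c1=0.01 ** 2, c2=0.03 ** 2):
    """Assumed stand-in for a perceptual loss: 1 - mean patch-wise SSIM."""
    px, py = _to_patches(x, patch), _to_patches(y, patch)
    mu_x, mu_y = px.mean(-1), py.mean(-1)
    var_x, var_y = px.var(-1), py.var(-1)
    cov = ((px - mu_x[..., None]) * (py - mu_y[..., None])).mean(-1)
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return float(1.0 - ssim.mean())


rng = np.random.default_rng(0)
x = uniform_noise_batch(rng)

# A degenerate "reconstruction" that keeps each input's mean and discards all structure.
flat = np.broadcast_to(x.mean(axis=(1, 2), keepdims=True), x.shape)

print("Euclidean loss:", euclidean_loss(x, flat))
print("SSIM-like loss:", ssim_like_loss(x, flat))
```

In a training loop, either function would be replaced by a differentiable version and applied between the noise input and the autoencoder output. The flat reconstruction gets a moderate Euclidean loss but an SSIM-like loss close to its maximum, which gives a rough intuition for how a structure-aware loss can steer a model differently from a pixel-wise one.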