SD | AI | LG | AS | May 4, 2021

VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

arXiv:2105.01531v2
7 citations
Originality: Incremental advance
AI Analysis

The paper addresses a practical limitation of audio synthesis, particularly in music, where fixed-length outputs are restrictive. The contribution appears incremental, since it builds on existing GAN and VQCPC methods.

The paper tackles variable-length audio generation, which is difficult for standard GANs because they operate on fixed-size spectrograms. It proposes VQCPC-GAN, an adversarial framework that takes Vector-Quantized Contrastive Predictive Coding (VQCPC) tokens as conditional input. Results show performance comparable to strong baselines even when generating variable-length audio.

Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the "image data". However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio by exploiting Vector-Quantized Contrastive Predictive Coding (VQCPC). A sequence of VQCPC tokens extracted from real audio data serves as conditional input to a GAN architecture, providing step-wise time-dependent features of the generated content. The input noise z (characteristic in adversarial architectures) remains fixed over time, ensuring temporal consistency of global features. We evaluate the proposed model by comparing a diverse set of metrics against various strong baselines. Results show that, even though the baselines score best, VQCPC-GAN achieves comparable performance even when generating variable-length audio. Numerous sound examples are provided in the accompanying website, and we release the code for reproducibility.
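The conditioning scheme described in the abstract can be illustrated with a small sketch: one token per time step carries the time-dependent information, while a single noise vector z is reused at every step. This is not the authors' implementation. The one-hot encoding, the concatenation, and the names `build_generator_input`, `CODEBOOK_SIZE`, and `NOISE_DIM` are assumptions made for illustration, and the token values are made up rather than extracted by a VQCPC encoder.

```python
import random


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"token {index} outside codebook of size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_generator_input(tokens, z, codebook_size):
    """Pair each step's token with the same noise vector z.

    Returns one conditioning vector per time step, so the sequence length
    follows the number of tokens instead of a fixed spectrogram size.
    (Hypothetical helper; concatenation is an assumed conditioning scheme.)
    """
    return [one_hot(t, codebook_size) + list(z) for t in tokens]


rng = random.Random(0)
CODEBOOK_SIZE = 16  # hypothetical number of VQCPC codes
NOISE_DIM = 4       # hypothetical noise dimensionality

# z is sampled once per clip and held fixed over time.
z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]

# Made-up token sequences of different lengths share the same z.
short_clip = build_generator_input([3, 3, 7], z, CODEBOOK_SIZE)
long_clip = build_generator_input([3, 3, 7, 7, 1, 0, 12], z, CODEBOOK_SIZE)

print(len(short_clip), len(long_clip))
```

Because z is identical at every step, only the token part of each vector changes over time. That matches the abstract's split between step-wise time-dependent features and temporally consistent global features, and the output length is set by the token sequence alone.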

Code Implementations: 1 repo
