One Billion Audio Sounds from GPU-enabled Modular Synthesis
The paper provides a massive-scale synthetic audio dataset and efficient generation tools for audio ML research, though its contribution is incremental with respect to dataset scaling.
The authors created synth1B1, a dataset of 1 billion synthesized sounds paired with their synthesis parameters, which is 100x larger than any audio dataset in the literature, and introduced torchsynth, a GPU-based modular synthesizer that generates samples 16200x faster than real-time on a single GPU. They also released two new audio datasets, used them to demonstrate new rank-based evaluation criteria for audio representations, and proposed an approach to synthesizer hyperparameter optimization.
We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, paired with the synthesis parameters used to generate them. The dataset is 100x larger than any audio dataset in the literature. We also introduce torchsynth, an open-source modular synthesizer that generates the synth1B1 samples on-the-fly at 16200x faster than real-time (714MHz) on a single GPU. Additionally, we release two new audio datasets: FM synth timbre and subtractive synth pitch. Using these datasets, we demonstrate new rank-based evaluation criteria for existing audio representations. Finally, we propose a novel approach to synthesizer hyperparameter optimization.
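
The throughput and scale figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the 44.1 kHz sample rate is an assumption (it is not stated in the abstract, but it is consistent with the quoted 714MHz figure), and the variable names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
SAMPLE_RATE_HZ = 44_100        # ASSUMPTION: not stated in the abstract
SPEEDUP = 16_200               # "16200x faster than real-time"
NUM_SOUNDS = 1_000_000_000     # 1 billion sounds
SECONDS_PER_SOUND = 4          # each sound is 4 seconds long

# Audio samples generated per wall-clock second at the quoted speedup.
samples_per_second = SPEEDUP * SAMPLE_RATE_HZ

# Total duration of the corpus if played back in real time.
total_audio_seconds = NUM_SOUNDS * SECONDS_PER_SOUND

# Wall-clock time to generate the whole corpus on one GPU at that speedup.
generation_seconds = total_audio_seconds / SPEEDUP

print(f"throughput: {samples_per_second / 1e6:.0f} MHz")
print(f"corpus duration: {total_audio_seconds / (365.25 * 24 * 3600):.1f} years")
print(f"generation time: {generation_seconds / 3600:.1f} hours")
```

Under the assumed sample rate, the computed throughput rounds to the 714MHz quoted in the abstract, and the same numbers imply that the full corpus could be regenerated in under three days on a single GPU, which is what makes on-the-fly generation practical.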