Contrastive timbre representations for musical instrument and synthesizer retrieval
This work addresses the retrieval of specific instrument timbres from audio mixtures, a challenge that musicians and producers face in digital music production, and offers an incremental improvement over existing methods.
The paper tackles the problem of retrieving specific instrument timbres from audio mixtures in digital music production by introducing a contrastive learning framework, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
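To make the two ideas in the abstract concrete, the following is a minimal sketch of a contrastive objective and of top-k retrieval scoring. It rests on assumptions that are not stated in the abstract: the loss shown is the common InfoNCE formulation, similarity is cosine similarity, and the embeddings are plain NumPy arrays. The function names `info_nce_loss` and `top_k_accuracy` are hypothetical and do not come from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss (assumed here, not confirmed by the abstract).

    Row i of `anchors` is paired with row i of `positives`;
    every other row in the batch acts as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    # Subtract the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def top_k_accuracy(queries, database, true_ids, k=1):
    """Fraction of queries whose true instrument is among the k nearest entries."""
    sims = l2_normalize(queries) @ l2_normalize(database).T
    # Indices of the k most similar database entries per query.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_ids)[:, None]).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    # Toy database of 4 "instrument" embeddings and 2 query embeddings.
    database = np.eye(4)
    queries = np.array([[0.1, 0.0, 0.9, 0.0],
                        [1.0, 0.2, 0.0, 0.0]])
    print("top-1:", top_k_accuracy(queries, database, [2, 0], k=1))
    print("loss:", info_nce_loss(np.eye(3), np.eye(3)))
```

In this sketch a top-1 hit means the nearest database embedding is the correct instrument, and a top-5 hit means the correct instrument appears among the five nearest. For a mixture query, one would presumably score each instrument in the mixture separately, but the abstract does not say how the reported figures are aggregated.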