Towards Realistic Synthetic Data for Automatic Drum Transcription
This work addresses the data bottleneck facing researchers and practitioners in music information retrieval by providing a scalable way to train drum transcription models without paired audio-MIDI datasets.
The paper tackles the scarcity of paired audio-MIDI data for Automatic Drum Transcription (ADT). It introduces a semi-supervised method to curate a diverse corpus of one-shot drum samples from unlabeled audio. This corpus is used to synthesize a high-quality dataset and train a model that achieves new state-of-the-art results on the ENST and MDB test sets, significantly outperforming previous methods.
Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at https://github.com/pier-maker92/ADT_STR
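The synthesis step described above, rendering audio from MIDI-like drum events with one-shot samples, can be illustrated with a minimal sketch. This is not the authors' pipeline. The function name `render_drum_track`, the event tuple layout, the toy corpus, and the choice of one sample per drum class per track are all assumptions made for illustration. Because the events drive the rendering, they also serve as the transcription labels, which is why no paired recordings are required.

```python
import random

def render_drum_track(events, corpus, sample_rate=16000, seed=0):
    """Mix one-shot samples into a mono buffer at the given onsets.

    events: list of (onset_seconds, drum_class, velocity) with velocity in 0..127
    corpus: dict mapping drum_class -> list of one-shots (each a list of floats)
    Both structures are hypothetical stand-ins for a MIDI file and a curated corpus.
    """
    if not events:
        return []
    rng = random.Random(seed)
    # Assumption: draw one sample per class for the whole track, mimicking a single kit.
    kit = {cls: rng.choice(shots) for cls, shots in corpus.items()}
    # The buffer must be long enough to hold the last sample in full.
    length = max(int(round(t * sample_rate)) + len(kit[c]) for t, c, _ in events)
    audio = [0.0] * length
    for onset, cls, velocity in events:
        start = int(round(onset * sample_rate))
        gain = velocity / 127.0  # linear velocity-to-gain mapping (a simplification)
        for i, x in enumerate(kit[cls]):
            audio[start + i] += gain * x
    return audio

# Toy corpus and events; real one-shots would be loaded from audio files.
corpus = {
    "kick": [[1.0, 0.5, 0.25]],
    "snare": [[0.8, -0.8, 0.4, -0.4]],
}
events = [(0.0, "kick", 127), (0.5, "snare", 127), (1.0, "kick", 64)]
audio = render_drum_track(events, corpus, sample_rate=8)
```

In this sketch, acoustic diversity comes from drawing a different kit for each rendered track, so a larger and more varied one-shot corpus directly yields more varied training audio.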