Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset
This work addresses the need for better perceptual quality in drum transcription for music generation applications. The contribution is incremental, since it builds on existing datasets and methods.
The authors address automatic drum transcription by introducing the Expanded Groove MIDI dataset (E-GMD), which contains 444 hours of audio with human-performed velocity annotations. They show that classifiers optimized to predict expressive dynamics produce outputs with better perceptual quality in listening tests, even though their classification metrics remain similar.
We introduce the Expanded Groove MIDI dataset (E-GMD), an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human-performed velocity annotations. We use E-GMD to optimize classifiers for use in downstream generation by predicting expressive dynamics (velocity) and show with listening tests that they produce outputs with improved perceptual quality, despite similar results on classification metrics. Via the listening tests, we argue that standard classifier metrics, such as accuracy and F-measure score, are insufficient proxies of performance in downstream tasks because they do not fully align with the perceptual quality of generated outputs.
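The gap between classification metrics and perceptual quality can be illustrated with a small sketch. It is not the authors' evaluation code: the note format, the greedy matching, the 50 ms tolerance, and the helper names (`match_notes`, `onset_f_measure`, `mean_velocity_error`) are all assumptions made for illustration. The point is only that an onset-based F-measure cannot distinguish a transcription with flat dynamics from one that tracks the performed velocities.

```python
from typing import List, Tuple

# Hypothetical note format: (onset in seconds, MIDI drum pitch, velocity 0-127).
Note = Tuple[float, int, int]


def match_notes(reference: List[Note], estimate: List[Note],
                tolerance: float = 0.05) -> List[Tuple[Note, Note]]:
    """Greedily pair each estimated note with an unused reference note of the
    same drum pitch whose onset lies within `tolerance` seconds (assumed value)."""
    unmatched = list(reference)
    pairs = []
    for est in estimate:
        for i, ref in enumerate(unmatched):
            if est[1] == ref[1] and abs(est[0] - ref[0]) <= tolerance:
                pairs.append((ref, est))
                del unmatched[i]
                break
    return pairs


def onset_f_measure(reference: List[Note], estimate: List[Note]) -> float:
    """F-measure over onset and drum pitch only; velocity is ignored."""
    hits = len(match_notes(reference, estimate))
    if hits == 0:
        return 0.0
    precision = hits / len(estimate)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_velocity_error(reference: List[Note], estimate: List[Note]) -> float:
    """Mean absolute velocity difference over the matched notes."""
    pairs = match_notes(reference, estimate)
    if not pairs:
        return float("nan")
    return sum(abs(ref[2] - est[2]) for ref, est in pairs) / len(pairs)


# Made-up example: kick (36) and snare (38) hits with varied dynamics.
reference = [(0.0, 36, 100), (0.5, 38, 40), (1.0, 36, 110)]
flat = [(0.0, 36, 80), (0.5, 38, 80), (1.0, 36, 80)]          # constant velocity
expressive = [(0.0, 36, 98), (0.5, 38, 45), (1.0, 36, 105)]   # tracks dynamics

for name, est in [("flat", flat), ("expressive", expressive)]:
    print(name, onset_f_measure(reference, est), mean_velocity_error(reference, est))
```

Both estimates score an F-measure of 1.0 in this toy case, yet their mean velocity errors differ (30.0 versus 4.0), which is the kind of difference a listener hears but an onset-only metric does not register.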