Few-Shot Drum Transcription in Polyphonic Music
This work addresses the challenge of open-vocabulary automatic drum transcription (ADT), allowing a model to adapt to new drum sounds without retraining. The contribution is incremental in that it builds on existing few-shot learning methods.
The paper tackles a limitation of data-driven automatic drum transcription (ADT): models are tied to a small, predefined vocabulary of percussion classes. It introduces few-shot learning to the task, enabling recognition of out-of-vocabulary classes and adaptation to finer-grained vocabularies. Given a handful of examples at inference time, the model matches, and in some cases outperforms, a state-of-the-art supervised approach in fixed-vocabulary settings, and it generalizes to vocabularies unseen during training.
Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match and in some cases outperform a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.
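To make the few-shot idea concrete, the following is a minimal sketch of generic Prototypical Network inference, not the authors' exact setup. It assumes an encoder has already mapped audio frames to embeddings; the 2-D vectors, the class names, and the helper functions `prototypes` and `classify` are all hypothetical and chosen only for illustration. Each class prototype is the mean of its support embeddings, and a query is scored by a softmax over negative squared Euclidean distances to the prototypes.

```python
import math

def prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def classify(query, protos):
    """Return class probabilities: softmax over negative squared distances."""
    labels = list(protos)
    logits = [-sum((q - p) ** 2 for q, p in zip(query, protos[l])) for l in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}

# Toy embeddings standing in for encoder outputs (made-up numbers).
support = {
    "kick": [[0.0, 0.1], [0.2, -0.1]],
    "snare": [[1.0, 1.0], [1.2, 0.8]],
    "cowbell": [[-1.0, 1.0], [-0.8, 1.2]],  # a class supplied only at inference time
}
protos = prototypes(support)
probs = classify([0.9, 1.1], protos)
best = max(probs, key=probs.get)
```

Because the prototypes are computed from the support examples alone, adding a class such as the hypothetical "cowbell" above only requires supplying a few labeled examples; no encoder weights change. A transcription system would additionally need to handle simultaneous drum hits, which this single-label sketch does not attempt.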