Music Transcription with (Almost) No Supervision
For music transcription researchers, this work demonstrates a practical method to reduce reliance on scarce paired data by leveraging abundant unpaired data, offering a path to high-quality transcription for instruments with limited labeled data.
The paper tackles music transcription with limited paired audio-score data by using a cycle-consistent translation framework that leverages unpaired audio and score data. Results show that unpaired data yields large gains, with unpaired audio contributing more than unpaired scores, and that incorporating unlabeled audio from a new instrument improves transcription without paired supervision.
Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone largely unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, allowing the model to learn from the much larger unpaired pool. We find that (i) unpaired data yields surprisingly large gains, especially under limited supervision; (ii) unpaired audio contributes more than unpaired scores; and (iii) incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.
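To make the idea of a cycle-consistent objective anchored by a small paired set concrete, the following is a minimal sketch and not the paper's actual objective. It assumes audio and scores are plain fixed-length vectors, that the two translation directions are arbitrary callables, and that all three terms are mean squared errors; the names combined_loss, transcribe, synthesize, and cycle_weight are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], List[float]]


def mse(x: Vector, y: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; returns 0.0 for an empty collection."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def combined_loss(
    transcribe: Model,   # audio -> score (hypothetical stand-in)
    synthesize: Model,   # score -> audio (hypothetical stand-in)
    paired: Sequence[Tuple[Vector, Vector]],
    unpaired_audio: Sequence[Vector],
    unpaired_scores: Sequence[Vector],
    cycle_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    # Supervised anchor: the small paired set ties transcriptions to real scores.
    supervised = mean(mse(transcribe(a), s) for a, s in paired)
    # Audio cycle: audio -> score -> audio should reconstruct the input audio.
    audio_cycle = mean(mse(synthesize(transcribe(a)), a) for a in unpaired_audio)
    # Score cycle: score -> audio -> score should reconstruct the input score.
    score_cycle = mean(mse(transcribe(synthesize(s)), s) for s in unpaired_scores)
    return supervised + cycle_weight * (audio_cycle + score_cycle)


# Toy models: halving and doubling are exact inverses, tripling is not.
def halve(v: Vector) -> List[float]:
    return [0.5 * x for x in v]


def double(v: Vector) -> List[float]:
    return [2.0 * x for x in v]


def triple(v: Vector) -> List[float]:
    return [3.0 * x for x in v]


paired = [([2.0, 4.0], [1.0, 2.0])]
unpaired_audio = [[2.0, 4.0]]
unpaired_scores = [[1.0, 3.0]]

consistent = combined_loss(halve, double, paired, unpaired_audio, unpaired_scores)
inconsistent = combined_loss(halve, triple, paired, unpaired_audio, unpaired_scores)
```

In this toy setting the supervised term is zero in both cases, so only the cycle terms separate the consistent pair of models from the inconsistent one. That is the sense in which unpaired audio and unpaired scores can supply a training signal without labels. A real system would replace the callables with learned models and would need a reconstruction loss suited to its audio and score representations.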