JS Fake Chorales: a Synthetic Dataset of Polyphonic Music with Human Annotation
This work addresses the dataset-scarcity bottleneck in developing interactive music generation technology, though the contribution is incremental because it uses a known algorithm for data generation.
The paper tackles the lack of large-scale datasets for polyphonic symbolic music by introducing JS Fake Chorales, a synthetic dataset of 500 pieces generated by a learning-based algorithm. In an online listening experiment, respondents were on average only 7% better than random guessing at distinguishing the synthetic pieces from real Bach chorales, and augmenting the training set with this dataset improves on state-of-the-art validation loss for the JSB Chorales dataset.
High-quality datasets for learning-based modelling of polyphonic symbolic music remain less readily accessible at scale than in other domains, such as language modelling or image classification. Deep learning algorithms show great potential for enabling the widespread use of interactive music generation technology in consumer applications, but the lack of large-scale datasets remains a bottleneck for the development of algorithms that can consistently generate high-quality outputs. We propose that models with narrow expertise can serve as a source of high-quality scalable synthetic data, and open-source the JS Fake Chorales, a dataset of 500 pieces generated by a new learning-based algorithm, provided in MIDI form. We take consecutive outputs from the algorithm and avoid cherry-picking in order to validate the potential to further scale this dataset on demand. We conduct an online experiment for human evaluation, designed to be as fair to the listener as possible, and find that respondents were on average only 7% better than random guessing at distinguishing JS Fake Chorales from real chorales composed by JS Bach. Furthermore, we make anonymised data collected from experiments available along with the MIDI samples. Finally, we conduct ablation studies to demonstrate the effectiveness of using the synthetic pieces for research in polyphonic music modelling, and find that we can improve on state-of-the-art validation set loss for the canonical JSB Chorales dataset, using a known algorithm, by simply augmenting the training set with the JS Fake Chorales.
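The augmentation setup described in the abstract might be organised roughly as in the sketch below. It is only an illustration under stated assumptions: the function names, the file names, and the tiny split sizes are hypothetical, and the abstract does not specify how the authors built their splits. The one property the sketch tries to capture is that synthetic pieces are added to the training set only, so validation loss is still measured on real JSB Chorales in both arms of the ablation.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A "piece" is represented here by a file name; in practice it could be
# a parsed MIDI object. All names below are hypothetical placeholders.
Piece = str


def augment_training_set(
    real_train: Sequence[Piece],
    synthetic: Sequence[Piece],
    seed: int = 0,
) -> List[Piece]:
    """Return the real training pieces plus all synthetic pieces, shuffled."""
    combined = list(real_train) + list(synthetic)
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(combined)
    return combined


def build_ablation_splits(
    real_train: Sequence[Piece],
    real_valid: Sequence[Piece],
    synthetic: Sequence[Piece],
) -> Dict[str, Tuple[List[Piece], List[Piece]]]:
    """Build (train, valid) pairs for a baseline arm and an augmented arm.

    The validation set is identical in both arms and contains real
    pieces only, so the two validation losses are directly comparable.
    """
    return {
        "baseline": (list(real_train), list(real_valid)),
        "augmented": (
            augment_training_set(real_train, synthetic),
            list(real_valid),
        ),
    }


# Toy example with made-up file names and sizes (not the real dataset).
real_train = [f"jsb_train_{i}.mid" for i in range(4)]
real_valid = [f"jsb_valid_{i}.mid" for i in range(2)]
synthetic = [f"js_fake_{i}.mid" for i in range(3)]

splits = build_ablation_splits(real_train, real_valid, synthetic)
for name, (train, valid) in splits.items():
    print(name, len(train), "training pieces,", len(valid), "validation pieces")
```

The same model and hyperparameters would then be trained once per arm, and the difference in validation loss attributed to the synthetic data.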