SD · AI · LG · Apr 21, 2025

Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

arXiv:2504.15071v1 · 10 citations · h-index: 3 · Has Code · ICLR
Originality: Synthesis-oriented
AI Analysis

Aria-MIDI gives music AI researchers a large-scale resource, though the contribution is incremental: it focuses on data collection rather than novel modeling.

The authors tackled the problem of limited training data for symbolic music modeling by creating Aria-MIDI, a dataset of over one million piano MIDI files transcribed from audio recordings, totaling roughly 100,000 hours of content.

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.
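The two filtering stages described in the abstract (metadata scoring, then classifier-based pruning and segmentation) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper or its repository: the `Recording` type, the keyword-based `keyword_score` standing in for the language model, the per-second `frame_scores` standing in for audio classifier output, and all thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Recording:
    """Hypothetical record for one crawled audio file."""
    title: str
    description: str
    frame_scores: List[float]  # assumed per-second "solo piano" probabilities


def keyword_score(rec: Recording) -> float:
    """Toy stand-in for the language-model metadata scorer (not the authors' model)."""
    text = f"{rec.title} {rec.description}".lower()
    hits = sum(word in text for word in ("piano", "sonata", "nocturne"))
    return min(1.0, hits / 2)


def piano_segments(
    scores: List[float], threshold: float = 0.5, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where scores stay >= threshold.

    Runs shorter than min_len frames are discarded (the pruning step).
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a piano run begins
        elif score < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the run ended
    # Close a run that extends to the end of the recording.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


def run_pipeline(
    recordings: List[Recording],
    score_fn: Callable[[Recording], float] = keyword_score,
    min_score: float = 0.5,
) -> Dict[str, List[Tuple[int, int]]]:
    """Stage 1: drop low-scoring metadata. Stage 2: keep only piano segments."""
    kept: Dict[str, List[Tuple[int, int]]] = {}
    for rec in recordings:
        if score_fn(rec) < min_score:
            continue
        segments = piano_segments(rec.frame_scores)
        if segments:
            kept[rec.title] = segments
    return kept
```

In this sketch, each surviving segment would then be passed to a transcription model to produce one MIDI file; that step is omitted because the abstract does not describe it in detail.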

Code Implementations: 1 repo
