A holistic approach to polyphonic music transcription with neural networks
This work addresses the challenge of producing complete music scores from audio for musicians and researchers. It is incremental, however, because it builds on existing neural network methods and operates in a simplified setting.
The authors tackle automatic music transcription by developing an end-to-end neural network framework that generates music notation directly from polyphonic audio, bypassing intermediate representations such as piano rolls. Results indicate that the model can learn to transcribe scores from audio, though it was evaluated only in a simplified scenario with synthesized classical music.
We present a neural network framework that extracts music scores directly from polyphonic audio in an end-to-end fashion. Most previous Automatic Music Transcription (AMT) methods seek a piano-roll representation of the pitches, which can then be transformed into a score by incorporating tempo estimation, beat tracking, key estimation, or rhythm quantization. Unlike these methods, our approach generates music notation directly from the input audio in a single stage. To this end, we use a Convolutional Recurrent Neural Network (CRNN) trained with a Connectionist Temporal Classification (CTC) loss function, which does not require annotated alignments between audio frames and the rhythmic information in the score. We trained our model on Haydn, Mozart, and Beethoven string quartets and Bach chorales, synthesized with different tempos and expressive performances. The output is a textual representation of four-voice music scores based on the **kern format. Although the proposed approach is evaluated in a simplified scenario, the results show that the model can learn to transcribe scores directly from audio signals, opening a promising avenue towards complete AMT.
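To illustrate why CTC removes the need for frame-level alignments, the following minimal sketch implements the standard CTC collapsing rule (merge repeated symbols, then drop blanks) together with a greedy decoder. It is not the authors' implementation: the function names, the blank symbol `<b>`, and the toy vocabulary of **kern-like tokens are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical blank symbol; CTC vocabularies reserve one entry for it


def ctc_collapse(path):
    """Map a frame-level path to a label sequence: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]


def greedy_decode(frame_probs, vocab):
    """Pick the most probable symbol in each frame, then collapse the resulting path."""
    best_path = [
        vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_probs
    ]
    return ctc_collapse(best_path)


# Toy vocabulary (illustrative tokens, loosely modelled on **kern note tokens).
VOCAB = [BLANK, "4c", "4e", "8g"]

# Two different frame alignments of the same three-token target,
# e.g. the same passage rendered at a slow and at a fast tempo.
slow = ["4c", "4c", "4c", BLANK, "4e", "4e", BLANK, "8g"]
fast = ["4c", BLANK, "4e", "8g"]

# Both alignments collapse to the same label sequence.
same_target = ctc_collapse(slow) == ctc_collapse(fast)

# A blank between two identical symbols keeps them as separate tokens.
repeated = ctc_collapse(["4c", BLANK, "4c"])

# Per-frame probabilities over VOCAB (made-up numbers for the sketch).
probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.7, 0.1],
]
decoded = greedy_decode(probs, VOCAB)
```

Because many frame-level paths collapse to the same token sequence, the CTC loss can sum over all of them during training, so only the target score is needed as supervision and not the position of each symbol in the audio.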