Sequence-to-Sequence Piano Transcription with Transformers
This work simplifies music transcription for researchers and practitioners by reducing the need for domain-specific model design, although the contribution is incremental: it applies an existing method to a new domain.
The authors address automatic music transcription by showing that a generic encoder-decoder Transformer can match the performance of custom deep neural networks. The model translates spectrograms directly into MIDI-like events, so no task-specific architecture is required.
Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like output events for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and labeling rather than custom model design.
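To make the idea of "MIDI-like output events" concrete, the following is a minimal sketch of how a set of notes might be flattened into a token sequence that a sequence-to-sequence decoder could emit, and then recovered from it. The event names (`time`, `velocity`, `note`, `eos`), the 10 ms time resolution, and the use of velocity 0 to mark a note-off are all illustrative assumptions; the abstract above does not specify the vocabulary.

```python
from typing import Dict, List, Tuple

# (onset_sec, offset_sec, midi_pitch, velocity) -- hypothetical note format
Note = Tuple[float, float, int, int]
# (event_kind, value) -- hypothetical MIDI-like token
Event = Tuple[str, int]

STEPS_PER_SECOND = 100  # assumed 10 ms time resolution


def notes_to_events(notes: List[Note]) -> List[Event]:
    """Flatten notes into a time-ordered sequence of MIDI-like events."""
    changes = []
    for onset, offset, pitch, velocity in notes:
        changes.append((round(onset * STEPS_PER_SECOND), pitch, velocity))
        # Assumption: velocity 0 encodes a note-off.
        changes.append((round(offset * STEPS_PER_SECOND), pitch, 0))
    # Order by time; at equal times, emit note-offs before note-ons.
    changes.sort(key=lambda c: (c[0], c[2] != 0, c[1]))

    events: List[Event] = []
    current_time = None
    for time_bin, pitch, velocity in changes:
        if time_bin != current_time:
            events.append(("time", time_bin))
            current_time = time_bin
        events.append(("velocity", velocity))
        events.append(("note", pitch))
    events.append(("eos", 0))
    return events


def events_to_notes(events: List[Event]) -> List[Note]:
    """Invert notes_to_events by replaying the event stream."""
    notes: List[Note] = []
    active: Dict[int, Tuple[int, int]] = {}  # pitch -> (onset_bin, velocity)
    time_bin, velocity = 0, 0
    for kind, value in events:
        if kind == "time":
            time_bin = value
        elif kind == "velocity":
            velocity = value
        elif kind == "note":
            if velocity > 0:
                active[value] = (time_bin, velocity)
            elif value in active:
                onset_bin, onset_velocity = active.pop(value)
                notes.append((onset_bin / STEPS_PER_SECOND,
                              time_bin / STEPS_PER_SECOND,
                              value, onset_velocity))
        elif kind == "eos":
            break
    return sorted(notes)


if __name__ == "__main__":
    example = [(0.0, 0.5, 60, 80), (0.25, 0.75, 64, 70)]
    for event in notes_to_events(example):
        print(event)
```

In a setup like this, the encoder would consume spectrogram frames and the decoder would predict such tokens one at a time, so the output dependencies between events are modeled the same way a language model treats words.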