CV · Nov 13, 2024

Pay Attention to the Keys: Visual Piano Transcription Using Transformers

arXiv:2411.09037v2 · h-index: 1 · IJCAI
Originality: Incremental advance
AI Analysis

This work addresses the automatic transcription of piano performances from video, a task of interest to musicians and researchers. It is an incremental improvement over existing methods.

The authors tackled visual piano transcription from top-down videos by proposing a vision transformer-based system that outperforms previous CNN methods, achieving state-of-the-art results on onset prediction in the PianoYT dataset and on both onsets and offsets in the R3 dataset.

Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state-of-the-art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.

