CV · Nov 13, 2024

Pay Attention to the Keys: Visual Piano Transcription Using Transformers

Uros Zivanovic, Ivan Pilkov, Carlos Eduardo Cancino-Chacón

arXiv:2411.09037v2 · h-index: 1 · IJCAI

Originality: Incremental advance

AI Analysis

This work addresses the automatic transcription of piano performances from video, a task relevant to musicians and researchers. It is an incremental improvement over existing methods.

The authors tackled visual piano transcription from top-down videos by proposing a vision transformer-based system that outperforms previous CNN methods, achieving state-of-the-art results on onset prediction in the PianoYT dataset and on both onsets and offsets in the R3 dataset.

Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state of the art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.

