SD, AI, CV, MM, AS | Sep 10, 2025

PianoVAM: A Multimodal Piano Performance Dataset

arXiv:2509.08800v1 | 4 citations | h-index: 8 | ISMIR
Originality: Synthesis-oriented
AI Analysis

This dataset addresses a data scarcity problem in the music information retrieval community and enables research on multimodal analysis of piano performance. The contribution is incremental, since it builds on existing data collection efforts.

The authors address the lack of multimodal piano performance data by introducing PianoVAM, a dataset of videos, audio, MIDI, hand landmarks, and fingering labels recorded from amateur pianists under varied conditions. They also report benchmark results for piano transcription.

The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
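
The abstract does not spell out the fingering annotation algorithm, so the following is only a minimal illustrative sketch of the general idea of deriving a finger label from hand landmarks, not the authors' method. It assumes a 21-point hand landmark layout with fingertips at indices 4, 8, 12, 16, and 20 (the layout used by MediaPipe Hands; the paper's actual model is not named here), normalized image coordinates, and a known key centre for each note. The names `nearest_finger`, `label_note`, and the `max_dist` threshold are hypothetical.

```python
import math

# Assumed fingertip indices in a 21-point hand layout (matches MediaPipe Hands).
# Keys are standard piano finger numbers: 1 = thumb ... 5 = little finger.
FINGERTIPS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}


def nearest_finger(landmarks, key_xy):
    """Return (finger, distance) for the fingertip closest to a key centre.

    landmarks: list of 21 (x, y) points for one hand, in image coordinates.
    key_xy: (x, y) centre of the pressed key in the same coordinates.
    """
    best_finger, best_dist = None, math.inf
    for finger, idx in FINGERTIPS.items():
        x, y = landmarks[idx]
        dist = math.hypot(x - key_xy[0], y - key_xy[1])
        if dist < best_dist:
            best_finger, best_dist = finger, dist
    return best_finger, best_dist


def label_note(landmarks, key_xy, max_dist=0.05):
    """Propose a finger label, or None when no fingertip is close enough.

    None stands for a note left to manual review, which is one way to read
    "semi-automated"; the threshold value is an arbitrary assumption.
    """
    finger, dist = nearest_finger(landmarks, key_xy)
    return finger if dist <= max_dist else None
```

In this sketch the landmarks would be taken from the video frame nearest the MIDI note onset, and any note returning `None` would be flagged for a human annotator rather than labeled automatically.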
