PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
This work addresses the need for realistic, real-time piano motion generation for virtual characters and music visualization, reporting over 9× faster inference than prior methods together with explicit modeling of cross-hand coordination.
PianoFlow introduces a flow-matching framework for audio-driven bimanual piano motion generation that uses MIDI as a privileged modality during training for better musical understanding, an asymmetric role-gated interaction module for dynamic cross-hand coordination, and an autoregressive flow continuation scheme for real-time streaming of long sequences. It achieves superior performance on PianoMotion10M and accelerates inference by over 9× compared to prior methods.
Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.
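To make the streaming idea concrete, the following is a minimal sketch of chunk-wise flow-matching sampling, not the authors' implementation. It assumes a straight-line probability path, a fixed-step Euler solver, and a continuation rule that simply pins the first few frames of each new chunk to the last frames of the previous one. The abstract does not specify how PianoFlow's continuation scheme works, so that rule is purely illustrative. All names (`toy_velocity`, `sample_chunk`, `stream_motion`, `chunk`, `overlap`) are hypothetical, and `toy_velocity` is an analytic stand-in for a trained velocity network.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network v(x, t | audio).

    For a straight-line path ending at `target`, the exact velocity at
    time t is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_chunk(noise, target, context=None, steps=8):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1 for one chunk.

    If `context` is given, the leading frames are overwritten with it after
    every step (an assumed, inpainting-style way to keep chunks coherent).
    """
    x = noise.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * toy_velocity(x, k * dt, target)
        if context is not None:
            x[: len(context)] = context
    return x


def stream_motion(cond, chunk=16, overlap=4, steps=8, seed=0):
    """Generate a long sequence chunk by chunk with overlapping context.

    `cond` has shape (frames, dim); in this toy setup it also plays the role
    of the ground-truth motion the velocity field points toward.
    """
    if not 0 < overlap < chunk:
        raise ValueError("require 0 < overlap < chunk")
    rng = np.random.default_rng(seed)
    total, dim = cond.shape
    out = np.empty((0, dim))
    while len(out) < total:
        # Reuse the tail of what has already been generated as context.
        ctx = out[len(out) - overlap:] if len(out) else None
        start = len(out) - (0 if ctx is None else len(ctx))
        end = min(start + chunk, total)
        noise = rng.standard_normal((end - start, dim))
        x = sample_chunk(noise, cond[start:end], ctx, steps)
        # Emit only the frames that were not already produced.
        new = x if ctx is None else x[len(ctx):]
        out = np.concatenate([out, new])
    return out


# Usage: 40 frames of 3-dimensional toy conditioning features.
cond = np.linspace(0.0, 1.0, 40 * 3).reshape(40, 3)
motion = stream_motion(cond)
```

Each chunk needs only a fixed number of solver steps regardless of total sequence length, which is what makes this style of generation suitable for streaming. In a real system the analytic velocity would be replaced by a network conditioned on audio features.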