CV, Nov 5, 2020

Lets Play Music: Audio-driven Performance Video Generation

arXiv:2011.02631v1, 7 citations
AI Analysis

This paper addresses the challenge of audio-driven video generation for performance applications, an incremental advance in multimodal synthesis.

The authors tackle the problem of generating realistic, synchronized videos of a person playing an instrument from music audio. They propose a multi-staged framework for this new task and report experiments that validate its effectiveness.

We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip. It is a challenging task to generate the high-dimensional temporal consistent videos from low-dimensional audio modality. In this paper, we propose a multi-staged framework to achieve this new task to generate realistic and synchronized performance video from given music. Firstly, we provide both global appearance and local spatial information by generating the coarse videos and keypoints of body and hands from a given music respectively. Then, we propose to transform the generated keypoints to heatmap via a differentiable space transformer, since the heatmap offers more spatial information but is harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract both intra-frame structured information and inter-frame temporal consistency. They are obtained via graph-based structure module, and CNN-GRU based high-level temporal module respectively for final video generation. Comprehensive experiments validate the effectiveness of our proposed framework.
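The abstract does not say how the differentiable space transformer is built. As a rough illustration of the keypoint-to-heatmap step only, the sketch below assumes each keypoint is rendered as an isotropic Gaussian bump, which is a common choice and not necessarily the authors' method. The function name `keypoints_to_heatmaps`, the `sigma` parameter, and the image size are hypothetical. The expression is smooth in the keypoint coordinates, so the same formula would carry gradients if written in an autodiff framework.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render K keypoints, given as (x, y) pixel coordinates, into K heatmaps.

    Assumption: one isotropic Gaussian per keypoint. This is an illustrative
    stand-in, not the paper's differentiable space transformer.
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)      # (K, 2)
    ys = np.arange(height, dtype=np.float64)[None, :, None]  # (1, H, 1)
    xs = np.arange(width, dtype=np.float64)[None, None, :]   # (1, 1, W)
    kx = keypoints[:, 0][:, None, None]                      # (K, 1, 1)
    ky = keypoints[:, 1][:, None, None]                      # (K, 1, 1)
    # Squared distance from every pixel to each keypoint, broadcast to (K, H, W).
    sq_dist = (xs - kx) ** 2 + (ys - ky) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Two hypothetical keypoints on a 32x48 frame.
heatmaps = keypoints_to_heatmaps([[10.0, 5.0], [20.5, 12.0]], height=32, width=48)
```

Each heatmap peaks at its keypoint and falls off with distance, which gives a downstream network a dense spatial signal in place of a bare coordinate pair.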

