CV, Jul 23, 2022

Audio-driven Neural Gesture Reenactment with Video Motion Graphs

Yang Zhou, Jimei Yang, Dingzeyu Li, Jun Saito, Deepali Aneja, Evangelos Kalogerakis

arXiv:2207.11524v114.130 citations, h-index: 41

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating realistic videos whose gestures are synchronized with speech, for applications such as virtual avatars and content creation. It is an incremental advance in that it builds on existing reenactment techniques.

The paper tackles the problem of reenacting a high-quality video with gestures that match a target speech audio by splitting and reassembling clips from a reference video using a video motion graph and pose-aware blending. It demonstrates higher quality and consistency with audio compared to previous methods, as shown in quantitative, qualitative, and user study evaluations.

Human speech is often accompanied by body gestures including arm and hand gestures. We present a method that reenacts a high-quality video with gestures matching a target speech audio. The key idea of our method is to split and re-assemble clips from a reference video through a novel video motion graph encoding valid transitions between clips. To seamlessly connect different clips in the reenactment, we propose a pose-aware video blending network which synthesizes video frames around the stitched frames between two clips. Moreover, we developed an audio-based gesture searching algorithm to find the optimal order of the reenacted frames. Our system generates reenactments that are consistent with both the audio rhythms and the speech content. We evaluate our synthesized video quality quantitatively, qualitatively, and with user studies, demonstrating that our method produces videos of much higher quality and consistency with the target audio compared to previous work and baselines.
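The abstract's two core ideas, a graph of valid transitions between frames and a search for the best frame order, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: poses are plain lists of floats, a synthetic transition is allowed whenever two frames' poses lie within a hypothetical `threshold`, and the search is ordinary dynamic programming over a caller-supplied `frame_cost(t, node)` that stands in for an audio-matching score. The names `build_motion_graph` and `best_path` are hypothetical.

```python
import math


def build_motion_graph(poses, threshold):
    """Return {frame: [reachable frames]} for a list of per-frame pose vectors.

    Assumption: a jump between two frames is "valid" when their poses are
    within `threshold` of each other (Euclidean distance).
    """
    n = len(poses)
    graph = {i: [] for i in range(n)}
    for i in range(n - 1):
        graph[i].append(i + 1)  # natural edge: the next frame in the reference video
    for i in range(n):
        for j in range(n):
            if j not in (i, i + 1) and math.dist(poses[i], poses[j]) <= threshold:
                graph[i].append(j)  # synthetic edge: jump to a frame with a similar pose
    return graph


def best_path(graph, frame_cost, steps):
    """Find the minimum-cost walk of `steps` frames through the graph.

    `frame_cost(t, node)` is a hypothetical score for showing frame `node`
    at output time `t` (lower is better), e.g. a mismatch with the audio.
    """
    cost = {v: frame_cost(0, v) for v in graph}
    back = []  # back[t - 1][v] is the best predecessor of v at time t
    for t in range(1, steps):
        new_cost, prev = {}, {}
        for u in cost:
            for v in graph[u]:
                c = cost[u] + frame_cost(t, v)
                if v not in new_cost or c < new_cost[v]:
                    new_cost[v], prev[v] = c, u
        if not new_cost:
            raise ValueError("no valid walk of the requested length")
        cost = new_cost
        back.append(prev)
    end = min(cost, key=cost.get)
    path = [end]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]
```

For example, with one-dimensional poses `[[0.0], [1.0], [2.0], [0.1], [1.1]]` and a threshold of `0.2`, frames 0 and 3 become mutually reachable, as do frames 1 and 4, so a walk can leave the original frame order wherever the poses nearly coincide. A real system would also need to synthesize the blended frames around each jump, which this sketch does not attempt.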
