CV · Nov 27, 2022

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, Nannan Wang

Tsinghua

arXiv:2211.14758v1 · 29.5 · 169 citations · h-index: 58

Originality: Incremental advance

AI Analysis

This work addresses the need for automated, high-quality talking-head video editing among content creators and media professionals. It is an incremental advance, since it builds on existing lip-sync and face-editing techniques.

The paper tackles the problem of editing talking-head videos so that lip movements match a new audio input, even when the target emotion differs from the original. It proposes VideoReTalking, a sequential pipeline of canonical-expression face generation, audio-driven lip-sync, and face enhancement. The authors report better lip-sync accuracy and visual quality than state-of-the-art methods on two widely used datasets and on in-the-wild examples.

We present VideoReTalking, a new system to edit the faces of a real-world talking head video according to input audio, producing a high-quality and lip-syncing output video even with a different emotion. Our system disentangles this objective into three sequential tasks: (1) face video generation with a canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for improving photo-realism. Given a talking-head video, we first modify the expression of each frame according to the same expression template using the expression editing network, resulting in a video with the canonical expression. This video, together with the given audio, is then fed into the lip-sync network to generate a lip-syncing video. Finally, we improve the photo-realism of the synthesized faces through an identity-aware face enhancement network and post-processing. We use learning-based approaches for all three steps and all our modules can be tackled in a sequential pipeline without any user intervention. Furthermore, our system is a generic approach that does not need to be retrained to a specific person. Evaluations on two widely-used datasets and in-the-wild examples demonstrate the superiority of our framework over other state-of-the-art methods in terms of lip-sync accuracy and visual quality.
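The three-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frame` type, the function names, and the `"neutral"` template label are all hypothetical, and each stage is a placeholder that only records that it ran, standing in for the learned networks the paper describes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for one video frame."""
    index: int
    history: Tuple[str, ...] = ()  # names of the stages applied so far


def _mark(frame: Frame, stage: str) -> Frame:
    # Return a copy of the frame with the stage name appended.
    return Frame(frame.index, frame.history + (stage,))


def canonicalize_expression(frames: List[Frame], template: str = "neutral") -> List[Frame]:
    # Stage 1 (placeholder): edit every frame toward the same expression template.
    return [_mark(f, f"expression:{template}") for f in frames]


def lip_sync(frames: List[Frame], audio: str) -> List[Frame]:
    # Stage 2 (placeholder): drive the mouth region from the given audio.
    if not audio:
        raise ValueError("an audio input is required for lip-sync")
    return [_mark(f, f"lip_sync:{audio}") for f in frames]


def enhance_faces(frames: List[Frame]) -> List[Frame]:
    # Stage 3 (placeholder): identity-aware enhancement and post-processing.
    return [_mark(f, "enhance") for f in frames]


def retalk(frames: List[Frame], audio: str) -> List[Frame]:
    # Run the three stages in order, with no user intervention between them.
    canonical = canonicalize_expression(frames)
    synced = lip_sync(canonical, audio)
    return enhance_faces(synced)
```

Calling `retalk([Frame(0), Frame(1)], "speech.wav")` returns two frames whose `history` lists the three stages in order. The point of the sketch is only the data flow: the lip-sync stage receives the canonical-expression video rather than the original frames.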
