CV | AI | May 9, 2024

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

arXiv:2405.05636v1 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the need for cost-effective, high-quality talking face generation for applications like video editing or virtual avatars, but it is incremental as it builds on existing latent space methods.

The paper tackles customized talking face generation by combining face swapping and lip synchronization. It proposes SwapTalk, a unified framework that performs both tasks in a shared latent space to reduce interference between them and improve clarity. It reports significant gains in video quality, lip sync accuracy, and identity consistency on the HDTF dataset.

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at http://swaptalk.cc.
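The abstract names an identity loss and an identity consistency metric over time but does not define either. As a rough illustration only, the sketch below assumes that identity is represented by an embedding vector from some face-recognition encoder, that the loss is one minus cosine similarity, and that temporal consistency is the mean cosine similarity between consecutive frames. All function names and both formulations are hypothetical and may differ from what the paper actually uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def identity_loss(source_emb: Vector, generated_emb: Vector) -> float:
    """Hypothetical identity loss: 1 - cosine similarity.

    Assumption: both embeddings come from the same (unspecified)
    face-recognition encoder. 0 means identical direction.
    """
    return 1.0 - cosine_similarity(source_emb, generated_emb)


def temporal_identity_consistency(frame_embs: Sequence[Vector]) -> float:
    """Hypothetical consistency score over a video.

    Assumption: mean cosine similarity between the identity embeddings
    of consecutive frames; higher means a more stable identity.
    """
    if len(frame_embs) < 2:
        raise ValueError("need at least two frames")
    sims = [
        cosine_similarity(prev, curr)
        for prev, curr in zip(frame_embs, frame_embs[1:])
    ]
    return sum(sims) / len(sims)
```

Under these assumptions, a video whose per-frame embeddings all point in the same direction scores 1.0, and any identity drift between neighbouring frames lowers the score.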

