CV · Jul 18, 2023

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

arXiv:2307.09368v3 · 14 citations · h-index: 34
Originality: Incremental advance
AI Analysis

This work addresses training instability and lip-sync errors in audio-driven talking face generation, with applications such as virtual avatars and video editing. It is an incremental improvement over existing methods.

The paper tackles unstable training, poor lip synchronization, and lip leaking from the identity reference. It introduces a silent-lip generator, a stabilized synchronization loss, and AVSyncNet; the authors report that the resulting model outperforms state-of-the-art methods in both visual quality and lip synchronization.

Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality, using given audio and reference video while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods. These involve unstable training, lip synchronization, and visual quality issues caused by lip-sync loss, SyncNet, and lip leaking from the identity reference. To address these issues, we first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage. We then introduce stabilized synchronization loss and AVSyncNet to overcome problems caused by lip-sync loss and SyncNet. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions and their cohesive effects.
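
For context, the conventional lip-sync loss that the abstract identifies as a source of problems can be sketched as follows. This is a minimal illustration assuming the formulation common in prior work such as Wav2Lip, where a pretrained SyncNet embeds an audio window and a video window and the loss is the negative log of their cosine similarity. It is not the paper's stabilized synchronization loss, which the abstract does not specify. The function names and the `eps` clamp are hypothetical choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lip_sync_loss(audio_emb, video_emb, eps=1e-7):
    """Conventional lip-sync loss (assumed Wav2Lip-style), NOT the paper's.

    The embeddings are assumed to be non-negative (e.g. after a ReLU), so
    the cosine similarity lies in [0, 1] and can be read as a probability
    that the audio and video windows are in sync.
    """
    p_sync = cosine_similarity(audio_emb, video_emb)
    # Clamp away from 0 and 1 so the logarithm stays finite (assumption).
    p_sync = min(max(p_sync, eps), 1.0 - eps)
    # Binary cross-entropy against the target "in sync" (label 1).
    return -math.log(p_sync)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings, for illustration only.
    print(lip_sync_loss([1.0, 1.0], [1.0, 0.0]))
```

Note that the loss grows without bound as the similarity approaches zero, which is why the sketch clamps `p_sync`. Whether this behaviour is among the instabilities the authors analyze is not stated in the abstract.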

