Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation
This work addresses lip synchronization in talking face generation for video synthesis applications. Its contribution is incremental: it builds on existing methods and adds new evaluation metrics.
The paper tackles the challenge of generating talking face videos with accurate lip synchronization while preserving visual quality. It introduces a method that uses an audio-visual speech representation expert (AV-HuBERT) both to compute a training loss and to derive evaluation metrics, and its experiments show the approach to be effective.
In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
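The abstract does not state how the lip synchronization loss is computed from the expert's features, so the following is only an illustrative sketch under stated assumptions. It assumes that a frozen expert has already produced one feature vector per frame for both the generated video and a reference signal, and that the loss is one minus the mean per-frame cosine similarity. The function names `cosine_similarity` and `lip_sync_loss` are hypothetical, and the feature extraction step is deliberately left out because AV-HuBERT's interface is not described here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat an all-zero vector as carrying no synchronization signal.
        return 0.0
    return dot / (norm_a * norm_b)


def lip_sync_loss(
    generated_feats: Sequence[Vector],
    reference_feats: Sequence[Vector],
) -> float:
    """Hypothetical sync loss: 1 - mean per-frame cosine similarity.

    Both inputs are assumed to be per-frame features from a frozen
    audio-visual expert (an assumption; not the paper's stated formula).
    """
    if not generated_feats or len(generated_feats) != len(reference_feats):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    sims = [
        cosine_similarity(g, r)
        for g, r in zip(generated_feats, reference_feats)
    ]
    return 1.0 - sum(sims) / len(sims)
```

Under this assumed form, the loss ranges from 0 (features perfectly aligned in every frame) to 2 (features pointing in opposite directions in every frame). In a real training setup the same computation would be written with a differentiable tensor library so that gradients can flow back to the generator.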