CV
May 7, 2024

Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation

arXiv:2405.04327v1
20 citations
h-index: 33
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Incremental advance
AI Analysis

This work addresses lip synchronization in talking face generation. The advance is incremental: it builds on existing generation methods, adding an AV-HuBERT-based training loss and new evaluation metrics.

The paper tackles the challenge of generating talking face videos with accurate lip synchronization while preserving visual quality. It uses an audio-visual speech representation expert (AV-HuBERT) to compute a lip synchronization loss during training and to define three new evaluation metrics; experiments and an ablation study support the approach.

In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
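The abstract does not spell out how the lip synchronization loss is computed, so the following is only a minimal sketch of one plausible form, under stated assumptions: per-frame audio and visual feature vectors have already been extracted (for example by a pretrained AV-HuBERT encoder, which is not shown here), and the loss is the mean cosine distance between time-aligned pairs. The names `cosine_similarity` and `lip_sync_loss` are hypothetical, and plain Python lists stand in for the real feature tensors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def lip_sync_loss(audio_feats: Sequence[Vector],
                  video_feats: Sequence[Vector]) -> float:
    """Mean cosine distance over time-aligned audio/visual feature pairs.

    ASSUMPTION: this is an illustrative loss form, not necessarily the
    one used in the paper. Lower values mean the generated lip region
    features agree more closely with the audio features.
    """
    if len(audio_feats) != len(video_feats):
        raise ValueError("audio and video sequences must have equal length")
    if not audio_feats:
        raise ValueError("feature sequences must be non-empty")
    distances = [
        1.0 - cosine_similarity(a, v)
        for a, v in zip(audio_feats, video_feats)
    ]
    return sum(distances) / len(distances)
```

With this form, perfectly aligned features give a loss near 0, orthogonal features give 1, and opposed features give 2. The same distance could in principle serve as an evaluation score, although the three metrics proposed in the paper are not described in the abstract.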

