CV · AI · Aug 3, 2024

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

arXiv:2408.01732v1 · 6 citations · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the challenge of producing realistic and synchronized virtual avatars for applications like film production and online conferences, representing an incremental improvement over existing methods.

The paper tackles the problem of generating high-quality, temporally coherent talking head videos from audio. It introduces a two-stage diffusion model that first generates facial landmarks from speech and then uses those landmarks as a condition during denoising, with the aim of improving lip synchronization and reducing mouth jitter. The authors report state-of-the-art performance in their experiments.

Audio-driven talking head generation is a significant and challenging task with applications in fields such as virtual avatars, film production, and online conferences. However, existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of the generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address these problems, we introduce a two-stage diffusion-based model. The first stage generates synchronized facial landmarks from the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, with the aim of reducing mouth jitter and generating high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.
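The two-stage flow described in the abstract can be illustrated with a toy sketch. It is not the authors' architecture: the function names (`generate_landmarks`, `toy_noise_predictor`, `denoise_frame`), the random projection standing in for the landmark generator, the residual standing in for a trained denoiser, and the linear noise schedule are all assumptions made for illustration. Only the overall shape (speech features to landmarks, then landmark-conditioned denoising) comes from the text.

```python
import math
import random


def generate_landmarks(audio_feats, n_points=4, seed=0):
    """Stage 1 stand-in: map per-frame audio features to 2D landmarks.

    A fixed random projection is used here in place of a trained model
    (hypothetical; the paper's first stage is itself diffusion-based).
    """
    rng = random.Random(seed)
    dim = len(audio_feats[0])
    weights = [[rng.uniform(-1, 1) for _ in range(dim)]
               for _ in range(2 * n_points)]  # (x, y) per landmark
    return [[sum(w * a for w, a in zip(row, frame)) for row in weights]
            for frame in audio_feats]


def toy_noise_predictor(x, t, cond):
    """Stand-in for a trained conditional denoiser (hypothetical).

    It returns the residual between the sample and the condition; a real
    model would be a neural network taking x, t, and the landmarks.
    """
    return [xi - ci for xi, ci in zip(x, cond)]


def denoise_frame(cond, steps=50, seed=0):
    """Stage 2 stand-in: DDPM-style ancestral sampling with a condition."""
    rng = random.Random(seed)
    # Assumed linear beta schedule from 1e-4 to 0.02.
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = [rng.gauss(0, 1) for _ in cond]  # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_noise_predictor(x, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of the reverse step given the predicted noise.
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:  # no noise is added on the final step
            x = [xi + math.sqrt(betas[t]) * rng.gauss(0, 1) for xi in x]
    return x


# Two frames of made-up 3-dimensional audio features.
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
landmarks = generate_landmarks(audio)
frames = [denoise_frame(lm) for lm in landmarks]
```

In this sketch each "frame" is just a vector the same size as its landmark condition; a real system would denoise image tensors and would typically also condition on a reference identity image.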

