CV | Sep 14, 2023

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

arXiv:2309.07509v1 | 10 citations | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating articulate-speaking faces for applications in video synthesis and human-computer interaction, representing an incremental improvement by integrating landmarks with diffusion models.

The paper tackles the problem of generating realistic talking faces by proposing DiffTalker, a model that uses audio and landmark co-driving to produce lifelike results, achieving superior performance in clarity and geometric accuracy without requiring additional audio-image alignment.

Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.

