AI · CV · LG · Dec 11, 2023

DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers

arXiv:2312.06400v1 · 7 citations · h-index: 2 · ICAART
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable, high-quality talking-head synthesis for applications such as virtual assistants, entertainment, and education, but it appears incremental because it builds on existing diffusion and transformer methods.

The paper tackles high-resolution talking-head synthesis by proposing DiT-Head, a diffusion-transformer pipeline conditioned on audio, and shows that it is competitive with existing methods in visual quality and lip-sync accuracy.

We propose a novel talking head synthesis pipeline called "DiT-Head", which is based on diffusion transformers and uses audio as a condition to drive the denoising process of a diffusion model. Our method is scalable and can generalise to multiple identities while producing high-quality results. We train and evaluate our proposed approach and compare it against existing methods of talking head synthesis. We show that our model can compete with these methods in terms of visual quality and lip-sync accuracy. Our results highlight the potential of our proposed approach to be used for a wide range of applications, including virtual assistants, entertainment, and education. For a video demonstration of the results and our user study, please refer to our supplementary material.
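The abstract says that audio acts as a condition driving the denoising process but does not spell out the mechanism. The sketch below illustrates one common way to do this in a transformer denoiser: image latent tokens attend to audio-feature tokens through cross-attention, and the resulting noise prediction drives a standard DDPM reverse step. Everything here is an assumption for illustration. The function names (`cross_attention`, `ddpm_step`, `sample`), the token shapes, the linear beta schedule, and the single-layer "noise predictor" (which, unlike a real DiT, ignores the timestep embedding) are hypothetical and are not taken from DiT-Head.

```python
import numpy as np

def cross_attention(x, audio, w_q, w_k, w_v):
    """Hypothetical conditioning: image tokens x (N, d) attend to audio tokens (M, d_a)."""
    q = x @ w_q                                   # queries from image tokens, (N, d_k)
    k = audio @ w_k                               # keys from audio tokens, (M, d_k)
    v = audio @ w_v                               # values from audio tokens, (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product scores, (N, M)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over audio tokens
    return x + weights @ v                        # residual connection

def ddpm_step(x_t, eps_hat, t, betas, rng):
    """Standard DDPM ancestral reverse step x_t -> x_{t-1} with variance beta_t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                               # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(audio, n_tokens, d, T, params, rng):
    """Start from Gaussian noise and denoise for T steps, conditioning every step on audio."""
    betas = np.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
    x = rng.standard_normal((n_tokens, d))
    for t in reversed(range(T)):
        h = cross_attention(x, audio, *params["attn"])
        eps_hat = h @ params["out"]               # toy noise predictor (random weights)
        x = ddpm_step(x, eps_hat, t, betas, rng)
    return x

# Usage with random weights and hypothetical sizes: 16 latent tokens of width 8,
# conditioned on 5 audio-feature frames of width 4.
rng = np.random.default_rng(0)
d, d_a, d_k = 8, 4, 8
params = {
    "attn": (rng.standard_normal((d, d_k)) * 0.1,
             rng.standard_normal((d_a, d_k)) * 0.1,
             rng.standard_normal((d_a, d)) * 0.1),
    "out": rng.standard_normal((d, d)) * 0.1,
}
audio = rng.standard_normal((5, d_a))
x0 = sample(audio, n_tokens=16, d=d, T=50, params=params, rng=rng)
print(x0.shape)  # (16, 8)
```

In a trained model the noise predictor would be a stack of transformer blocks learned with a noise-prediction loss, and the denoised latent would then be decoded to an image; the random weights here only exercise the shapes and the conditioning loop.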

