CV · Nov 27, 2025

AI killed the video star. Audio-driven diffusion model for expressive talking head generation

arXiv:2511.22488v1
Originality: Incremental advance
AI Analysis

This work addresses audio-driven generation of realistic talking heads for video synthesis. It appears incremental because it builds on existing diffusion models.

The authors propose Dimitra++, a framework that learns lip motion, facial expression, and head pose motion using a conditional Motion Diffusion Transformer. Their experiments on VoxCeleb2 and CelebV-HQ suggest that it outperforms existing approaches.

We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
