Talking Head Generation via AU-Guided Landmark Prediction
This work addresses the generation of realistic, expressive talking head videos for applications such as virtual avatars and video editing, and offers fine-grained control over facial expressions. The contribution is incremental: it builds on prior methods by adding explicit AU modeling.
The paper tackles audio-driven talking head generation with a two-stage framework that uses explicit facial Action Units (AUs) to guide landmark prediction. The authors report improved expression accuracy, temporal stability, and visual realism, and experiments on the MEAD dataset show the method outperforming state-of-the-art baselines across multiple metrics.
We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.
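The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration of the interfaces only, not the authors' implementation. The abstract gives no dimensions or layer types, so every constant, class name, and function name here is a hypothetical placeholder: a single random linear layer stands in for the variational motion generator, and copying the reference image stands in for the diffusion-based synthesizer.

```python
import numpy as np

# All sizes below are assumptions for illustration; the abstract does not state them.
NUM_LANDMARKS = 68   # assumed 2D landmark count
NUM_AUS = 17         # assumed number of Action Unit intensities per frame
AUDIO_DIM = 80       # assumed per-frame audio feature size
LATENT_DIM = 16      # assumed latent size for the variational generator


class ToyMotionGenerator:
    """Stage 1 stand-in: (audio, AU intensities, latent z) -> 2D landmarks."""

    def __init__(self, rng):
        in_dim = AUDIO_DIM + NUM_AUS + LATENT_DIM
        # A random linear map replaces the learned decoder.
        self.weight = rng.normal(scale=0.01, size=(in_dim, NUM_LANDMARKS * 2))
        self.bias = np.zeros(NUM_LANDMARKS * 2)

    def __call__(self, audio, aus, rng):
        num_frames = audio.shape[0]
        # Sample z from a standard normal prior, as a VAE typically does at inference.
        z = rng.standard_normal((num_frames, LATENT_DIM))
        features = np.concatenate([audio, aus, z], axis=1)
        flat = features @ self.weight + self.bias
        return flat.reshape(num_frames, NUM_LANDMARKS, 2)


def toy_synthesizer(landmarks, reference_image):
    """Stage 2 stand-in: one frame per landmark set.

    A real system would run a landmark-conditioned diffusion model here;
    this placeholder only repeats the reference image to show the shapes.
    """
    num_frames = landmarks.shape[0]
    return np.repeat(reference_image[None], num_frames, axis=0)


def generate(audio, aus, reference_image, seed=0):
    """Run both stages: motion (landmarks) first, then appearance (frames)."""
    rng = np.random.default_rng(seed)
    motion_generator = ToyMotionGenerator(rng)
    landmarks = motion_generator(audio, aus, rng)
    frames = toy_synthesizer(landmarks, reference_image)
    return landmarks, frames
```

Because the AU intensities enter stage 1 as a per-frame input, changing them for a single frame changes only the landmark prediction for that frame, while stage 2 sees landmarks and the reference image but never the AUs directly. This is one way to read the separation of motion and appearance that the abstract describes.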