CV | Feb 2, 2025

EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis

arXiv:2502.00654v1 | 8 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating emotionally expressive talking heads for applications like virtual avatars or entertainment, though it is incremental as it builds on existing 3D Gaussian splatting techniques.

The paper tackles the limited emotional diversity of 3D Gaussian splatting-based talking head synthesis. It proposes EmoTalkingGaussian, which manipulates facial emotions using continuous emotion values while keeping lip movements synchronized with the audio, and reports better results than state-of-the-art methods on image quality, emotion expression, and lip synchronization metrics.

3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, because it is typically trained on only a short video that lacks diversity in facial emotions, the resulting talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. The model can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while retaining synchronization of lip movements with the input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).
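Two of the metrics named in the abstract can be illustrated with a minimal sketch. It assumes that PSNR is computed in the standard way from the mean squared error between pixel values, and that V-RMSE and A-RMSE are plain root-mean-square errors between target and predicted valence or arousal values. The abstract does not define these metrics, so the function names `psnr` and `rmse`, the 255 peak value, and the sample numbers below are illustrative assumptions, not the authors' evaluation code.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences.

    Assumption: pixel intensities lie in [0, max_value].
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse(targets, predictions):
    """Root-mean-square error, e.g. between target and predicted valence values.

    Assumption: this is how V-RMSE / A-RMSE would be computed per sequence.
    """
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))


# Hypothetical values, for illustration only.
frame_psnr = psnr([10, 200, 35, 90], [12, 198, 35, 91])
valence_rmse = rmse([0.6, -0.2, 0.1], [0.5, -0.1, 0.1])
```

Higher PSNR indicates a rendered frame closer to the reference, while lower RMSE indicates that the emotion expressed is closer to the requested valence or arousal value.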

