AI | CV | Dec 3, 2024

When Words Smile: Generating Diverse Emotional Facial Expressions from Text

arXiv:2412.02508v4 | 2 citations | h-index: 9 | EMNLP
Originality: Highly original
AI Analysis

The paper addresses the need for rich emotional expression in digital humans, with applications such as dialogue systems and gaming, and proposes a novel method for a known bottleneck.

The paper tackles the problem of generating diverse and emotionally coherent facial expressions from text, which existing talking head synthesis work tends to overlook. It introduces a text-to-expression model and a new dataset, EmoAva, and reports that the model significantly outperforms baselines across multiple evaluation metrics on both existing datasets and EmoAva.

Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.

Code Implementations: 1 repo
