HC
Jun 1, 2018

Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks

arXiv:1806.00154v1
52 citations
Originality: Incremental advance
AI Analysis

This work improves the naturalness of virtual agents by enabling speech-driven, emotion-aware lip synthesis, though it is incremental as it builds on existing GAN frameworks.

The paper tackles generating expressive lip movements from speech without transcripts by modeling the interplay between articulation and emotion. It outperforms three state-of-the-art baselines in objective and subjective evaluations. The emotion-adapted model (CSG-Emo-Adapted) further improves trajectory accuracy over the base CSG model and is rated significantly better in subjective evaluations when the target emotion is happiness.

Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion and lexical content in a principled manner. This model uses a set of articulatory and emotional features directly extracted from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in terms of objective and subjective evaluations. When the target emotion is known, we propose to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted), or adding emotional conditions as the input of the model (CSG-Emo-Aware). Objective evaluations of these models show improvements for the CSG-Emo-Adapted compared with the CSG model, as the trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.

