Speech Driven Talking Face Generation from a Single Image and an Emotion Condition
This work aims to improve emotional realism in audiovisual communication for applications such as virtual assistants and entertainment. The contribution is incremental, since it builds on existing talking face generation methods.
The authors generate talking face videos from speech and a single face image while rendering a specified visual emotion expression, and the system outperforms a state-of-the-art baseline in both objective and subjective evaluations. In a pilot study, they also found that humans rely more on visual than on audio cues for emotion recognition when the emotions conveyed by the two modalities are mismatched.
Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input and renders a talking face video that is synchronized with the speech and expresses the conditioned emotion. Objective evaluation of image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a human emotion recognition pilot study using generated videos in which the emotions of the audio and visual modalities are mismatched. Results show that humans respond more strongly to the visual modality than to the audio modality on this task.