CV · Aug 20, 2025

Taming Transformer for Emotion-Controllable Talking Face Generation

arXiv:2508.14359v1 · 3.6 · h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses the problem of generating realistic emotional talking faces for applications such as virtual avatars and video editing. It appears incremental, since it builds on existing transformer-based approaches.

The paper tackles emotion-controllable talking face generation with a method that combines pre-training strategies, emotion-anchor representations, and an autoregressive transformer to synthesize identity-preserving emotional videos from audio. The authors report superior qualitative and quantitative results on the MEAD dataset.

Talking face generation is a novel and challenging task that aims to synthesize a vivid speaking-face video from a given audio clip. To achieve emotion-controllable talking face generation, current methods need to overcome two challenges: one is how to effectively model the multimodal relationship related to a specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method that tackles the emotion-controllable talking face generation task in a discrete manner. Specifically, we employ two pre-training strategies to disentangle audio into independent components and to quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation, which integrates emotional information into the visual tokens. Finally, we introduce an autoregressive transformer that models the global distribution of the visual tokens under the given conditions and predicts the index sequence used to synthesize the manipulated videos. We conduct experiments on the MEAD dataset, controlling the emotion of videos conditioned on multiple emotional audio clips. Extensive experiments demonstrate the superiority of our method both qualitatively and quantitatively.
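
The two discrete steps the abstract mentions, quantizing features into visual-token indices and predicting an index sequence autoregressively, can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-neighbor codebook lookup, the greedy decoding loop, and the names `quantize`, `decode_autoregressive`, and `toy_logits` are all assumptions made for illustration, and `toy_logits` is a hand-written stand-in for a trained transformer.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). Assumed scheme, not taken from the paper."""
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices


def decode_autoregressive(next_token_logits, condition, length):
    """Greedy autoregressive decoding: each new index is chosen from scores
    that depend on the condition and on all previously generated indices."""
    tokens = []
    for _ in range(length):
        logits = next_token_logits(condition, tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Toy codebook of three 2-D "visual tokens" and three feature vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
indices = quantize(features, codebook)


def toy_logits(condition, prefix):
    # Hypothetical stand-in for a trained model: always favors the index
    # that follows the previous one (or the condition, at the first step).
    prev = prefix[-1] if prefix else condition
    target = (prev + 1) % 3
    return [1.0 if k == target else 0.0 for k in range(3)]


generated = decode_autoregressive(toy_logits, condition=0, length=4)
print(indices, generated)
```

In a real system the condition would carry the audio and emotion information and the predicted indices would be passed to a decoder that reconstructs video frames; here the condition is reduced to a single integer to keep the sketch self-contained.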
