MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control
This work addresses the challenge of generating realistic, temporally stable talking faces for digital media and virtual avatars, and represents an incremental improvement over existing methods.
The paper tackles the problem of generating temporally consistent and customizable talking faces from audio, proposing MAGIC-Talk, a one-shot diffusion-based framework that outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy.
Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.
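The abstract does not specify how progressive latent fusion is computed. As a rough illustration only, the sketch below shows one common way to reduce flicker at segment boundaries in long-form generation: generate overlapping segments of latents and cross-fade them linearly over the shared frames. The function names, the flattened-list latent representation, and the linear weighting are all assumptions made for this example, not the method described in the paper.

```python
from typing import List

# Hypothetical representation: one frame's latent as a flat list of floats.
Latent = List[float]


def crossfade_weights(overlap: int) -> List[float]:
    """Weights given to the incoming segment across the overlap.

    The weights rise linearly and stay strictly between 0 and 1, so both
    segments contribute to every overlapping frame.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


def fuse_segments(segments: List[List[Latent]], overlap: int) -> List[Latent]:
    """Join consecutive latent segments that share `overlap` frames.

    Assumption (illustrative only): the last `overlap` frames of one segment
    and the first `overlap` frames of the next describe the same time steps.
    """
    if not segments:
        return []
    if overlap < 0 or any(len(seg) <= overlap for seg in segments):
        raise ValueError("each segment must be longer than the overlap")

    fused = [list(frame) for frame in segments[0]]
    weights = crossfade_weights(overlap)

    for seg in segments[1:]:
        start = len(fused) - overlap
        # Blend the shared frames: earlier frames favour the previous
        # segment, later frames favour the incoming one.
        for i, w in enumerate(weights):
            prev, nxt = fused[start + i], seg[i]
            fused[start + i] = [(1 - w) * p + w * n for p, n in zip(prev, nxt)]
        # Append the frames of the incoming segment that are not shared.
        fused.extend(list(frame) for frame in seg[overlap:])
    return fused


if __name__ == "__main__":
    seg_a = [[0.0] for _ in range(4)]
    seg_b = [[1.0] for _ in range(4)]
    for frame in fuse_segments([seg_a, seg_b], overlap=2):
        print(frame)
```

With two 4-frame segments and an overlap of 2, the result has 6 frames, and the two shared frames move gradually from the first segment's values toward the second's instead of switching abruptly at the boundary.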