SD CV MM AS | Jul 25, 2025

Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

arXiv:2507.19225v14.01 citations | h-index: 1 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses face-voice consistency in text-driven talking face generation. The advance is incremental: it extends existing speech-driven methods to a more challenging setting.

The paper generates both a talking face animation and the corresponding speech from a face image and text, targeting face-voice mismatch, and reports state-of-the-art visual and audio performance on a single 40 GB GPU.

Recent studies in speech-driven talking face generation achieve promising results, but their reliance on a fixed driving speech limits further applications (e.g., it can cause face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generate both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice over the paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing diversity and identity consistency. Experiments show Face2VoiceSync achieves state-of-the-art visual and audio performance on a single 40GB GPU.
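
The abstract gives no architectural details, so the following is only a minimal sketch of the general idea behind contributions 2) and 3): a small VAE-style module that maps a frozen face embedding to a Gaussian latent and decodes a sample into a voice-style embedding, so that different samples yield different voices for the same face. It is not the authors' implementation. The class name `TinyBridgeVAE`, the dimensions, the linear encoder and decoder, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class TinyBridgeVAE:
    """Hypothetical bridge: face embedding -> latent -> voice-style embedding.

    Only these few matrices would be trainable; the large face and speech
    models on either side are assumed to stay frozen.
    """

    def __init__(self, face_dim, latent_dim, voice_dim, seed=0):
        rng = random.Random(seed)
        self.w_mu = init_matrix(latent_dim, face_dim, rng)
        self.b_mu = [0.0] * latent_dim
        self.w_logvar = init_matrix(latent_dim, face_dim, rng)
        self.b_logvar = [0.0] * latent_dim
        self.w_dec = init_matrix(voice_dim, latent_dim, rng)
        self.b_dec = [0.0] * voice_dim

    def encode(self, face_emb):
        """Return the mean and log-variance of the latent Gaussian."""
        mu = linear(self.w_mu, self.b_mu, face_emb)
        logvar = linear(self.w_logvar, self.b_logvar, face_emb)
        return mu, logvar

    def sample_voice(self, face_emb, rng):
        """Draw one voice-style embedding conditioned on the face embedding."""
        mu, logvar = self.encode(face_emb)
        z = reparameterize(mu, logvar, rng)
        return linear(self.w_dec, self.b_dec, z)


if __name__ == "__main__":
    model = TinyBridgeVAE(face_dim=4, latent_dim=2, voice_dim=3)
    face = [0.1, 0.2, 0.3, 0.4]  # placeholder face embedding
    for seed in (1, 2):
        print(model.sample_voice(face, random.Random(seed)))
```

Sampling with different random seeds gives different voice-style embeddings for the same face, which is the sense in which a latent space supports diversity. In training, the KL term would be added to a reconstruction loss to keep the latent close to a standard normal.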
