PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis
This work addresses lip sync accuracy in talking head synthesis, which matters for applications such as virtual avatars and video generation. It is an incremental contribution that builds on existing pre-trained acoustic models.
The paper addresses inaccurate and unstable lip motion in talking head synthesis, which arises from phoneme-viseme alignment ambiguity. It proposes PASE, a phoneme-aware speech encoder that uses phoneme embeddings as alignment anchors together with a contrastive alignment module. PASE achieves state-of-the-art lip sync accuracy, outperforming conventional acoustic-feature methods by 13.7% with NeRF-based rendering and 14.2% with 3DGS-based rendering.
Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lip motion. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability between corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show that PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve lip sync accuracy without architectural modifications.
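To make the idea of contrastive audio-visual alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. The abstract does not specify the loss that PASE uses, so the choice of InfoNCE, the function names, and the temperature value are all assumptions made for illustration. The phoneme-embedding anchors and the prediction and reconstruction task are not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Hypothetical contrastive loss: row i of audio_emb is assumed to be
    the positive match for row i of visual_emb; all other rows in the
    batch act as negatives. The temperature is an assumed value."""
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # Audio-to-visual: each audio row should pick out its own visual row.
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Visual-to-audio: each visual column should pick out its own audio row.
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)
```

With this formulation, a batch whose audio and visual embeddings are correctly paired yields a lower loss than the same batch with the visual rows shifted out of order, which is the sense in which such a module sharpens the distinction between corresponding and non-corresponding pairs.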