CV · Jan 26

Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding

arXiv:2601.18849v1 · 2024 IEEE Smart World Congress (SWC)
Originality: Incremental advance
AI Analysis

This work addresses a specific problem in audio-driven talking face generation: capturing mouth movements accurately. It is likely an incremental improvement, with applications in virtual avatars or entertainment.

The paper tackles the challenge of accurately capturing mouth movements in talking portraits by proposing an automatic method based on blink embedding and hash grid landmarks encoding. The authors report that their experiments validate the improved fidelity.

Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advances in rendering speed and generation quality, challenges persist in accurately and efficiently capturing mouth movements in talking portraits. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmarks encoding, which can substantially enhance the fidelity of talking faces. Specifically, we leverage facial features encoded as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations have validated the superiority of our approach over existing methods.
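
The abstract does not spell out how landmarks are hash-grid encoded, so the following is only a minimal sketch of the general technique, assuming an Instant-NGP-style multi-resolution hash grid with trilinear interpolation. It is not the authors' implementation. The class name `HashGridEncoder`, all parameter values, and the assumption that landmarks are normalized to the unit cube are hypothetical choices made for illustration.

```python
import itertools
import math
import random

# Per-axis multipliers from the Instant-NGP spatial hash (the first axis uses 1).
PRIMES = (1, 2654435761, 805459861)


class HashGridEncoder:
    """Hypothetical multi-resolution hash grid over points in [0, 1]^3."""

    def __init__(self, n_levels=4, table_size=4096, feat_dim=2,
                 base_res=4, growth=2.0, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.feat_dim = feat_dim
        # Grid resolution grows geometrically from coarse to fine.
        self.resolutions = [int(base_res * growth ** lvl)
                            for lvl in range(n_levels)]
        # One table of small random feature vectors per level; in a real
        # model these entries would be learned, not left random.
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(feat_dim)]
             for _ in range(table_size)]
            for _ in range(n_levels)
        ]

    def _slot(self, vertex):
        """Hash an integer grid vertex (x, y, z) to a table index."""
        h = 0
        for coord, prime in zip(vertex, PRIMES):
            h ^= coord * prime
        return h % self.table_size

    def encode(self, point):
        """Return the concatenated per-level features for one 3D point."""
        features = []
        for table, res in zip(self.tables, self.resolutions):
            scaled = [p * res for p in point]
            base = [math.floor(s) for s in scaled]
            frac = [s - b for s, b in zip(scaled, base)]
            level_feat = [0.0] * self.feat_dim
            # Trilinear interpolation over the 8 corners of the enclosing cell.
            for corner in itertools.product((0, 1), repeat=3):
                weight = 1.0
                for c, f in zip(corner, frac):
                    weight *= f if c else 1.0 - f
                vertex = [b + c for b, c in zip(base, corner)]
                entry = table[self._slot(vertex)]
                for i in range(self.feat_dim):
                    level_feat[i] += weight * entry[i]
            features.extend(level_feat)
        return features


encoder = HashGridEncoder()
landmark = (0.42, 0.57, 0.31)  # hypothetical normalized landmark position
code = encoder.encode(landmark)
print(len(code))  # 8 = n_levels * feat_dim
```

In a setup like the one the abstract describes, per-landmark codes of this kind could serve as the conditional features, with audio-derived features added as residuals. How the paper actually combines them inside the Dynamic Landmark Transformer is not stated here.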
