GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting
GSTalker addresses the slow training and rendering of previous talking face generation methods, which matters for applications such as virtual avatars and video conferencing, though the contribution appears incremental because it builds on existing 3D modeling frameworks.
The paper tackles real-time audio-driven talking face generation by introducing GSTalker, which uses deformable Gaussian splatting to achieve fast training (40 minutes) and real-time rendering (125 FPS) with high-fidelity, audio-lips synchronized results.
We present GSTalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3$\sim$5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field to translate and transform 3D Gaussians to synchronize with audio information, in which a multi-resolution hashing grid-based tri-plane and a temporal smoothing module are incorporated to learn accurate deformation for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the conditional Gaussian deformation field, we initialize 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments on person-specific videos with audio tracks validate that GSTalker can generate high-fidelity and audio-lips synchronized results with fast training and real-time rendering speed.
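The idea of an audio-conditioned deformation field acting on a static set of Gaussians can be illustrated with a toy sketch. This is not the paper's architecture: the `Gaussian` class, `smooth_audio`, `toy_deformation`, and `deform` are hypothetical names, the field is a fixed hand-written rule rather than a learned network over a hash-grid tri-plane, and the moving average only stands in for the temporal smoothing module. It assumes the first audio feature encodes mouth openness.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical stand-in for one 3D Gaussian (position and isotropic scale only)."""
    position: Vec3
    scale: float


def smooth_audio(features: Sequence[Sequence[float]], window: int = 3) -> List[List[float]]:
    """Centered moving average over time (assumed stand-in for temporal smoothing)."""
    half = window // 2
    smoothed = []
    for t in range(len(features)):
        lo, hi = max(0, t - half), min(len(features), t + half + 1)
        frames = features[lo:hi]
        # Average each feature dimension over the frames in the window.
        smoothed.append([sum(col) / len(frames) for col in zip(*frames)])
    return smoothed


def toy_deformation(position: Vec3, audio: Sequence[float]) -> Tuple[Vec3, float]:
    """Toy field returning (position offset, scale offset).

    Assumption: audio[0] encodes mouth openness. A real model would learn
    this mapping; here points below y = 0 simply move down as openness rises.
    """
    openness = audio[0]
    weight = max(0.0, -position[1])
    return (0.0, -openness * weight, 0.0), 0.0


def deform(
    gaussians: Sequence[Gaussian],
    audio: Sequence[float],
    field: Callable[[Vec3, Sequence[float]], Tuple[Vec3, float]] = toy_deformation,
) -> List[Gaussian]:
    """Apply the field's offsets to every Gaussian of the static representation."""
    deformed = []
    for g in gaussians:
        (dx, dy, dz), ds = field(g.position, audio)
        x, y, z = g.position
        deformed.append(Gaussian((x + dx, y + dy, z + dz), g.scale + ds))
    return deformed


# Static (canonical) Gaussians: one on the upper face, one on the jaw.
canonical = [Gaussian((0.0, 0.5, 0.0), 0.1), Gaussian((0.0, -1.0, 0.0), 0.1)]
audio_track = smooth_audio([[0.0], [0.5], [1.0]])
frame = deform(canonical, audio_track[1])
```

In this sketch the static Gaussians play the role of the coarse initialization, and each frame is produced by offsetting them with the smoothed audio feature for that time step; the upper-face Gaussian stays fixed while the jaw Gaussian moves.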